We’ve been seeing the headlines for years: “Researchers find flaws in the algorithms used…” for nearly every use case for AI, including finance, health care, education, policing, and object identification. Most conclude that if the algorithm had only used the right data, been well vetted, or been trained to minimize drift over time, then the bias never would have happened. But the question isn’t whether a machine learning model will systematically discriminate against people; it’s who, when, and how.
There are several practical strategies that you can adopt to instrument, monitor, and mitigate bias through a disparate impact measure. For models that are used in production today, you can start by instrumenting and baselining the impact live. For analysis or models used in one-time or periodic decision making, you’ll benefit from all strategies except for live impact monitoring. And if you’re considering adding AI to your product, you’ll want to understand these initial and ongoing requirements to start on — and stay on — the right path.
To measure bias, you first need to define who your models are impacting. It’s instructive to consider this from two angles: from the perspective of your business and from that of the people impacted by algorithms. Both angles are important to define and measure, because your model will impact both.
Internally, your business team defines segments, products, and outcomes you’re hoping to achieve based on knowledge of the market, cost of doing business, and profit drivers. The people impacted by your algorithms can sometimes be the direct customer of your models but, more often than not, are the people impacted by customers paying for the algorithm. For example, in a case where numerous U.S. hospitals were using an algorithm to allocate health care to patients, the customers were the hospitals that bought the software, but the people impacted by the biased decisions of the model were the patients.
So how do you start defining “who”? First, internally be sure to label your data with various business segments so that you can measure the impact differences. For the people that are the subjects of your models, you’ll need to know what you’re allowed to collect, or at the very least what you’re allowed to monitor. In addition, keep in mind any regulatory requirements for data collection and storage in specific areas, such as in health care, loan applications, and hiring decisions.
Defining when you measure is just as important as defining who you’re impacting. The world changes both quickly and slowly, and your training data may contain micro and macro patterns that will shift over time. It isn’t enough to evaluate your data, features, or models only once, especially if you’re putting a model into production. Even static data or “facts” that we already know for certain change over time. In addition, models outlive their creators and often get used outside of their originally intended context. Therefore, even if all you have is the outcome of a model (e.g., an API that you’re paying for), it’s important to record impact continuously, each time your model provides a result.
To mitigate bias, you need to know how your models are impacting your defined business segments and people. Models are actually built to discriminate: they decide who is likely to pay back a loan, who is qualified for the job, and so on. A business segment can often make or save more money by favoring only some groups of people. The legal and ethical problem is that these proxy business measurements can discriminate against people in protected classes by encoding information about their protected class into the features the models learn from. You can consider both segments and people as groups, because you measure them in the same way.
To understand how groups are impacted differently, you’ll need to have labeled data on each of them to calculate disparate impact over time. For each group, first calculate the favorable outcome rate over a time window: What share of the group received a positive outcome? Then compare each group to another related group: divide the underprivileged group’s favorable outcome rate by the privileged group’s rate to get the disparate impact.
Here’s an example: If you are collecting binary gender data for hiring, and 20% of women are hired but 90% of men are hired, the disparate impact would be 0.2 divided by 0.9, or about 0.22.
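The arithmetic in that example can be written down as a minimal Python sketch. The function names and the applicant counts are assumptions made for illustration; only the 20% and 90% rates come from the example.

```python
def favorable_rate(favorable: int, total: int) -> float:
    """Share of a group that received the favorable outcome in a time window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return favorable / total


def disparate_impact(unprivileged_rate: float, privileged_rate: float) -> float:
    """Unprivileged group's favorable rate divided by the privileged group's."""
    if privileged_rate <= 0:
        raise ValueError("privileged_rate must be positive")
    return unprivileged_rate / privileged_rate


# Hypothetical counts, chosen only to reproduce the 20% and 90% rates above.
women_rate = favorable_rate(favorable=20, total=100)
men_rate = favorable_rate(favorable=90, total=100)
ratio = disparate_impact(women_rate, men_rate)
print(round(ratio, 2))  # 0.22
```

A ratio of 1.0 would mean both groups receive the favorable outcome at the same rate.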
You’ll want to record all three of these values (the two favorable outcome rates and the resulting disparate impact), per group comparison, and alert someone about the disparate impact. The numbers then need to be put in context: in other words, what should the number be? You can apply this method to any group comparison; for a business segment, it may be private hospitals versus public hospitals, or for a patient group, it may be Black versus Indigenous.
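One way to keep those values together is a small record per group comparison, plus a check that decides whether to alert. This is a sketch under assumptions: `ImpactRecord`, `build_record`, and `needs_alert` are hypothetical names, and the bounds below are placeholders rather than a recommendation, since the acceptable range depends on context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactRecord:
    """The values to keep for one group comparison in one time window."""
    comparison: str            # e.g. "women vs. men"
    unprivileged_rate: float   # favorable outcome rate, unprivileged group
    privileged_rate: float     # favorable outcome rate, privileged group
    disparate_impact: float    # unprivileged_rate / privileged_rate


def build_record(comparison: str, unprivileged_rate: float,
                 privileged_rate: float) -> ImpactRecord:
    if privileged_rate <= 0:
        raise ValueError("privileged_rate must be positive")
    return ImpactRecord(comparison, unprivileged_rate, privileged_rate,
                        unprivileged_rate / privileged_rate)


def needs_alert(record: ImpactRecord, low: float, high: float) -> bool:
    """True when the disparate impact falls outside the range [low, high]."""
    return not (low <= record.disparate_impact <= high)


# Placeholder bounds (an assumption): set these from your own context.
ASSUMED_LOW, ASSUMED_HIGH = 0.8, 1.25

record = build_record("women vs. men", 0.2, 0.9)
if needs_alert(record, ASSUMED_LOW, ASSUMED_HIGH):
    print(f"ALERT {record.comparison}: "
          f"disparate impact {record.disparate_impact:.2f}")
```

Checking both a lower and an upper bound flags a comparison whichever group is being favored.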
Once you know who can be impacted, that the impact changes over time, and how to measure it, there are practical strategies for getting your system ready to mitigate bias.
The figure below is a simplified diagram of an ML system with data, features, a model, and a person you’re collecting the data on in the loop. You might have this entire system within your control, or you may buy software or services for various components. You can split out ideal scenarios and mitigating strategies by the components of the system: data, features, model, impacted person.
In an ideal world, your dataset is a large, labeled, and event-based time series. This allows for:
- Training and testing over several time windows
- Creating a baseline of disparate impact measure over time before release
- Updating features and your model to respond to changes in people
- Preventing future data from leaking into training
- Monitoring the statistics of your incoming data to get an alert when the data drifts
- Auditing when disparate impact is outside of acceptable ranges
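As an illustration of the baseline item in the list above, the sketch below buckets labeled events by calendar month and computes the disparate impact for each window. The event layout (date, group label, favorable flag), the monthly window, and the group names are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def windowed_disparate_impact(events, unprivileged, privileged):
    """Return {(year, month): disparate impact, or None when it is undefined}."""
    # counts[window][group] holds [favorable, total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for day, group, favorable in events:
        window = (day.year, day.month)
        counts[window][group][0] += int(favorable)
        counts[window][group][1] += 1

    baseline = {}
    for window in sorted(counts):
        u_fav, u_total = counts[window][unprivileged]
        p_fav, p_total = counts[window][privileged]
        if u_total == 0 or p_fav == 0:
            baseline[window] = None  # no data, or a zero denominator
        else:
            baseline[window] = (u_fav / u_total) / (p_fav / p_total)
    return baseline


# Hypothetical events: (event date, group label, favorable outcome?)
events = [
    (date(2024, 1, 5), "group_a", True),
    (date(2024, 1, 9), "group_a", False),
    (date(2024, 1, 12), "group_b", True),
    (date(2024, 1, 20), "group_b", True),
    (date(2024, 2, 3), "group_a", True),
    (date(2024, 2, 8), "group_b", True),
    (date(2024, 2, 15), "group_b", False),
]
baseline = windowed_disparate_impact(events, "group_a", "group_b")
print(baseline)  # {(2024, 1): 0.5, (2024, 2): 2.0}
```

Running this over historical data before release gives a per-window baseline to compare live numbers against.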
If, however, you have relational data that is powering your features, or you are acquiring static data to augment your event-based data set, you’ll want to:
- Snapshot your data before updating
- Use batch jobs to update your data
- Create a schedule for evaluating features downstream
- Monitor disparate impact over time live
- Put impact measures into context of external sources where possible
Ideally, the data that your data scientists have access to so they can engineer features should contain anonymized labels of who you’ll validate disparate impact on (i.e., the business segment labels and people features). This allows data scientists to:
- Ensure model training sets include enough samples across segments and people groups to accurately learn about groups
- Create test and validation sets that reflect the population distribution by volume that your model will encounter to understand expected performance
- Measure disparate impact on validation sets before your model is live
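For the first item in the list above, a simple starting point is to count training samples per group and flag any group that falls below a chosen minimum. This is only a sketch: the function name, the group labels, and the threshold are hypothetical, and the count that is “enough” depends on your model and data.

```python
from collections import Counter


def underrepresented_groups(group_labels, expected_groups, min_samples):
    """Return {group: count} for expected groups with fewer than min_samples rows."""
    counts = Counter(group_labels)
    # Counter returns 0 for missing keys, so absent groups are reported too.
    return {g: counts[g] for g in expected_groups if counts[g] < min_samples}


# Hypothetical segment labels attached to the rows of a training set.
training_labels = ["public"] * 5 + ["private"] * 2
flagged = underrepresented_groups(
    training_labels,
    expected_groups=["public", "private", "rural"],
    min_samples=3,
)
print(flagged)  # {'private': 2, 'rural': 0}
```

Passing the expected groups explicitly means a group with no samples at all is reported rather than silently skipped.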
If, however, you don’t have all of your segments or people features, you’ll need to skip to the model section below, because your data scientists can’t control for these variables if the labels aren’t available when they engineer the features.
With ideal event-based data and labeled feature scenarios, you’re able to:
- Train, test, and validate your model over various time windows
- Get an initial picture of the micro and macro shifts in the expected disparate impact
- Plan for when features and models will go stale based on these patterns
- Troubleshoot features that may reflect coded bias and remove them from training
- Iterate between feature engineering and model training to mitigate disparate impact before you release a model
Even for uninspectable models, having access to the entire pipeline allows for more granular levels of troubleshooting. However, if you have access only to a model API that you’re evaluating, you can:
- Feature-flag the model in production
- Record the inputs you provide
- Record the predictions your model would make
- Measure across segments and people until you’re confident in absorbing the responsibility of the disparate impact
In both cases, be sure to keep the monitoring live, and keep a record of the disparate impact over time.
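The feature-flag approach for a model you are still evaluating can be sketched as a “shadow” call: the existing decision is served, while the candidate model’s prediction is only recorded. Every name here (`decide`, `shadow_log`, the lambda models, and the feature keys) is hypothetical and stands in for whatever flagging and logging tools you already use.

```python
from typing import Any, Callable, Dict, List


def decide(
    features: Dict[str, Any],
    current_decision: Callable[[Dict[str, Any]], bool],
    candidate_model: Callable[[Dict[str, Any]], bool],
    shadow_enabled: bool,
    shadow_log: List[Dict[str, Any]],
) -> bool:
    """Serve the current decision; record what the candidate would have done."""
    served = current_decision(features)
    if shadow_enabled:
        shadow_log.append({
            "inputs": dict(features),  # the inputs you provided
            "shadow_prediction": candidate_model(features),  # recorded, not acted on
            "served_decision": served,
        })
    return served


shadow_log: List[Dict[str, Any]] = []
applicant = {"segment": "public_hospital", "score": 0.4}
served = decide(
    applicant,
    current_decision=lambda f: True,             # stand-in for today's process
    candidate_model=lambda f: f["score"] >= 0.5,  # stand-in for the model API
    shadow_enabled=True,
    shadow_log=shadow_log,
)
print(served, shadow_log[0]["shadow_prediction"])  # True False
```

Because the log keeps the inputs, the candidate’s prediction, and the served decision together, you can measure the candidate’s disparate impact across segments before it affects anyone.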
Ideally you’d be able to permanently store data about people, including personally identifiable information (PII). However, if you’re not allowed to permanently store demographic data about individuals:
- See if you’re allowed to anonymously aggregate impact data, based on demographic groups, at the time of prediction
- Put your model into production behind a feature flag to monitor how its decisions would have impacted various groups differently
- Continue to monitor over time and version the changes you make to your features and models
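If aggregation at prediction time is permitted, one possible shape is a counter that keeps only per-group totals for each time window and never stores an individual record. The class and method names below are assumptions for illustration, and whether this level of aggregation satisfies your obligations is something to confirm for your own situation.

```python
from collections import defaultdict


class AggregateImpactCounter:
    """Keeps only [favorable, total] per (window, group); no individual rows."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, window: str, group: str, favorable: bool) -> None:
        bucket = self._counts[(window, group)]
        bucket[0] += int(favorable)
        bucket[1] += 1

    def rate(self, window: str, group: str):
        """Favorable outcome rate, or None if nothing was recorded."""
        favorable, total = self._counts.get((window, group), (0, 0))
        return favorable / total if total else None


counter = AggregateImpactCounter()
# Called at prediction time; the demographic label is used once and discarded.
counter.record("2024-01", "group_a", True)
counter.record("2024-01", "group_a", False)
counter.record("2024-01", "group_b", True)

rate_a = counter.rate("2024-01", "group_a")
rate_b = counter.rate("2024-01", "group_b")
print(rate_a / rate_b)  # 0.5
```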
By monitoring inputs, decisions, and disparate impact numbers over time, continuously, you’ll still be able to:
- Get an alert when the value of disparate impact is outside of an acceptable range
- Understand if this is a one-time occurrence or a consistent problem
- More easily correlate what changed in your input and the disparate impact to better understand what might be happening
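To tell a one-time occurrence from a consistent problem, one simple heuristic is to look at how many consecutive time windows fall outside the acceptable range. The sketch below assumes that heuristic; the function name, the bounds, and the run length of three windows are all placeholder choices.

```python
def classify_breaches(ratios, low, high, consistent_after):
    """Label a series of per-window disparate impact values.

    Returns "ok" if every value is within [low, high], "consistent" if at
    least `consistent_after` consecutive windows are out of range, and
    "one-time" otherwise.
    """
    longest_run = current_run = 0
    for ratio in ratios:
        out_of_range = not (low <= ratio <= high)
        current_run = current_run + 1 if out_of_range else 0
        longest_run = max(longest_run, current_run)

    if longest_run == 0:
        return "ok"
    return "consistent" if longest_run >= consistent_after else "one-time"


# Hypothetical per-window values and placeholder bounds.
print(classify_breaches([0.9, 0.6, 0.95, 0.92], 0.8, 1.25, 3))  # one-time
print(classify_breaches([0.9, 0.7, 0.65, 0.6], 0.8, 1.25, 3))   # consistent
```

Pairing each flagged window with the versions of the features and model in use at the time makes it easier to see what changed.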
As models proliferate in every product we use, they will accelerate change and affect how frequently the data we collect and the models we build are out of date. Past performance isn’t always a predictor of future behavior, so be sure to continue to define who, when, and how you measure — and create a playbook of what to do when you find systematic bias, including who to alert and how to intervene.
Dr. Charna Parkey is a data science lead at Kaskada, where she works on the company’s product team to deliver a commercially available data platform for machine learning. She’s passionate about using data science to combat systemic oppression. She has over 15 years’ experience in enterprise data science and adaptive algorithms in the defense and startup tech sectors and has worked with dozens of Fortune 500 companies in her work as a data scientist. She earned her Ph.D. in Electrical Engineering at the University of Central Florida.