They discovered that all four models showed biased performance, providing less accurate diagnoses for patients in specific groups based on self-reported race, gender, and age. For example, the models struggled to differentiate lung cancer subtypes in African American and male patients, and breast cancer subtypes in younger patients. The models also had trouble detecting breast, renal, thyroid, and stomach cancer in certain demographic groups. These performance disparities occurred in around 29 percent of the diagnostic tasks the models performed.

This diagnostic inaccuracy, Yu said, happens because these models extract demographic information from the slides — and rely on demographic-specific patterns to make a diagnosis.

The results were unexpected “because we would expect pathology evaluation to be objective,” Yu added. “When evaluating images, we don’t necessarily need to know a patient’s demographics to make a diagnosis.”

The team wondered: Why didn’t pathology AI show the same objectivity?

Searching for explanations

The researchers landed on three explanations.

The first explanation is straightforward: Because it is easier to obtain samples from patients in certain demographic groups, the AI models are trained on unequal sample sizes. As a result, the models have a harder time making an accurate diagnosis in samples that aren’t well represented in the training set, such as those from minority groups based on race, age, or gender.

Yet “the problem turned out to be much deeper than that,” Yu said. The researchers noticed that sometimes the models performed worse in one demographic group, even when the sample sizes were comparable.

Additional analyses pointed to a second explanation: differential disease incidence. Some cancers are more common in certain groups, so the models become better at making a diagnosis in those groups. As a result, the models may have difficulty diagnosing cancers in populations where they aren’t as common.

Third, the AI models pick up on subtle molecular differences in samples from different demographic groups. For example, the models may detect mutations in cancer driver genes and use them as a proxy for cancer type, which makes them less effective at making a diagnosis in populations where these mutations are less common.

“We found that because AI is so powerful, it can differentiate many obscure biological signals that cannot be detected by standard human evaluation,” Yu said.

As a result, the models may learn signals that are more related to demographics than disease. That, in turn, could affect their diagnostic ability across groups.

Together, Yu said, these explanations suggest that bias in pathology AI stems not only from the variable quality of the training data but also from how researchers train the models.

Finding a fix

After assessing the scope and sources of the bias, Yu and his team wanted to fix it.

The researchers developed FAIR-Path, a simple framework based on an existing machine-learning concept called contrastive learning. Contrastive learning involves adding an element to AI training that teaches the model to emphasize the differences between essential categories — in this case, cancer types — and to downplay the differences between less crucial categories — here, demographic groups.
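The article does not spell out the training objective, but the core idea of contrastive learning can be illustrated in a few lines. The snippet below is a minimal sketch in PyTorch, not the published FAIR-Path implementation; the function name fairness_aware_contrastive_loss, the arguments cancer_labels, group_labels, and fairness_weight, and the specific penalty term are all hypothetical choices made for illustration. It pulls together embeddings of slides that share a cancer type and penalizes similarity that is explained only by a shared demographic group.

```python
# Minimal, illustrative sketch (not the published FAIR-Path code): a
# supervised-contrastive-style loss over cancer-type labels, plus a penalty
# that discourages embeddings from clustering by demographic group.
import torch
import torch.nn.functional as F

def fairness_aware_contrastive_loss(embeddings, cancer_labels, group_labels,
                                    temperature=0.1, fairness_weight=0.5):
    """embeddings: (N, D) features from a pathology encoder (hypothetical).
    cancer_labels: (N,) cancer-type labels, the categories to emphasize.
    group_labels: (N,) demographic labels, the categories to downplay."""
    z = F.normalize(embeddings, dim=1)          # unit-length features
    sim = z @ z.T / temperature                 # pairwise cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    # Pairs sharing a cancer type should be pulled together ...
    same_cancer = (cancer_labels.unsqueeze(0) == cancer_labels.unsqueeze(1)) & ~eye
    # ... while pairs sharing only a demographic group should not.
    same_group = (group_labels.unsqueeze(0) == group_labels.unsqueeze(1)) & ~eye

    # Standard supervised contrastive term over cancer-type positives.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')),
                                     dim=1, keepdim=True)
    pos_count = same_cancer.sum(1).clamp(min=1)
    contrastive = -(log_prob * same_cancer).sum(1) / pos_count

    # Fairness penalty: similarity driven by demographics alone
    # (same group, different cancer type) is pushed down.
    confound = same_group & ~same_cancer
    penalty = (sim * confound).sum(1) / confound.sum(1).clamp(min=1)

    return (contrastive + fairness_weight * penalty).mean()
```

In this sketch, fairness_weight controls how strongly demographic-only similarity is suppressed relative to the standard cancer-type contrastive term; the actual FAIR-Path framework may balance these goals differently.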

When the researchers applied the FAIR-Path framework to the models they’d tested, it reduced the diagnostic disparities by around 88 percent.

“We show that by making this small adjustment, the models can learn robust features that make them more generalizable and fairer across different populations,” Yu said.

The finding is encouraging, he added, because it suggests that bias can be reduced even without training the models on completely fair, representative data.

Next, Yu and his team are collaborating with institutions around the world to investigate the extent of bias in pathology AI in places with different demographics and clinical and pathology practices. They are also exploring ways to extend FAIR-Path to settings with limited sample sizes. Additionally, they would like to investigate how bias in AI contributes to demographic discrepancies in health care and patient outcomes.

Ultimately, Yu said, the goal is to create fair, unbiased pathology AI models that can improve cancer care by helping human pathologists quickly and accurately make a diagnosis.

“I think there’s hope that if we are more aware of and careful about how we design AI systems, we can build models that perform well in every population,” he said.
