The lack of transparency is one of the factors that led the team to focus on explainable AI techniques for medicine and science. Most AI is regarded as a "black box": the model is trained on massive datasets and produces predictions without anyone knowing precisely how it arrived at a given result. With explainable AI, researchers and practitioners are able to understand, in detail, how various inputs and their weights contributed to a model's output.
The team used these same techniques to evaluate the trustworthiness of models recently touted for appearing to accurately identify cases of COVID-19 from chest X-rays. Despite a number of published papers heralding the results, the researchers suspected that something else may have been happening inside the black box that led to the models' predictions.

Specifically, the team reasoned that these models would be prone to a condition known as "worst-case confounding," owing to the lack of training data available for such a new disease. This scenario increased the likelihood that the models would rely on shortcuts rather than learning the underlying pathology of the disease from the training data.
"Worst-case confounding is what allows an AI system to just learn to recognize datasets instead of learning any true disease pathology," said co-lead author Joseph Janizek, who is also a doctoral student in the Allen School and earning a medical degree at the UW. "It's what happens when all of the COVID-19 positive cases come from a single dataset while all of the negative cases are in another. And while researchers have come up with techniques to mitigate associations like this in cases where those associations are less severe, these techniques don't work in situations where you have a perfect association between an outcome such as COVID-19 status and a factor like the data source."
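The perfect association Janizek describes can be checked from image metadata alone. What follows is a minimal sketch, not the researchers' code: the function names, the source names `dataset_a` and `dataset_b`, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def source_label_counts(labels, sources):
    """Count images for each (source, label) pair."""
    return Counter(zip(sources, labels))

def is_perfectly_confounded(labels, sources):
    """True if there are at least two labels and every source holds exactly one."""
    labels_by_source = {}
    for source, label in zip(sources, labels):
        labels_by_source.setdefault(source, set()).add(label)
    has_both_classes = len(set(labels)) > 1
    return has_both_classes and all(len(v) == 1 for v in labels_by_source.values())

# Hypothetical toy metadata: 1 = COVID-19 positive, 0 = negative.
labels = [1, 1, 1, 0, 0, 0]
sources = ["dataset_a"] * 3 + ["dataset_b"] * 3

print(source_label_counts(labels, sources))
print(is_perfectly_confounded(labels, sources))
```

When the check returns true, knowing which dataset an image came from is enough to predict its label, so a model can score well without looking at the lungs at all.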
The team trained multiple deep convolutional neural networks on X-ray images from a dataset that replicated the approach used in the published papers. First they tested each model's performance on an internal set of images from that initial dataset that had been withheld from the training data. Then the researchers tested how well the models performed on a second, external dataset meant to represent new hospital systems.
While the models maintained their high performance when tested on images from the internal dataset, their accuracy was reduced by half on the second set. The researchers referred to this as a "generalization gap" and cited it as strong evidence that confounding factors were responsible for the models' predictive success on the initial dataset.
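The internal-versus-external comparison can be illustrated with a toy simulation. This is a sketch under loudly stated assumptions, not a reproduction of the study: each "image" is reduced to two made-up binary features, a source marker and a noisy pathology signal, and the "model" is a one-feature rule rather than a convolutional network. The 70% signal reliability and the sample sizes are arbitrary.

```python
import random

def make_samples(n, confounded, rng):
    """Return ((marker, signal), label) tuples of synthetic data."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Noisy stand-in for real pathology: matches the label about 70% of the time.
        signal = label if rng.random() < 0.7 else 1 - label
        # Source marker: equals the label when confounded, random otherwise.
        marker = label if confounded else rng.randint(0, 1)
        samples.append(((marker, signal), label))
    return samples

def accuracy(feature_index, samples):
    """Fraction of samples whose chosen feature equals the label."""
    return sum(x[feature_index] == y for x, y in samples) / len(samples)

def fit_stump(train):
    """Pick the single feature that best matches the training labels."""
    return max(range(2), key=lambda i: accuracy(i, train))

rng = random.Random(0)
train = make_samples(2000, confounded=True, rng=rng)
internal = make_samples(1000, confounded=True, rng=rng)    # held out, same source
external = make_samples(1000, confounded=False, rng=rng)   # stand-in for a new hospital

chosen = fit_stump(train)
internal_acc = accuracy(chosen, internal)
external_acc = accuracy(chosen, external)
gap = internal_acc - external_acc
print(f"feature={chosen} internal={internal_acc:.2f} "
      f"external={external_acc:.2f} gap={gap:.2f}")
```

Because the marker predicts the training labels perfectly, the rule selects it over the weaker pathology signal. It then looks flawless on the internal set and performs near chance on the external set, where the marker carries no information about the label.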