Study highlights reliability concerns with vision-language models in medical settings
May 20, 2025
by Gus Iversen, Editor in Chief
A new study from researchers at the Massachusetts Institute of Technology raises concerns about the use of vision-language models (VLMs) in high-stakes environments such as clinical diagnostics, where misunderstanding a single word — particularly a negation like “no” or “not” — can have serious consequences.
The study found that leading VLMs often fail to account for negation in image-captioning tasks, sometimes performing no better than random guessing. In medical scenarios, this failure could lead to incorrect image retrievals and potentially flawed diagnostic support.
For instance, a model might retrieve records of patients with an enlarged heart even when queried specifically for cases without that feature, a distinction that can significantly alter a diagnosis.
"Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences," said lead author Kumail Alhamoud, a graduate student at MIT.
To evaluate the issue, the researchers developed two benchmark tasks: one tested image retrieval accuracy when negated descriptors were added to captions, and another involved multiple-choice questions with subtle variations in caption wording. VLM performance dropped by roughly 25% on retrieval tasks involving negation, and the highest-scoring model achieved only 39% accuracy on the multiple-choice questions — with several models faring worse than random.
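The retrieval failure can be illustrated with a toy harness. The sketch below is hypothetical: a crude bag-of-words scorer stands in for a real VLM's image-text embedding, and text labels stand in for images, but it shows why a similarity measure with no notion of negation ranks exactly the wrong record highest.

```python
# Hypothetical sketch: a similarity scorer that ignores negation.
# The bag-of-words overlap stands in for a VLM's image-text embedding;
# the record captions stand in for images in a retrieval index.
def bow_score(query: str, caption: str) -> int:
    """Count shared words -- a scorer blind to 'no' vs. its absence."""
    q, c = set(query.lower().split()), set(caption.lower().split())
    return len(q & c)

records = {
    "patient_a": "chest x-ray showing an enlarged heart",
    "patient_b": "normal chest x-ray",
}

query = "chest x-ray with no enlarged heart"

# Rank records by similarity to the negated query.
best = max(records, key=lambda r: bow_score(query, records[r]))
print(best)  # patient_a -- the record WITH the enlarged heart wins
```

Because "enlarged" and "heart" overlap with the query regardless of the "no" in front of them, the negated query retrieves the very cases it was meant to exclude, mirroring the affirmation bias the researchers describe.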
The study attributes this weakness in part to what the researchers describe as “affirmation bias”: the tendency of VLMs to ignore negated terms and latch onto positive object labels, a behavior rooted in how these models are trained.
To mitigate this, the team created a new data set using large language models to generate captions with natural-sounding negations. Fine-tuning VLMs with these augmented data sets led to a 10% improvement in retrieval tasks and a 30% boost in multiple-choice performance.
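The shape of that augmented data can be sketched with a simple template generator. The study used large language models to produce natural-sounding negations; the fixed templates below are a simplified, hypothetical stand-in for that step, pairing each positive caption with variants that explicitly negate absent findings.

```python
# Hypothetical sketch of negation-augmented captions. The study used
# LLMs for natural-sounding negations; these fixed templates are only
# a stand-in to show the shape of the augmented training data.
def negate_caption(caption: str, absent_objects: list[str]) -> list[str]:
    """Pair a positive caption with variants negating absent findings."""
    templates = [
        "{cap}, with no {obj}",
        "{cap}; there is no {obj} visible",
    ]
    return [t.format(cap=caption, obj=obj)
            for obj in absent_objects for t in templates]

augmented = negate_caption(
    "a chest x-ray of a healthy adult",
    absent_objects=["enlarged heart", "pleural effusion"],
)
for cap in augmented:
    print(cap)
# e.g. "a chest x-ray of a healthy adult, with no enlarged heart"
```

Fine-tuning on caption pairs like these forces the model to distinguish a caption from its negated variant, which is the signal ordinary web-scraped captions rarely provide.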
Still, the authors acknowledge that the solution is preliminary. “We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem,” Alhamoud said.
The research, coauthored with collaborators from OpenAI, Oxford University, and MIT’s Computer Science and Artificial Intelligence Laboratory, will be presented at the Conference on Computer Vision and Pattern Recognition.