by Gus Iversen, Editor in Chief | May 20, 2025
A new study from researchers at the Massachusetts Institute of Technology raises concerns about the use of vision-language models (VLMs) in high-stakes environments such as clinical diagnostics, where misunderstanding a single word — particularly a negation like “no” or “not” — can have serious consequences.
The study found that leading VLMs often fail to account for negation in image-captioning tasks, sometimes performing no better than random guessing. In medical scenarios, this failure could lead to incorrect image retrievals and potentially flawed diagnostic support.
For instance, a model might retrieve records of patients with an enlarged heart even when queried specifically for cases without that feature, a distinction that can significantly alter a diagnosis.

"Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences," said lead author Kumail Alhamoud, a graduate student at MIT.
To evaluate the issue, the researchers developed two benchmark tasks: one tested image retrieval accuracy when negated descriptors were added to captions, and another involved multiple-choice questions with subtle variations in caption wording. VLM performance dropped by roughly 25% on retrieval tasks involving negation, and the highest-scoring model achieved only 39% accuracy on the multiple-choice questions — with several models faring worse than random.
The study attributes this weakness in part to what the researchers describe as “affirmation bias”: the tendency of VLMs to ignore negated terms and latch onto positive object labels. The behavior is rooted in how these models are trained, since the image-caption pairs they learn from describe what a picture contains, not what it lacks.
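To make the failure mode concrete, the following sketch (not the study's benchmark code) probes a publicly available CLIP-style model with an affirmative and a negated caption for the same image. The checkpoint, the dummy image, and the captions are illustrative assumptions; if the model exhibits affirmation bias, the two similarity scores come out nearly identical even though the captions contradict each other.

```python
# Minimal sketch of probing negation handling in a CLIP-style VLM via
# Hugging Face transformers. The public checkpoint below is chosen for
# illustration only and is not the model evaluated in the study.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Dummy image so the snippet runs without data; in practice this would be a real scan.
image = Image.new("RGB", (224, 224), color="white")

captions = [
    "a chest x-ray showing an enlarged heart",   # affirmative caption
    "a chest x-ray showing no enlarged heart",   # negated caption
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity, shape [1, 2]

probs = logits.softmax(dim=-1).squeeze()
# Nearly identical scores for contradictory captions would indicate that
# the negation ("no") is effectively being ignored.
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```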
To mitigate this, the team created a new data set using large language models to generate captions with natural-sounding negations. Fine-tuning VLMs with these augmented data sets led to a 10% improvement in retrieval tasks and a 30% boost in multiple-choice performance.
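As a rough illustration of that augmentation idea (not the authors' actual pipeline), the sketch below asks an LLM to rewrite an existing caption so that it naturally mentions an absent finding. The model name, prompt wording, and negate_caption helper are assumptions made for the example.

```python
# Hedged illustration of generating natural-sounding negated captions
# with an LLM for data augmentation; not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def negate_caption(caption: str, absent_object: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rewrite a caption so it also states what is absent."""
    prompt = (
        f"Rewrite this image caption so it naturally mentions that the image "
        f"contains no {absent_object}. Keep it fluent and concise.\n\n"
        f"Caption: {caption}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# The augmented caption can then be paired with the original image to
# fine-tune a vision-language model on negated descriptions.
print(negate_caption("a chest x-ray of an adult patient", "signs of an enlarged heart"))
```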
Still, the authors acknowledge that the solution is preliminary. “We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem,” Alhamoud said.
The research, coauthored with collaborators from OpenAI, Oxford University, and MIT’s Computer Science and Artificial Intelligence Laboratory, will be presented at the Conference on Computer Vision and Pattern Recognition.