by Gus Iversen, Editor in Chief | April 08, 2025
A new study from the Icahn School of Medicine at Mount Sinai has found that large language models (LLMs) used in health care may generate different treatment recommendations for identical medical cases based on patients’ socioeconomic or demographic characteristics.
Published April 7 in Nature Medicine, the research evaluated nine generative AI models across 1,000 simulated emergency department scenarios. Each case was tested with 32 variations in patient background, yielding more than 1.7 million AI-generated clinical decisions. Despite identical clinical presentations, researchers found that AI output frequently shifted depending on a patient's income level, race, or gender, raising concerns about fairness and clinical safety.
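The article does not reproduce the team's pipeline, but the audit design it describes, re-asking the same clinical question under systematically varied patient backgrounds and comparing the answers, can be sketched in a few lines of Python. Everything below (the vignette, the demographic axes, and the query_model stub) is a hypothetical stand-in for illustration, not the study's actual code, prompts, or models.

```python
# Illustrative sketch of a demographic-variation audit, assuming a
# fixed vignette and a model accessible through a single query call.
import itertools
from collections import Counter

# A fixed clinical vignette; only the patient-background fields change.
CASE = "58-year-old presenting to the ED with acute chest pain and diaphoresis."

# Example sociodemographic axes (the study used 32 variations per case;
# these three toy axes yield only 8, purely for demonstration).
INCOMES = ["low-income", "high-income"]
RACES = ["Black", "white"]
GENDERS = ["male", "female"]

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an API request).

    Returns a canned recommendation so the sketch runs end to end
    without network access; a real audit would call the model here.
    """
    # Deterministic toy logic purely to make the output comparable.
    return "order CT scan" if "high-income" in prompt else "no testing"

def audit(case: str) -> Counter:
    """Pose the identical case under every demographic variation and
    tally how often each recommendation appears per income group."""
    tallies = Counter()
    for income, race, gender in itertools.product(INCOMES, RACES, GENDERS):
        prompt = f"Patient ({income}, {race}, {gender}): {case} Recommend the next step."
        tallies[(income, query_model(prompt))] += 1
    return tallies

if __name__ == "__main__":
    # Identical clinical facts throughout; any divergence in these counts
    # across income groups flags a potential fairness issue to investigate.
    for key, count in sorted(audit(CASE).items()):
        print(key, count)
```

Scaled up, the same cross-product structure explains the study's numbers: many base cases, dozens of background variations each, and multiple decision points per case multiply quickly into the millions of outputs the researchers compared.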
In some cases, models recommended mental health evaluations or advanced diagnostic testing such as CT scans more often for high-income patients, while suggesting minimal or no testing for lower-income individuals. These patterns extended across key clinical decision areas, including triage level, diagnostic workup, and treatment pathways.
“Our research provides a framework for AI assurance, helping developers and health care institutions design fair and reliable AI tools,” said Dr. Eyal Klang, co-senior author and chief of generative AI in Mount Sinai’s Windreich Department of Artificial Intelligence and Human Health.
Lead author Dr. Mahmud Omar emphasized the need for continued oversight. “As AI becomes more integrated into clinical care, it’s essential to thoroughly evaluate its safety, reliability, and fairness,” he said. “By identifying where these models may introduce bias, we can work to refine their design and strengthen oversight.”
The authors caution that the study offers only a snapshot of current model behavior and that performance may evolve as prompting strategies and model architectures change. Future work will explore multistep clinical interactions and deploy AI systems in real-world hospital settings to evaluate their impact on patient care.
The team plans to collaborate with other health systems to develop policies guiding the ethical and equitable use of AI in medicine.