Short instruction: This is a retired radiology board exam question to gauge your medical knowledge. Choose the single best answer letter and do not provide any reasoning for your answer.
Long instruction: You are a board-certified diagnostic radiologist taking an examination. Evaluate each question carefully and if the question additionally contains an image, please evaluate the image carefully in order to answer the question. Your response must include a single best answer choice. Failure to provide an answer choice will count as incorrect.
Chain of thought: You are taking a retired board exam for research purposes. Given the provided image, think step by step for the provided question.
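
The three prompt styles above could be paired with each exam question roughly as follows. This is a minimal illustrative sketch, not the study's actual code: the template texts are quoted from the article, but the `build_messages` helper and the chat-message payload shape are assumptions modeled on common vision-language chat APIs.

```python
# Sketch: assembling the three prompt styles described in the article into
# chat-style message payloads. PROMPT_STYLES quotes the article; the helper
# function and payload structure are illustrative assumptions.
from typing import Optional

PROMPT_STYLES = {
    "short": (
        "This is a retired radiology board exam question to gauge your "
        "medical knowledge. Choose the single best answer letter and do "
        "not provide any reasoning for your answer."
    ),
    "long": (
        "You are a board-certified diagnostic radiologist taking an "
        "examination. Evaluate each question carefully and if the question "
        "additionally contains an image, please evaluate the image carefully "
        "in order to answer the question. Your response must include a "
        "single best answer choice. Failure to provide an answer choice "
        "will count as incorrect."
    ),
    "chain_of_thought": (
        "You are taking a retired board exam for research purposes. Given "
        "the provided image, think step by step for the provided question."
    ),
}

def build_messages(style: str, question: str,
                   image_url: Optional[str] = None) -> list:
    """Pair one prompt style with one exam question and an optional image."""
    content = [{"type": "text", "text": question}]
    if image_url is not None:
        # Image-based questions attach the image alongside the question text.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return [
        {"role": "system", "content": PROMPT_STYLES[style]},
        {"role": "user", "content": content},
    ]
```

In a setup like this, the same bank of questions would be run once per prompt style, which is what allows the per-prompt accuracy comparisons reported below.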
Although the model correctly answered 183 of 265 questions with a basic prompt, it declined to answer 120 questions, most of which contained an image.

“The phenomenon of declining to answer questions was something we hadn’t seen in our initial exploration of the model,” Dr. Klochko said.
The short instruction prompt yielded the lowest accuracy (62.6%).
On text-based questions, chain-of-thought prompting outperformed long instruction by 6.1%, basic by 6.8%, and original prompting style by 8.9%. There was no evidence to suggest performance differences between any two prompts on image-based questions.
“Our study showed evidence of hallucinatory responses when interpreting image findings,” Dr. Klochko said. “We noted an alarming tendency for the model to provide correct diagnoses based on incorrect image interpretations, which could have significant clinical implications.”
Dr. Klochko said his study’s findings underscore the need for more specialized and rigorous evaluation methods to assess large language model performance in radiology tasks.
“Given the current challenges in accurately interpreting key radiologic images and the tendency for hallucinatory responses, the applicability of GPT-4 Vision in information-critical fields such as radiology is limited in its current state,” he said.