
New research led by a team at the Icahn School of Medicine at Mount Sinai in New York makes a troubling case for continued vigilance against the threat of “deepfake” medical evidence.
The researchers tested 17 practicing radiologists from six countries, asking the volunteers to distinguish real X-rays from AI-generated simulacra in a pool of 264 unique images. The results did not inspire confidence.
“Our study shows that these deepfake X-rays are realistic enough to deceive radiologists, the most highly trained medical image experts,” study lead author Dr. Mikael Tordjman, an MD and post-doctoral fellow at the Icahn School, said in a press statement, “even when they knew the AI-generated images existed.”
In a later test, the AI mockups also fooled one of the same multimodal large language models that was used to create them: OpenAI’s ChatGPT-4o.
Sound of fraud
Tordjman pursued this project out of genuine concern for the risks to patients, doctors, and countless other innocent bystanders. Realistic AI-generated medical imagery, he said, “creates a high-risk vulnerability to fraudulent litigation where, for example, a fabricated fracture may be indistinguishable from a real one.” The issue has already attracted the attention of legal experts seeking to keep juries from being tainted by exposure to similar AI frauds.
“There is also a significant cybersecurity risk if hackers were to gain access to hospital networks and inject synthetic images to manipulate patient diagnoses or cause widespread clinical chaos,” Tordjman said.
The 17 volunteer radiologists Tordjman’s team tested for the study, published Tuesday in the journal Radiology, were shown two different datasets. The first trial asked the volunteers to evaluate 154 static X-rays, half real radiographs and half ChatGPT-4o-generated forgeries (77 each). The second trial used RoentGen, a specialized diffusion model trained to produce realistic chest radiographs, which show organs such as the heart and lungs; volunteers were asked to sort a dataset of 110 images, 55 real and 55 fake.
Radiologists who were told that the datasets contained AI-generated images performed better than those who viewed them without any indication of the test’s real purpose, but still not well: the informed group averaged 75% accuracy, compared to only 41% for the uninformed group.
The study’s 17 individual radiologists, whose professional experience ranged from zero to 40 years on the job, scored between 58% and 92% accuracy on the ChatGPT-generated images and between 62% and 78% on the RoentGen-generated chest X-rays. Age and experience did not appear to be a factor in accuracy, but, for whatever reason, musculoskeletal radiologists proved significantly better at detecting fakes than other subspecialists.
A losing game (for humans and chatbots)
Tordjman and his team also ran their tests on four multimodal LLMs: ChatGPT-4o and ChatGPT-5, Google’s Gemini 2.5 Pro, and Meta’s Llama 4 Maverick. The bots performed worse than the humans on the mockups created by ChatGPT-4o (a particularly embarrassing showing for ChatGPT-4o itself), with accuracy ranging from about 57% to 85%.
When it came to RoentGen’s synthetic chest X-rays, the LLMs’ fake-detection accuracy varied slightly more widely, from 52% to 89%.
Tordjman said he hopes these findings will be used to build educational datasets and detection tools. “Deepfake medical images often look too perfect,” he said. “The bones are unusually smooth, the spine unnaturally straight, the lungs highly symmetrical, the blood vessel patterns exceedingly uniform, and the fractures appear unusually clean and consistent.”
You can take a version of the test yourself here. But don’t beat yourself up for a bad score. As someone who knows a lot about fraudsters and self-deception once said, “Life is one long failure of understanding.”