When AI Radiologists Get Confused: The Critical Challenge of VLM Robustness in Medical Diagnostics
Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math problems and recipe suggestions. Surely it could read an X-ray?
This isn’t a hypothetical anymore. We’re already there. An Australian study found that 9.9% of adults had used ChatGPT for health questions in just six months, with nearly 40% of non-users considering it. In another survey, over 21% were turning to AI chatbots for health information. And it’s not just patients. A Wolters Kluwer survey found that 40% of U.S. physicians are ready to use generative AI when treating patients, with 68% believing it would help their practice. What’s fascinating is who’s using these tools most. The Australian data revealed that people with limited health literacy use ChatGPT at nearly twice the rate of others (18.4% vs 9.4%). Those from non-English speaking backgrounds? Even higher at 29.2%. And here’s what really caught our attention in that study: when people get health advice from ChatGPT, nearly half simply follow it. No questions asked, no double-checking with their doctor.
These numbers are only going up. Users love that ChatGPT is always available and costs nothing, two features that topped their list of reasons for using it. You can’t see a doctor at 3 AM on a Sunday, but ChatGPT never sleeps. Google’s making it even easier: AI Overviews now pop up automatically in Google searches. No special app needed, no extra clicks. It’s just there.
Clinicians are jumping in too. Already 27% of pediatric providers use these tools for clinical documentation. Why? Because background listening and study summarization pilots are cutting their note-writing time in half, sometimes more. That’s hours back in their day. And the money flooding in tells its own story: funding for generative AI health startups exploded from $1.7 billion in 2022 to $14 billion in 2023. By 2030? Projected to hit $109 billion (VSP Futurist Series).
Our recent research reveals that when these sophisticated AI systems move from answering text questions to interpreting medical images, they become dangerously brittle. We evaluated 125 chest X-ray interpretations using state-of-the-art vision language models, including Google’s MedGemma 4B and GPT-4V. Simply changing “vascular dilation” to “vascular congestion” in a question made the same AI system provide completely different diagnoses for the same X-ray image.
On a 100-point scale measuring factual accuracy, completeness, and clinical relevance, we found performance scores ranging from 7 to 97 across different phrasings of the same chest X-ray questions. That’s not just inconsistent. When people are already following AI medical advice without verification, that level of variability is genuinely dangerous.
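To make that setup concrete, here is a minimal sketch of the kind of paraphrase-robustness check behind those numbers. The `ask_vlm` and `score` callables and the synonym table are illustrative placeholders standing in for a model API and a rubric grader, not our exact evaluation pipeline.

```python
# Minimal sketch of a paraphrase-robustness check. `ask_vlm` and `score`
# are hypothetical stand-ins for a VLM call and a 0-100 rubric grader.

SYNONYM_SWAPS = {
    "vascular dilation": "vascular congestion",  # the two-word change from Case 1 below
    "lung volumes": "lung capacity",
    "x-ray": "chest radiograph",
}

def paraphrase(question: str) -> list[str]:
    """Return the original question plus synonym-swapped variants."""
    variants = [question]
    lowered = question.lower()
    for term, alternate in SYNONYM_SWAPS.items():
        if term in lowered:
            variants.append(lowered.replace(term, alternate))
    return variants

def robustness_gap(image_path: str, question: str, ask_vlm, score) -> float:
    """Spread between best and worst rubric scores across paraphrases.

    ask_vlm(image, prompt) returns the model's answer; score(answer) grades it
    on factual accuracy, completeness, and clinical relevance (0-100).
    """
    scores = [score(ask_vlm(image_path, q)) for q in paraphrase(question)]
    return max(scores) - min(scores)  # 0 means perfectly consistent answers
```

A gap of 90 points between the best and worst phrasing, as we observed, means the same image can yield both a near-perfect and a near-useless interpretation.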
The momentum behind medical AI is undeniable. Just this week, Mayo Clinic Press sent out an email titled “Health tips: AI in healthcare,” highlighting how AI is already being used for stroke diagnosis, heart failure detection, and cancer screening. They describe an optimistic future where “AI has the potential to improve the work of human healthcare teams, making care more personal and effective.” While this enthusiasm is understandable given AI’s promise, our research suggests we need to address fundamental robustness issues before these systems can truly deliver on that potential.
The Promise of the VLM-Based Radiologist
When we first started working with vision language models for medical imaging, the promise seemed clear. Unlike traditional AI that just spits out labels like “pneumonia: 87% probability,” these models could actually explain what they saw. They’d tell you why they thought something looked abnormal. You could ask follow-up questions. Feed them a patient’s history and watch them adjust their interpretation accordingly. For chest X-rays, which make up nearly half of all medical imaging worldwide, this felt revolutionary.
But here’s what we discovered matters just as much as accuracy: consistency. We call it robustness in the lab, but what it really means is whether the AI gives you the same answer when you ask the same question slightly differently. Think about it. A radiologist doesn’t suddenly see pneumonia just because you say “chest radiograph” instead of “X-ray.” They know “lung volumes” and “lung capacity” mean the same thing. The concerning mass in the upper left doesn’t vanish when you phrase your question differently.
Yet that’s exactly what happens with today’s most advanced models. And we’re not talking about edge cases or trick questions. We tested basic medical synonyms, the kind any first-year resident would recognize as identical. The models fell apart. What surprised us most wasn’t that they made mistakes. It was how confident they remained while giving completely contradictory diagnoses for the same image. This isn’t just a technical problem anymore. It’s a safety crisis hiding in plain sight.
When Terminology Becomes a Diagnostic Trap
Let’s walk through three real examples from our evaluation that show how catastrophically these models can fail.
Case 1: The Vascular Confusion
We showed a chest X-ray to one of the most advanced vision language models available. Asked about vascular findings. The model correctly identified pulmonary vascular dilation, which is exactly what we’d expect. It’s a widening of blood vessels that might indicate various conditions but isn’t immediately life-threatening.
Then we changed two words. Just two. “Vascular dilation” became “vascular congestion.”
Suddenly the model was talking about cardiac congestion. Possible heart failure. Recommending completely different follow-up procedures. Same image, nearly identical question, completely different medical pathway. The clinical implications hit us immediately. A patient might get rushed into unnecessary cardiac workup while their actual condition goes untreated. Or worse, someone might start urgent cardiac treatment for what’s actually a non-cardiac issue.
Case 2: The Imaginary Pneumonia
This one still makes us shake our heads. We had an X-ray showing clear pleural effusion. That’s fluid around the lungs, often serious enough to need drainage. The model saw it correctly when we asked about lung findings.
But when we added the phrase “chest radiograph” to our question? The model suddenly “saw” pneumonia that wasn’t there.

It didn’t just add pneumonia to its diagnosis. It completely forgot about the pleural effusion and started recommending antibiotics. This isn’t just wrong. It’s actively harmful. A patient with fluid crushing their lungs needs drainage, not antibiotics for an infection that doesn’t exist.
Case 3: The Vanishing Diagnosis
Perhaps most concerning was when changing “lung volumes” to “lung capacity” made critical findings disappear entirely. The model went from correctly identifying pleural effusion and potential cardiac issues to completely missing the effusion and focusing only on cardiac problems.

Pleural effusion can kill you if it’s not treated. It can lead to respiratory failure. Yet a simple synonym made the AI blind to its presence. The model confidently described other findings while missing the one thing that might send someone to the ICU.
What makes these failures so unsettling is their unpredictability. You can’t train staff to avoid certain phrases or create a list of “safe” terminology. The brittleness runs deeper than that. And remember, these aren’t adversarial attacks or clever exploits. These are normal medical conversations using standard terminology. The kind of variations that happen naturally when a tired resident talks to an AI at 3 AM, or when a patient describes their symptoms in their own words.
Making Sense of the Brittleness: What We’re Learning in the Lab
The brittleness we observed in chest X-ray VLMs sent us down a research rabbit hole. How could models that seem so sophisticated fail so spectacularly when we barely changed our words? We needed a systematic way to measure this vulnerability, which led us to develop VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) here at the SAIL Lab at University of New Haven.
VSF-Med isn’t just another benchmark. It’s our attempt to quantify exactly how and why these models break. We evaluated 68,478 attack scenarios across five models including CheXagent 8B, Llama 3.2 11B Vision, and GPT-4o. The framework measures vulnerability across nine different attack vectors, giving us concrete numbers for what we’d been observing anecdotally.
What we found was sobering. Even CheXagent 8B, a model specifically trained for medical imaging, showed moderate vulnerability (z-score: 0.68) to our prompt injection attacks. That “dilation vs congestion” problem we showed you earlier? Not an isolated incident. Medical-specialized models demonstrated only 36% better resilience compared to general-purpose VLMs. Better, yes. Good enough for clinical use? Not even close.
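For readers who want to see what a vulnerability z-score looks like in practice, here is a simplified illustration. The numbers and the nine-column layout are invented for the example; the actual scoring dimensions, weights, and aggregation are defined in the VSF-Med paper.

```python
# Illustrative computation of a model-level vulnerability z-score.
# The values below are made up; the real dimensions are in the VSF-Med paper.
import numpy as np

# rows = models, columns = nine attack vectors; each cell is a mean 0-1
# vulnerability score over that model's attack scenarios.
vuln = np.array([
    [0.21, 0.35, 0.18, 0.40, 0.27, 0.22, 0.31, 0.19, 0.25],  # medical-tuned model
    [0.44, 0.52, 0.39, 0.61, 0.48, 0.41, 0.55, 0.37, 0.46],  # general-purpose VLM
    [0.38, 0.47, 0.33, 0.58, 0.42, 0.36, 0.50, 0.31, 0.40],  # another general VLM
])

per_model = vuln.mean(axis=1)                            # overall vulnerability per model
z_scores = (per_model - per_model.mean()) / per_model.std()
spread = per_model.std()                                 # the "vulnerability spread"
print(z_scores, spread)
```

A z-score near zero means a model is about as vulnerable as the average of the evaluated set; positive values mean more vulnerable than its peers.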
So why are these models so fragile? We have some theories we’re exploring. Many vision-language models use contrastive learning during training, where they learn to match images with text descriptions. This approach might create brittle associations between specific phrases and visual features. When you change the phrasing, you might be triggering completely different learned pathways in the model. It’s like the model memorized specific image-text pairs rather than truly understanding the underlying medical concepts.
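To see why contrastive training can bake in phrase-level brittleness, here is a bare-bones CLIP-style objective. It is a generic sketch, not the training code of any model we evaluated: every non-matching caption in the batch is treated as a negative, including captions that are medically synonymous with the correct one.

```python
# Bare-bones CLIP-style contrastive objective (symmetric InfoNCE).
# Generic sketch to illustrate the "brittle association" hypothesis.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Each image is pulled toward *its* caption and pushed away from all others,
    # even captions that describe the same finding with a synonym
    # ("vascular dilation" vs "vascular congestion").
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

If synonymous phrasings land in different regions of the text embedding space during training, nothing in this objective pulls them back together, which is consistent with the phrase sensitivity we observed.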
There’s also the alignment problem. These models are often fine-tuned to be helpful and responsive, which might make them overeager to provide different answers when prompted differently. They’re trying so hard to be useful that they forget to be consistent. And medical imaging adds another layer of complexity. Radiological language is precise but also full of synonyms and alternative phrasings. Models trained on general text might not grasp that “increased opacity” and “increased density” mean the same thing in a chest X-ray context.
We’re also investigating whether the problem runs deeper than training. The transformer architecture itself might contribute to this sensitivity. Attention mechanisms that work beautifully for general language tasks might amplify small prompt variations when dealing with the high-stakes precision required in medical diagnosis.
But here’s the thing: we don’t have all the answers yet. This brittleness might stem from fundamental limitations in how current AI systems process multimodal information. Or it could be something we can fix with better training approaches. That’s what keeps us coming back to the lab each day.
Our VSF-Med framework is completely open source because we believe this problem is too important for any single team to tackle alone. We’ve made it so researchers anywhere can benchmark their medical VLM with a single command, generating over 30,000 adversarial test cases automatically. The data shows that current best-in-class models still have vulnerability spreads exceeding 0.3 standard deviations. Until we can get that number much, much lower, these systems remain too fragile for autonomous clinical use.
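As a rough illustration of how a handful of question templates can fan out into tens of thousands of test cases, here is a hypothetical generator. The template and perturbation lists below are examples for exposition, not the actual VSF-Med attack vectors or command-line interface; see the repository for those.

```python
# Hypothetical sketch of adversarial test-case expansion: templates x
# perturbations x images. Not the actual VSF-Med generator.
import itertools

QUESTION_TEMPLATES = [
    "What vascular findings are present on this {modality}?",
    "Comment on the {volume_term} in this {modality}.",
]
PERTURBATIONS = {
    "modality": ["chest X-ray", "chest radiograph", "CXR"],
    "volume_term": ["lung volumes", "lung capacity"],
}

def generate_cases(image_ids):
    cases = []
    for image_id, template in itertools.product(image_ids, QUESTION_TEMPLATES):
        keys = [k for k in PERTURBATIONS if "{" + k + "}" in template]
        for values in itertools.product(*(PERTURBATIONS[k] for k in keys)):
            prompt = template.format(**dict(zip(keys, values)))
            cases.append({"image": image_id, "prompt": prompt})
    return cases

# A few hundred images x a few dozen templates x several perturbations per
# slot multiplies out to the tens of thousands of cases mentioned above.
```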
What’s next? We’re diving deeper into architectural modifications that might improve robustness. We’re exploring whether different training objectives could create more stable image-text associations. And we’re working with clinicians to understand which types of brittleness pose the greatest real-world risks. Because ultimately, this isn’t just an interesting technical puzzle. It’s about making sure AI tools genuinely help rather than harm when lives are on the line.
Citation: Please use the citation below to cite our work.
@article{sadanandan2025vsf,
  title={VSF-Med: A Vulnerability Scoring Framework for Medical Vision-Language Models},
  author={Sadanandan, Binesh and Behzadan, Vahid},
  journal={arXiv preprint arXiv:2507.00052},
  year={2025}
}
