Key Takeaways
- A 2026 BMJ Open audit of five popular AI chatbots found 49.6% of health responses were problematic, with 19.6% rated highly problematic.[1]
- Grok produced the highest rate of problematic responses (58%), while Gemini produced the lowest.[2]
- Citation quality was poor across all models — average reference completeness was just 40%, and no chatbot provided a fully accurate reference list.[1]
- A separate study of fabricated clinical scenarios found hallucination rates of 50–82%, even when models were given safety prompts.[4]
- Open-ended health questions produced 32% highly problematic responses; closed yes/no questions produced only 7.2%.[1]
The Trust Gap Between Confidence and Accuracy
Millions of people now type health questions into ChatGPT, Gemini, Grok, Meta AI, and Claude every day. The replies arrive in seconds — articulate, organized, and confident. They look like the kind of explanation a thoughtful clinician might give.
That confidence isn't matched by the accuracy of the answers. A new audit published in BMJ Open in April 2026 put 50 health questions to each of five popular chatbots and found that nearly half of the 250 responses were problematic.[1] Some answers were partly wrong. Others were dangerously wrong. Almost all of them sounded equally certain.
This article translates that research into rules you can actually use. The goal isn't to scare you away from AI tools — it's to help you tell the difference between when they're useful and when they aren't.
What the BMJ Study Actually Found
The audit covered five chatbots: ChatGPT, Gemini, Grok, Meta AI, and Claude. Researchers asked each one 50 health questions across five topic areas — cancer, vaccines, stem cell therapies, nutrition, and athletic performance.[1] The questions were a mix of open-ended ("How should I treat my sore throat?") and closed yes/no formats.
Out of 250 total responses, 49.6% were rated problematic by independent reviewers: 30% were "somewhat problematic" and 19.6% were "highly problematic," meaning they contained outright misinformation or could plausibly cause harm if acted on.[3]
Performance varied by model. Grok produced the highest rate of problematic responses at 58%, followed by ChatGPT at 52% and Meta AI at 50%; Gemini had the lowest rate of the group, with Claude somewhere in between.[2] None of them came close to error-free.
The topic mattered too. Vaccines and cancer — areas where the medical evidence is well-established and the scientific consensus is strong — produced the most reliable answers. Stem cell therapies, athletic performance, and nutrition produced the worst.[3] Those are exactly the areas where pseudoscience and marketing hype crowd the training data.
One more striking finding: across all 250 responses, the chatbots declined to answer only twice. The rest of the time, they answered with confidence, even on topics where a careful clinician would have expressed uncertainty or recommended seeing a doctor.[1]
The Three Mechanisms Behind AI Errors
To use these tools safely, it helps to understand why they fail. The errors aren't random glitches that the next software update will fix. They are built into how large language models work.
Pattern matching, not reasoning. Chatbots don't evaluate evidence the way a clinician does. They generate text by predicting which word is statistically likely to come next, based on the patterns in their training data.[4] If a question pattern looks similar to one they have seen before, the answer will sound plausible whether or not the underlying medical facts are correct.
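To make that concrete, here is a deliberately toy sketch of next-token prediction. The words and probabilities are invented for illustration, not taken from any real model. The point it shows: the most statistically likely continuation wins, and nothing in the loop checks the resulting claim against medical evidence.

```python
# Toy next-token predictor. The probability table is invented for
# illustration; real models learn theirs from vast text corpora.
next_word_probs = {
    ("vitamin", "C"): {"boosts": 0.40, "supports": 0.35, "cures": 0.15, "harms": 0.10},
}

def predict_next(context):
    """Return the statistically most likely next word for a two-word context.

    Note what is absent: there is no lookup against clinical evidence,
    only a frequency-based guess. A plausible-sounding word wins whether
    or not the resulting sentence is medically true.
    """
    probs = next_word_probs[context]
    return max(probs, key=probs.get)

print("vitamin C", predict_next(("vitamin", "C")))  # prints: vitamin C boosts
```

Scaled up across billions of learned patterns, the same loop produces whole paragraphs that read like expertise without a single step of evidence-checking.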
Sycophancy. Modern chatbots are fine-tuned on human feedback, and one of the things human raters tend to reward is agreement. As a result, the models lean toward telling you what you seem to want to hear.[4] If you ask a question that contains a wrong assumption — for example, "Why does my high-dose vitamin C protect me from cancer?" — a chatbot may reinforce the assumption rather than correct it.
Citation hallucination. The BMJ audit found that average reference completeness across all five chatbots was only 40%, and not a single model produced a fully accurate reference list.[1] A separate analysis of more than 500 AI-generated citations from ChatGPT and similar tools found that only 32% were fully accurate, with nearly half at least partly fabricated.[7] Many references look real — author names, journal titles, page numbers — but point to papers that don't exist or don't say what the chatbot claims.
Eight Practical Safety Rules
The following rules, synthesized from several 2025–2026 studies, give you a working framework for using AI chatbots without getting burned.
- Never use AI for emergency triage. A Mount Sinai study found ChatGPT Health under-triaged 52% of genuine emergencies, often steering people away from urgent care when they needed it most.[5] If something feels urgent, call 911 or go to an ER — don't ask a chatbot.
- Treat the answer like a Wikipedia entry, not a doctor visit. AI responses are useful for orientation and vocabulary. They are not useful for decisions about your specific care.
- Verify every citation. Search the actual journal database (PubMed, the publisher's site) before trusting a reference; a minimal programmatic version of this check appears after this list. The BMJ audit found roughly 60% of references were incomplete, and other work has shown nearly half are at least partly fabricated.[1][7]
- Be more skeptical when the chatbot agrees with you. Sycophancy means a model is more likely to mirror your assumption back to you than to challenge it. Agreement is not validation.[4]
- Closed questions are safer than open-ended ones. "Is amoxicillin used to treat strep throat?" gives more reliable answers than "How should I treat my sore throat?" The audit found closed questions produced only 7.2% highly problematic responses, compared to 32% for open-ended ones.[1]
- Never share your full medical history with public chatbots. Privacy matters, and so does accuracy — feeding the model more personal detail does not make it a safer tool, only a more confident one.
- Cross-check anything that affects a decision. If a chatbot answer would change what you eat, take, or do — or whether you seek care — verify it with a clinician before acting on it.
- Watch for confident certainty without caveats. Real medicine is full of "it depends." If a chatbot sounds more sure than your doctor would, that's a warning sign, not a reassurance.
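For the citation rule above, one workable check is to query PubMed's public E-utilities search endpoint for a reference's title and see whether any indexed record matches. The `esearch.fcgi` endpoint and its `db`, `term`, and `retmode` parameters are part of NCBI's documented interface; the example title below is hypothetical, and a title match alone is not enough. You should still confirm the authors, journal, and what the matched paper actually says.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (public; usage limits apply).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_has_title(title):
    """Return True if PubMed indexes at least one record matching the title."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f"{title}[Title]",  # restrict the match to the title field
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}", timeout=10) as resp:
        result = json.load(resp)["esearchresult"]
    return int(result.get("count", "0")) > 0

# Hypothetical citation a chatbot might produce; verify before trusting it.
title = "Effects of high-dose vitamin C on tumor progression"
print("Found in PubMed:" if pubmed_has_title(title) else "No PubMed match:", title)
```

A reference that fails this lookup is not automatically fake (titles get truncated or paraphrased), but it is a strong signal to track down the original before repeating the claim.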
The Evidence at a Glance
Several recent studies, taken together, paint a consistent picture: AI chatbots fail in predictable ways across health topics, settings, and tasks.
| Study | Setting | Key Finding |
|---|---|---|
| BMJ Open Audit, 2026[1] | 250 health prompts, 5 chatbots | 49.6% problematic; 19.6% highly problematic |
| BMJ Open by chatbot, 2026[2] | Per-model breakdown | Grok 58% problematic (highest); Gemini lowest |
| Hallucination Study, 2026[4] | 300 fabricated clinical scenarios | 50–82% hallucination rate; even GPT-4o at 53% |
| Mount Sinai Triage, 2026[5] | Emergency triage scenarios | AI under-triaged 52% of genuine emergencies |
| Reference Accuracy Study[7] | 500+ AI-generated citations | Only 32% fully accurate; nearly half at least partly fabricated |
| Frontiers in Digital Health, 2025[6] | Guideline concordance, lumbosacral radicular pain | Best model (Perplexity) 67% match; ChatGPT/Claude 33% |
Where AI Can Actually Help
The studies don't say AI chatbots are useless. They say the tools are being used for the wrong jobs. Reasonable, lower-risk uses include:
- Translating medical jargon — a discharge summary, an imaging report, a lab printout — into plain language
- Generating questions to bring to your next appointment
- Summarizing what a clinical guideline broadly covers, before you read the guideline itself
- Reviewing a medication list for general drug class and side effect categories
- Preparing for a procedure — what to expect on the day, common recovery timelines, what to ask in advance
In each of these cases, the chatbot is doing translation or summarization rather than diagnosis. The ground truth still lives in the source document or in your clinician's judgment.
The Bottom Line
AI chatbots are powerful information tools that fail in predictable, well-documented ways. The pattern across the 2025–2026 evidence is clear: confidence is not accuracy. A model that sounds sure is not a model that has reasoned its way to the right answer.
Use these tools as a starting point, not an ending point. For anything that affects what you eat, take, or do — and especially for anything that touches on whether you should seek emergency care — talk to a clinician before acting on what an AI told you.
Telehealth makes that conversation easy. A short video visit with a board-certified physician can confirm or correct what a chatbot said, often the same day. That extra step is the safety net the chatbot itself cannot provide.
References
- BMJ Group. "Substantial amount of medical information provided by popular chatbots inaccurate and incomplete." BMJ Group News, April 2026. bmjgroup.com
- eWeek. "BMJ Open: AI Chatbots Health Misinformation Study." 2026. eweek.com
- News-Medical. "Study finds popular AI chatbots often give problematic health advice." April 16, 2026. news-medical.net
- PA Media. "AI chatbots often hallucinate and give inaccurate medical information, study finds." 2026. pa.media
- Mount Sinai Health System. "Research identifies blind spots in AI medical triage." February 2026. mountsinai.org
- "AI chatbots versus clinical practice guidelines for lumbosacral radicular pain." Frontiers in Digital Health. 2025;7:1574287. frontiersin.org
- Physician Leaders. "Hallucinations and Fabricated Citations in AI: What Physicians Need to Know." Physician Leadership Journal. physicianleaders.org