Key Takeaways
- A 2026 BMJ Open audit of five popular AI chatbots found 49.6% of health responses were problematic, with 19.6% rated highly problematic.[1]
- Grok produced the highest rate of problematic responses (58%), while Gemini produced the lowest.[2]
- Citation quality was poor across all models — average reference completeness was just 40%, and no chatbot provided a fully accurate reference list.[1]
- A separate study of fabricated clinical scenarios found hallucination rates of 50–82%, even when models were given safety prompts.[4]
- Open-ended health questions produced 32% highly problematic responses; closed yes/no questions produced only 7.2%.[1]
The Trust Gap Between Confidence and Accuracy
Millions of people now type health questions into ChatGPT, Gemini, Grok, Meta AI, and Claude every day. The replies arrive in seconds — articulate, organized, and confident. They look like the kind of explanation a thoughtful clinician might give.
That confidence isn't matched by the accuracy of the answers. A new audit published in BMJ Open in April 2026 put 50 health questions to each of five popular chatbots and found that nearly half of the 250 responses were problematic.[1] Some answers were partly wrong. Others were dangerously wrong. Almost all of them sounded equally certain.
This article translates that research into rules you can actually use. The goal isn't to scare you away from AI tools — it's to help you tell the difference between when they're useful and when they aren't.
What the BMJ Study Actually Found
The audit covered five chatbots: ChatGPT, Gemini, Grok, Meta AI, and Claude. Researchers asked each one 50 health questions across five topic areas — cancer, vaccines, stem cell therapies, nutrition, and athletic performance.[1] The questions were a mix of open-ended ("How should I treat my sore throat?") and closed yes/no formats.
Out of 250 total responses, 49.6% were rated problematic by independent reviewers: 30% were "somewhat problematic" and 19.6% were "highly problematic," meaning they contained outright misinformation or could plausibly cause harm if acted on.[3]
Performance varied by model. Grok produced the highest rate of problematic responses at 58%, followed by ChatGPT at 52% and Meta AI at 50%; Gemini had the lowest rate of the group, with Claude somewhere in between.[2] None of them came close to error-free.
The topic mattered too. Vaccines and cancer — areas where the medical evidence is well-established and the scientific consensus is strong — produced the most reliable answers. Stem cell therapies, athletic performance, and nutrition produced the worst.[3] Those are exactly the areas where pseudoscience and marketing hype crowd the training data.
One more striking finding: across all 250 responses, the chatbots declined to answer only twice. The rest of the time, they answered with confidence, even on topics where a careful clinician would have expressed uncertainty or recommended seeing a doctor.[1]
The Three Mechanisms Behind AI Errors
To use these tools safely, it helps to understand why they fail. The errors aren't random glitches that the next software update will fix. They are built into how large language models work.
Pattern matching, not reasoning. Chatbots don't evaluate evidence the way a clinician does. They generate text by predicting which word is statistically likely to come next, based on the patterns in their training data.[4] If a question pattern looks similar to one they have seen before, the answer will sound plausible whether or not the underlying medical facts are correct.
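To make that concrete, here is a deliberately toy sketch of next-token prediction. The words and probabilities are invented for illustration, not taken from any real model. The point it shows: the most statistically likely continuation wins, and nothing in the loop checks the resulting claim against medical evidence.

```python
# Toy next-token predictor. The probability table is invented for
# illustration; real models learn theirs from vast text corpora.
next_word_probs = {
    ("vitamin", "C"): {"boosts": 0.40, "supports": 0.35, "cures": 0.15, "harms": 0.10},
}

def predict_next(context):
    """Return the statistically most likely next word for a two-word context.

    Note what is absent: there is no lookup against clinical evidence,
    only a frequency-based guess. A plausible-sounding word wins whether
    or not the resulting sentence is medically true.
    """
    probs = next_word_probs[context]
    return max(probs, key=probs.get)

print("vitamin C", predict_next(("vitamin", "C")))  # prints: vitamin C boosts
```

Scaled up across billions of learned patterns, the same loop produces whole paragraphs that read like expertise without a single step of evidence-checking.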
Sycophancy. Modern chatbots are fine-tuned on human feedback, and one of the things human raters tend to reward is agreement. As a result, the models lean toward telling you what you seem to want to hear.[4] If you ask a question that contains a wrong assumption — for example, "Why does my high-dose vitamin C protect me from cancer?" — a chatbot may reinforce the assumption rather than correct it.
Citation hallucination. The BMJ audit found that average reference completeness across all five chatbots was only 40%, and not a single model produced a fully accurate reference list.[1] A separate analysis of more than 500 AI-generated citations from ChatGPT and similar tools found that only 32% were fully accurate, with nearly half at least partly fabricated.[7] Many references look real — author names, journal titles, page numbers — but point to papers that don't exist or don't say what the chatbot claims.
Eight Practical Safety Rules
The following rules, synthesized from several 2025–2026 studies, give you a working framework for using AI chatbots without getting burned.
- Never use AI for emergency triage. A Mount Sinai study found ChatGPT Health under-triaged 52% of genuine emergencies, often steering people away from urgent care when they needed it most.[5] If something feels urgent, call 911 or go to an ER — don't ask a chatbot.
- Treat the answer like a Wikipedia entry, not a doctor visit. AI responses are useful for orientation and vocabulary. They are not useful for decisions about your specific care.
- Verify every citation. Search the actual journal database (PubMed, the publisher's site) before trusting a reference; a minimal programmatic version of this check appears after this list. The BMJ audit found roughly 60% of references were incomplete, and other work has shown nearly half are at least partly fabricated.[1][7]
- Be more skeptical when the chatbot agrees with you. Sycophancy means a model is more likely to mirror your assumption back to you than to challenge it. Agreement is not validation.[4]
- Closed questions are safer than open-ended ones. "Is amoxicillin used to treat strep throat?" gives more reliable answers than "How should I treat my sore throat?" The audit found closed questions produced only 7.2% highly problematic responses, compared to 32% for open-ended ones.[1]
- Never share your full medical history with public chatbots. Privacy matters, and so does accuracy — feeding the model more personal detail does not make it a safer tool, only a more confident one.
- Cross-check anything that affects a decision. If a chatbot answer would change what you eat, take, or do — or whether you seek care — verify it with a clinician before acting on it.
- Watch for confident certainty without caveats. Real medicine is full of "it depends." If a chatbot sounds more sure than your doctor would, that's a warning sign, not a reassurance.
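For the citation rule above, one workable check is to query PubMed's public E-utilities search endpoint for a reference's title and see whether any indexed record matches. The `esearch.fcgi` endpoint and its `db`, `term`, and `retmode` parameters are part of NCBI's documented interface; the example title below is hypothetical, and a title match alone is not enough. You should still confirm the authors, journal, and what the matched paper actually says.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (public; usage limits apply).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_has_title(title):
    """Return True if PubMed indexes at least one record matching the title."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f"{title}[Title]",  # restrict the match to the title field
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}", timeout=10) as resp:
        result = json.load(resp)["esearchresult"]
    return int(result.get("count", "0")) > 0

# Hypothetical citation a chatbot might produce; verify before trusting it.
title = "Effects of high-dose vitamin C on tumor progression"
print("Found in PubMed:" if pubmed_has_title(title) else "No PubMed match:", title)
```

A reference that fails this lookup is not automatically fake (titles get truncated or paraphrased), but it is a strong signal to track down the original before repeating the claim.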
The Evidence at a Glance
Several recent studies, taken together, paint a consistent picture: AI chatbots fail in predictable ways across health topics, settings, and tasks.
| Study | Setting | Key Finding |
|---|---|---|
| BMJ Open Audit, 2026[1] | 250 health prompts, 5 chatbots | 49.6% problematic; 19.6% highly problematic |
| BMJ Open by chatbot, 2026[2] | Per-model breakdown | Grok 58% problematic (highest); Gemini lowest |
| Hallucination Study, 2026[4] | 300 fabricated clinical scenarios | 50–82% hallucination rate; even GPT-4o at 53% |
| Mount Sinai Triage, 2026[5] | Emergency triage scenarios | AI under-triaged 52% of genuine emergencies |
| Reference Accuracy Study[7] | 500+ AI-generated citations | Only 32% fully accurate; nearly half at least partly fabricated |
| Frontiers in Digital Health, 2025[6] | Guideline concordance, lumbosacral radicular pain | Best model (Perplexity) 67% match; ChatGPT/Claude 33% |
Where AI Can Actually Help
The studies don't say AI chatbots are useless. They say the tools are being used for the wrong jobs. Reasonable, lower-risk uses include:
- Translating medical jargon — a discharge summary, an imaging report, a lab printout — into plain language
- Generating questions to bring to your next appointment
- Summarizing what a clinical guideline broadly covers, before you read the guideline itself
- Reviewing a medication list for general drug class and side effect categories
- Preparing for a procedure — what to expect on the day, common recovery timelines, what to ask in advance
In each of these cases, the chatbot is doing translation or summarization rather than diagnosis. The ground truth still lives in the source document or in your clinician's judgment.
The Bottom Line
AI chatbots are powerful information tools that fail in predictable, well-documented ways. The pattern across the 2025–2026 evidence is clear: confidence is not accuracy. A model that sounds sure is not a model that has reasoned its way to the right answer.
Use these tools as a starting point, not an ending point. For anything that affects what you eat, take, or do — and especially for anything that touches on whether you should seek emergency care — talk to a clinician before acting on what an AI told you.
Telehealth makes that conversation easy. A short video visit with a board-certified physician can confirm or correct what a chatbot said, often the same day. That extra step is the safety net the chatbot itself cannot provide.
References
- BMJ Group. "Substantial amount of medical information provided by popular chatbots inaccurate and incomplete." BMJ Group News, April 2026. bmjgroup.com
- eWeek. "BMJ Open: AI Chatbots Health Misinformation Study." 2026. eweek.com
- News-Medical. "Study finds popular AI chatbots often give problematic health advice." April 16, 2026. news-medical.net
- PA Media. "AI chatbots often hallucinate and give inaccurate medical information, study finds." 2026. pa.media
- Mount Sinai Health System. "Research identifies blind spots in AI medical triage." February 2026. mountsinai.org
- "AI chatbots versus clinical practice guidelines for lumbosacral radicular pain." Frontiers in Digital Health. 2025;7:1574287. frontiersin.org
- Physician Leaders. "Hallucinations and Fabricated Citations in AI: What Physicians Need to Know." Physician Leadership Journal. physicianleaders.org