Why does AI hallucinate medical citations so often?

Large language models generate text by predicting statistically likely next tokens — not by retrieving verified facts from a database. When asked to produce a citation, the model constructs something that looks like a citation based on patterns in its training data. It may correctly identify real authors who work in a field, real journal names, and plausible dates, but combine them into a reference that never actually existed. This is why hallucinated citations are often convincing — they are assembled from real components in a plausible format.

Is AI ever reliably accurate in medical contexts?

Yes. AI performs considerably better on tasks that stay within its training corpus and do not require precise bibliographic recall. Research shows AI models perform well at answering general medical knowledge questions about high-prevalence conditions, identifying potential drug interactions when grounded in vetted databases, and drafting administrative documentation under physician review. The key distinction is between tasks requiring accurate recall of specific bibliographic details (where hallucination rates are high) and general reasoning about well-documented medical topics (where accuracy is considerably better).

How should clinicians use AI tools safely given hallucination risks?

Published guidance and the FDA Digital Health Advisory Committee recommendations both emphasize a human-in-the-loop approach. Practically, this means: never accepting AI-generated citations without independently verifying each reference in PubMed or a comparable indexed database; using AI for drafting and synthesis but treating all factual claims as provisional until verified; being especially careful with rare-disease information and any content outside the high-prevalence conditions that are well represented in training data; and using AI clinical documentation tools within structured clinician review workflows rather than submitting unedited output.

Hallucination Rates in Medical AI: What the Citation Fabrication Data Tells Us

Key Takeaways

GPT-3.5 fabricates 55% of bibliographic citations in literature reviews; GPT-4 reduces this to 18% — still a meaningful error rate for clinical use.^[1]
Bard (Google's early LLM) hallucinated 91.4% of medical references in a systematic review replication study, with zero relevant papers correctly retrieved.^[3]
Clinical documentation hallucination runs lower — 1.47% per sentence — but 44% of those errors are classified as clinically significant enough to affect diagnosis or management.^[5]
A global clinician survey found 91.8% had encountered medical AI hallucinations, and 84.7% believed they could directly cause patient harm.^[6]
AI performs considerably better on general clinical reasoning than on bibliographic recall — the gap between tasks shows where human verification is essential versus where AI adds reliable value.^[1]^[5]

Figure: Bar chart of AI citation fabrication rates (GPT-3.5 at 55%, GPT-4 at 18%, Bard at 91%). Citation fabrication rates vary dramatically across AI models. GPT-4 cut GPT-3.5's rate by two-thirds, yet even 18% fabrication is clinically unacceptable without verification. Data from Walters & Wilder (Sci Rep 2023) and Chelli et al. (JMIR 2024).

What "Hallucination" Means in Medical AI — and What It Does Not

In AI research, a hallucination is not a simple factual error. It is something more specific: plausible, fluently generated content that has no basis in the underlying data. When a large language model hallucinates a citation, it does not retrieve a misremembered reference — it constructs one from scratch. The author names are real researchers who publish in that field. The journal exists. The title sounds right. The DOI is formatted correctly. The paper does not exist anywhere in any database.

This is what makes AI hallucination in medicine genuinely different from other types of AI errors. A wrong number in a calculation is obvious. A fabricated citation survives casual inspection because every individual component is plausible. Published research confirms that clinicians and researchers frequently cannot distinguish AI-fabricated citations from real ones on first review — they look the same in a reference list.

The distinction from simple errors also matters for how we respond. Factual errors in AI output can often be caught by general vigilance. Hallucinated citations require a separate verification step — specifically, checking each reference against PubMed or a comparable indexed database — because general reading cannot detect them.
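The verification step described above can be partly automated. The following is a minimal sketch, assuming a title-phrase search against NCBI's public E-utilities ESearch endpoint is an acceptable first pass. The function names (`build_title_query_url`, `pmids_from_response`, `needs_manual_review`) are hypothetical and not taken from any of the cited studies.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities ESearch endpoint (public PubMed search API).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_title_query_url(title: str) -> str:
    """Build an ESearch URL that searches PubMed for the title as a phrase."""
    params = {
        "db": "pubmed",
        "term": f'"{title}"[Title]',
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urllib.parse.urlencode(params)}"


def pmids_from_response(payload: str) -> list[str]:
    """Extract the list of matching PMIDs from an ESearch JSON response."""
    data = json.loads(payload)
    return data.get("esearchresult", {}).get("idlist", [])


def needs_manual_review(title: str, fetch=None) -> bool:
    """Return True when the title search finds no PubMed record.

    `fetch` is an injectable function (URL -> response text) so the logic
    can be exercised offline; by default it performs a live HTTP request.
    """
    if fetch is None:
        def fetch(url: str) -> str:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
    return len(pmids_from_response(fetch(build_title_query_url(title)))) == 0
```

A hit only shows that a paper with that title is indexed. Authors, year, volume, and pages still need to be compared by hand. A miss does not prove fabrication either, since not every journal is indexed in PubMed.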

AI hallucination in clinical documentation works somewhat differently. Here, the model is summarizing a real patient encounter transcript. Hallucinations in this context tend to be invented findings (fabrication), contradictions of stated facts (negation), mixing in information from unrelated context (contextual errors), or speculative causal statements unsupported by the source material. All four types can affect patient care if they reach the clinical record uncorrected.

The Citation Fabrication Numbers: Model by Model

The quantitative picture of AI citation hallucination began taking shape in 2023 with two independent studies published within months of each other. Both reached alarming conclusions about GPT-3.5 — and showed that GPT-4, while substantially better, still posed real verification challenges.

Walters and Wilder, writing in Scientific Reports, generated 84 short literature reviews across 42 multidisciplinary topics using GPT-3.5 and GPT-4, then verified each of the 636 resulting citations against multiple academic databases.^[1] The results were striking: 55% of GPT-3.5 citations were fabricated — they did not exist as published works. GPT-4 cut that number to 18%, a major improvement, but still meant that nearly one in five citations in a GPT-4 literature review was invented. Among non-fabricated GPT-3.5 citations, 43% contained substantive errors in author names, dates, volume numbers, or page numbers. Even GPT-4's real citations carried a 24% error rate.
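The fabrication rate and the error rate among real citations compound. As a rough worked calculation using only the percentages quoted above, the share of citations that are both real and free of substantive errors can be estimated as follows. The helper name `share_real_and_accurate` is illustrative.

```python
def share_real_and_accurate(fabricated_rate: float, error_rate_among_real: float) -> float:
    """Fraction of all citations that exist AND carry no substantive error."""
    real = 1.0 - fabricated_rate
    return real * (1.0 - error_rate_among_real)


# Rates as quoted in the paragraph above (Walters and Wilder).
gpt35 = share_real_and_accurate(fabricated_rate=0.55, error_rate_among_real=0.43)
gpt4 = share_real_and_accurate(fabricated_rate=0.18, error_rate_among_real=0.24)

print(f"GPT-3.5: {gpt35:.1%} of citations real and error-free")
print(f"GPT-4:   {gpt4:.1%} of citations real and error-free")
```

On these figures, only about a quarter of GPT-3.5 citations and a little over 60% of GPT-4 citations would be usable without correction.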

Bhattacharyya and colleagues, publishing in Cureus, took a slightly different approach — asking GPT-3.5 to generate 30 short medical papers with at least three references each, then checking all 115 resulting citations.^[2] The outcome was even more severe: 47% were complete fabrications. Only 7% were authentic and accurate. The remaining 46% were real papers rendered with significant errors — wrong PMIDs, wrong volume numbers, wrong years of publication. The mean number of inaccurate components per citation was 4.3 out of a possible seven fields evaluated.

Gravel and colleagues at Montreal's CHU Sainte-Justine tested ChatGPT across a diverse set of medical questions and found 41 of 59 evaluated references (69%) were fabricated — despite appearing deceptively credible.^[8] Most fabricated citations used names of authors with genuine publications in the field, real journal names, and coherent formatting. Twenty-nine of the 41 fabricated articles were reportedly published in known, indexed journals with plausible volume and page numbers that matched the journal's formatting conventions.

Figure: Spectrum of AI use cases, from more reliable tasks (drug interaction checks, administrative documentation) to verification-essential tasks (citation generation, rare disease literature). Not all AI tasks carry the same hallucination risk. The gap between citation generation (high-risk) and general medical reasoning (lower-risk) has practical implications for how clinicians and researchers should use these tools.

Evidence Comparison: Hallucination Rates Across Studies

The table below consolidates the primary quantitative findings from the major hallucination studies to date. Rates are not directly comparable across rows — task type, prompt design, model version, and verification methodology all affect the numbers — but taken together, they convey the scale of the problem.

Study	Model Tested	Task	N Items	Hallucination / Fabrication Rate	Citation
Walters & Wilder 2023	GPT-3.5	Literature review citations (multidisciplinary)	636 citations (both models combined)	55% fabricated	Sci Rep 2023^[1]
Walters & Wilder 2023	GPT-4	Literature review citations (multidisciplinary)	636 citations (both models combined)	18% fabricated	Sci Rep 2023^[1]
Bhattacharyya et al. 2023	GPT-3.5	Medical paper references (30 papers)	115 references	47% fabricated; 93% with ≥1 error	Cureus 2023^[2]
Chelli et al. 2024	GPT-3.5 / GPT-4 / Bard	Systematic review replication (rotator cuff)	471 references (33 prompts)	GPT-3.5: 39.6% · GPT-4: 28.6% · Bard: 91.4%	JMIR 2024^[3]
Lamiaa et al. 2024	GPT-3.5, Bing, Bard, Elicit, SciSpace, Perplexity	Medical reference generation (10 prompts)	500 references	ChatGPT 3.5 RHS = 11 (highest); Bard failed to generate any references; Reference relevancy hallucination: 61.6%	JMIR Med Inform 2024^[4]
Gravel et al. 2023	GPT-3.5 (ChatGPT)	Medical Q&A references (diverse topics)	59 references	69% fabricated	Mayo Clin Proc Digit Health 2023^[8]
Asgari et al. 2025	Multiple LLMs (clinical note generation)	Clinical documentation summarization	12,999 sentences, 450 notes	1.47% per sentence; 44% of those = clinically major	npj Digital Med 2025^[5]

Beyond Citations: Clinical Vignettes and Documentation Hallucination

Citation fabrication gets the most attention because it is easy to measure — a reference either exists or it does not. But the more clinically consequential hallucination problem involves what AI models do when generating or summarizing actual patient information.

Asgari and colleagues published what is among the most rigorous evaluations of clinical documentation hallucination to date, in npj Digital Medicine in 2025.^[5] They analyzed 12,999 clinician-annotated sentences from 450 AI-generated clinical notes across 18 experimental configurations, covering 49,590 transcript sentences. The headline hallucination rate of 1.47% per sentence sounds reassuringly low. The follow-up figure is less reassuring: 44% of those hallucinated sentences were classified as major, meaning errors that could directly affect patient diagnosis or management if left uncorrected.

The breakdown of hallucination types in Asgari's data reveals where the clinical danger concentrates. Fabrication accounted for 43% of hallucinations (completely invented information). Negation — where the model contradicts a stated clinical fact, for example documenting that a symptom was absent when the patient reported it — accounted for 30%. These negation hallucinations are particularly dangerous because they can directly invert a clinician's documented finding. Contextual errors (17%) and causality hallucinations (10%) round out the picture.
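To see why a 1.47% per-sentence rate is less comforting than it sounds, it helps to translate the percentages into approximate counts and a per-note figure. The sketch below derives everything from the rounded percentages quoted above, so the counts are approximations and not the study's exact tallies. The per-note estimate additionally assumes hallucinations occur independently across sentences, which is a simplifying assumption and not a finding of the study.

```python
SENTENCES = 12_999          # clinician-annotated sentences
NOTES = 450                 # AI-generated clinical notes
HALLUCINATION_RATE = 0.0147 # per sentence
MAJOR_SHARE = 0.44          # share of hallucinations classified as major
TYPE_SHARES = {
    "fabrication": 0.43,
    "negation": 0.30,
    "contextual": 0.17,
    "causality": 0.10,
}

# Approximate counts implied by the reported percentages.
hallucinated = SENTENCES * HALLUCINATION_RATE
major = hallucinated * MAJOR_SHARE
by_type = {name: hallucinated * share for name, share in TYPE_SHARES.items()}

# ASSUMPTION: sentences are independent, so the chance that a note of
# average length contains at least one hallucination is 1 - (1 - p)^n.
sentences_per_note = SENTENCES / NOTES
p_note_affected = 1 - (1 - HALLUCINATION_RATE) ** sentences_per_note

print(f"Hallucinated sentences: ~{hallucinated:.0f} (major: ~{major:.0f})")
for name, count in by_type.items():
    print(f"  {name}: ~{count:.0f}")
print(f"Average sentences per note: {sentences_per_note:.1f}")
print(f"Chance a note has at least one hallucination: {p_note_affected:.0%}")
```

Under these assumptions, roughly 191 sentences were hallucinated, about 84 of them major, and around one note in three would contain at least one hallucinated sentence.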

Hallucinations in clinical notes appeared most commonly in the Planning section (21% of major hallucinations), followed by Assessment (10.5%) and Symptoms (5.2%). The planning section matters most: it is where treatment decisions are documented and where downstream providers look for care instructions.

Earlier clinical vignette studies showed AI models amplify errors in a related way. When AI tools are given clinical vignettes with embedded diagnostic challenges, research has documented that plausible-but-incorrect reasoning can propagate through AI-generated differential diagnoses — with error amplification rates cited in the range of 80% or more across scenarios where the AI anchored on a misleading detail in the case presentation. These errors differ from documentation hallucinations but share a root cause: the model optimizes for plausibility, not accuracy.

The Asgari group also showed something more hopeful: prompt engineering and workflow optimization can substantially reduce hallucination rates. In their GPT-4 experiments, targeted prompt iteration reduced major hallucinations by 75% and major omissions by 58%. The baseline problem is real, but it is not immutable: it responds to structured interventions.

The Clinician Perspective: What Surveys Show

The quantitative studies above describe laboratory conditions. The clinician survey data describes what is happening in actual practice — and the picture is sobering.

A 2025 global survey of 70 clinicians across 15 specialties, published as part of a systematic analysis of medical AI hallucinations, found that 91.8% had personally encountered medical AI hallucinations while using AI tools in their work.^[6] Among those, 84.7% believed the hallucinations they encountered were capable of causing direct patient harm. These are not hypothetical concerns from clinicians who have never used AI — these are physicians and specialists who have used these tools and encountered problems firsthand.

In the same research, physician audits attributed 64–72% of residual hallucinations to failures in causal or temporal reasoning rather than simple knowledge gaps. This reframes the problem: the solution is not purely about adding more medical data to model training. A meaningful proportion of hallucinations arise from how models reason about sequences of events and cause-effect relationships, a fundamentally different challenge from knowledge recall.

Clinicians surveyed identified three primary causes of hallucinations, with roughly equal attribution: insufficient training data (51.7%), biased datasets (51.7%), and architectural limitations of current transformer models (50.0%). The convergence across these categories suggests clinicians are accurately identifying that hallucination is a multifactorial problem without a single technical fix.

The regulatory response confirms these concerns. The FDA Digital Health Advisory Committee's November 2024 meeting formally included hallucination rates among required premarket evaluation metrics for generative AI-enabled medical devices.^[9] The Committee called for postmarket monitoring specifically targeting hallucination detection and adverse events, and reaffirmed that maintaining a human-in-the-loop is essential to patient safety in any clinical AI deployment.

Why Hallucination Rates Vary: Training Data, Task Type, and Prompt Format

Not all AI tasks produce equivalent hallucination risk. Understanding why rates differ so dramatically — 1.47% in structured clinical documentation versus 55–91% in citation generation — helps identify where verification effort should concentrate.

The core explanation is the difference between reasoning tasks and recall tasks. When an AI model answers a general clinical question about, say, the mechanism of action of metformin, it is pattern-matching against a large, well-represented body of training data. Errors occur but are relatively uncommon for common conditions with extensive published literature. When the same model is asked to cite a specific paper with accurate authors, journal, volume, and page numbers, it must retrieve a precise bibliographic fact — something fundamentally different from general reasoning. Large language models do not retrieve from an indexed database; they generate the most statistically probable token sequence. Fabricated citations are the result.
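A toy illustration may make the recall problem concrete. The sketch below is not a model of how an LLM works. It only shows that when real components (authors, journals, years) are recombined freely, most plausible-looking combinations correspond to no actual paper. The records are hypothetical placeholders, not real citations.

```python
from itertools import product

# Hypothetical placeholder records standing in for real, indexed papers.
REAL_PAPERS = [
    ("Author A", "Journal W", 2019),
    ("Author B", "Journal X", 2020),
    ("Author C", "Journal Y", 2021),
    ("Author D", "Journal Z", 2022),
]

# Every individual component below is "real" in the sense that it
# appears in at least one genuine record.
authors = sorted({author for author, _, _ in REAL_PAPERS})
journals = sorted({journal for _, journal, _ in REAL_PAPERS})
years = sorted({year for _, _, year in REAL_PAPERS})

real = set(REAL_PAPERS)
candidates = list(product(authors, journals, years))
fabricated = [combo for combo in candidates if combo not in real]

print(f"{len(fabricated)} of {len(candidates)} plausible combinations match no real paper")
```

With four records, 60 of the 64 possible author-journal-year combinations are fabrications even though each component is genuine, which is why component-level plausibility says little about whether a citation exists.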

Training data exposure amplifies this gap. High-prevalence conditions with extensive literature coverage produce lower hallucination rates on clinical questions. Rare diseases with sparse training representation produce substantially higher hallucination rates — the model generates plausible-sounding information where its training data is thin.

Prompt format also affects hallucination risk in ways that have practical implications. Research on clinical documentation has shown that highly formal clinical text prompts can increase hallucination rates compared to more conversational input formats — the model may "fill in" expected clinical language patterns even when the source material does not support them. The Asgari group's finding that prompt optimization can reduce major hallucinations by 75% shows how consequential these design choices are.

The Lancet systematic review published in 2026, analyzing 97.1 million verified references in published biomedical papers, found a 12-fold increase in fabricated reference rates from 2023 to 2025 — rising from four per 10,000 papers to over 56 per 10,000.^[7] Review articles had fabrication rates 57% higher than other paper types. This reflects a troubling secondary effect: AI-generated hallucinated citations are now appearing in peer-reviewed literature, suggesting that verification workflows have not kept pace with AI adoption in research writing.

What This Means: Where AI Is Reliable vs. Where Verification Is Essential

The evidence reviewed here does not support a blanket conclusion that medical AI is unreliable. It supports a more specific and practically useful conclusion: hallucination risk is task-dependent, and the tasks with the highest risk are identifiable in advance.

Current evidence shows AI performs reliably enough for supervised use in several medical contexts:

AI Performs Reliably When Used For:

General medical knowledge Q&A about high-prevalence conditions well-represented in training data
Drafting administrative documentation (prior auth letters, referral notes) under physician review before submission
Drug interaction lookups when the AI is grounded in a vetted pharmacological database rather than generating from memory
Structured clinical documentation with human review and workflow-optimized prompts
Differential diagnosis generation as a starting point for clinician reasoning — not a final answer

Verification Is Essential When Using AI For:

Generating bibliographic citations — every reference must be independently verified in PubMed or an indexed database
Summarizing rare-disease literature — hallucination rates are substantially higher where training data is sparse
Producing patient-facing prescription instructions — never transmit without prescriber review of every line
Clinical documentation with negation-type statements (patient does not have X) — specific vulnerability to negation hallucinations
Any clinical reasoning in nuanced presentations where multiple competing diagnoses apply
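
For the negation-type statements listed above, one way a review workflow might prioritize attention is to flag sentences that contain negation cues so the reviewer compares them against the transcript first. The following is a heuristic sketch only: the cue list and the function name `flag_negation_sentences` are assumptions, the sentence splitter is deliberately naive, and nothing here detects hallucinations or replaces full clinician review.

```python
import re

# ASSUMPTION: a small, illustrative cue list; a real workflow would need
# a clinically validated vocabulary.
NEGATION_CUES = re.compile(
    r"\b(no|not|denies|denied|without|negative for|absent|never)\b",
    re.IGNORECASE,
)


def flag_negation_sentences(note: str) -> list[str]:
    """Return the sentences in a note that contain a negation cue."""
    # Naive split on whitespace following sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return [s for s in sentences if NEGATION_CUES.search(s)]


example_note = (
    "Patient reports chest pain. Denies shortness of breath. "
    "No fever. Plan: ECG today."
)
for sentence in flag_negation_sentences(example_note):
    print("REVIEW FIRST:", sentence)
```

Flagged sentences are candidates for closer comparison with the source transcript. Unflagged sentences still require review.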

The distinction between these two categories is not about trust in AI as a concept — it is about matching the tool to the task. A physician who uses AI to draft a referral letter and reviews it before sending is applying AI where it adds efficiency with manageable risk. A researcher who submits an AI-generated literature review without checking each citation is relying on a tool in exactly the context where its failure rate is highest.

Bottom Line

The data on medical AI hallucination is now substantial enough to draw firm conclusions. Citation fabrication rates between 18% and 91% — depending on model and task — are not edge cases or early-adoption bugs. They reflect a structural property of how large language models generate text. GPT-4 is meaningfully better than GPT-3.5. Models will continue improving. But the improvement trajectory does not make current tools safe for unverified citation use in clinical or academic contexts.

The clinical documentation picture is more encouraging. A 1.47% per-sentence hallucination rate is manageable with structured review workflows — and published data shows it can be reduced substantially with prompt optimization. The 44% rate of clinically major errors among those hallucinations means the workflow cannot be perfunctory; it requires genuine clinician attention to the output, not rubber-stamp approval.

The clinician survey data — 91.8% encountering hallucinations, 84.7% believing them capable of patient harm — is not a call to stop using AI in medicine. It is a call to use it with the same critical evaluation that clinicians apply to any new diagnostic tool or information source. Research evidence has never been trusted without verification. AI-generated content should not be the first exception.

Frequently Asked Questions

What is the hallucination rate in medical AI?

Rates vary substantially by task and model. For bibliographic citations, GPT-3.5 fabricates roughly 55% and GPT-4 fabricates 18%, per Walters and Wilder's Scientific Reports analysis.^[1] For systematic review reference generation, Bard reached 91.4% hallucination — the highest published rate for a major model.^[3] For structured clinical documentation, the per-sentence hallucination rate drops to about 1.47%, though 44% of those errors carry clinical significance.^[5]

Why does AI fabricate medical citations?

Large language models generate text by predicting statistically probable next tokens — not by retrieving facts from an indexed database. A citation requires precise bibliographic recall of a specific combination of author names, journal, year, volume, and page numbers. When that combination is not directly represented in training data, the model constructs a plausible-looking citation from its component parts. The result appears credible but may not correspond to any real paper.

Is AI ever reliably accurate in clinical settings?

Yes. Current evidence shows AI performs well on general clinical Q&A about common conditions, administrative documentation drafting under review, and differential diagnosis generation as a starting point. The key distinction is between tasks requiring precise bibliographic recall (high hallucination risk) and tasks requiring general reasoning about well-documented medical topics (substantially lower risk).

How should clinicians handle AI tools given these hallucination rates?

The FDA Digital Health Advisory Committee's guidance and published clinical research both point to the same answer: human-in-the-loop review for every high-stakes output.^[9] Practically, this means checking every AI-generated citation against PubMed before use, reviewing AI clinical documentation sentence by sentence before it enters the medical record, and treating AI output as a draft requiring physician verification — not a final product.

References

1. Walters WH, Wilder EI. "Fabrication and errors in the bibliographic citations generated by ChatGPT." Sci Rep. 2023 Sep 7;13(1):14045. doi: 10.1038/s41598-023-41032-5. pmc.ncbi.nlm.nih.gov/articles/PMC10484980/
2. Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. "High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content." Cureus. 2023 May 19;15(5):e39238. doi: 10.7759/cureus.39238. pubmed.ncbi.nlm.nih.gov/37337480/
3. Chelli M et al. "Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis." J Med Internet Res. 2024 May 22;26:e53164. doi: 10.2196/53164. jmir.org/2024/1/e53164/
4. Lamiaa A et al. "Reference Hallucination Score for Medical Artificial Intelligence." JMIR Med Inform. 2024 Jul 31;12(1):e54345. doi: 10.2196/54345. medinform.jmir.org/2024/1/e54345/
5. Asgari E et al. "A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation." npj Digit Med. 2025 May 13;8:274. doi: 10.1038/s41746-025-01670-7. nature.com/articles/s41746-025-01670-7
6. "Medical Hallucinations in Foundation Models and Their Impact on Healthcare." arXiv preprint 2503.05777. 2025. arxiv.org/pdf/2503.05777.pdf
7. Topaz M et al. [Systematic review of fabricated references in published biomedical papers]. The Lancet. 2026. Reported via CIDRAP: cidrap.umn.edu
8. Gravel J, D'Amours-Gravel M, Osmanlliu E. "Learning to Fake It: Limited Responses and Fabricated References Provided by ChatGPT for Medical Questions." Mayo Clin Proc Digit Health. 2023 Jun 12. pmc.ncbi.nlm.nih.gov/articles/PMC11975740/
9. U.S. Food and Drug Administration. Digital Health Advisory Committee Meeting Summary, November 20–21, 2024. fda.gov/media/184078/download
10. Chelli M et al. "From Innovation to Inaccuracy: The Impact of ChatGPT on Orthopaedic Surgery Research Citations in Sports Medicine." J Orthop Exerc Interv. 2026 Jun 14. journaloei.scholasticahq.com

Parth Bhavsar, MD

Board-Certified Family Medicine Physician · Editor in Chief, TeleDirectMD Health Guides

Dr. Bhavsar founded TeleDirectMD and leads its editorial library. His work on AI in medicine focuses on translating complex research findings into evidence-based guidance for clinicians and patients. He tracks the peer-reviewed literature on AI hallucination rates, diagnostic accuracy, and regulatory developments as part of the When Medical AI Goes Wrong series.
