Key Takeaways
- A 2026 Mount Sinai study found that ChatGPT Health under-triaged 52% of genuine emergencies, including cases of impending respiratory failure.[1]
- AI symptom checkers scored below 40% accuracy on difficult ER cases, missing 3 pneumonia diagnoses because patients lacked fever.[2]
- In one asthma scenario, the AI identified respiratory failure warning signs in its own explanation — then still advised the patient to wait rather than seek emergency care.[1]
- When family members minimized a patient's symptoms, AI triage shifted toward less urgent care with an anchoring bias odds ratio of 11.7.[1]
- Finnish data showed symptom checkers over-triaged respiratory infections 44% of the time, sending patients to unnecessary visits.[5]
Introduction: The Promise and the Problem
Millions of people now type their symptoms into AI chatbots before calling a doctor. Cough, shortness of breath, wheezing — respiratory complaints are among the most common reasons someone opens a symptom checker at 2 a.m. You're worried, it's the middle of the night, and a chatbot gives you an answer in seconds.
But what kind of answer are you getting? In my practice, I've had patients arrive with printouts from AI chatbots telling them their breathing trouble was "likely viral" and would "resolve on its own." Some had pneumonia. One had early signs of a COPD exacerbation that needed immediate intervention.
The real question isn't whether AI can provide health information. It can. The question is whether it can safely tell you when your breathing problem is an emergency. I reviewed the published data. The short answer: it can't, at least not yet.
What AI Gets Right
Before I get into the problems, let me be fair. AI triage tools do some things well.
When the clinical scenario is textbook — classic stroke symptoms, anaphylaxis with throat swelling, crushing chest pain — AI chatbots reliably flag these as emergencies. The Mount Sinai study testing ChatGPT Health across 60 clinical vignettes confirmed this: the system correctly identified straightforward emergencies like stroke and severe allergic reactions.[1]
A UK pilot study published in Cureus compared ChatGPT, Gemini, and the NHS 111 symptom checker across 10 clinical scenarios. All three systems correctly caught all 5 emergency cases, achieving a 90% overall match with the gold-standard triage decisions.[3] That's a solid result for clear-cut presentations.
There's also promising work out of Iceland, where a machine learning model trained on 1,500 primary care records successfully stratified patients with respiratory symptoms by risk level. Low-risk groups in that study had zero pneumonia cases on chest X-ray — meaning the model effectively identified who didn't need imaging or antibiotics.[4]
So AI works when the problem is simple. The trouble starts when it isn't.
Where AI Falls Short — The Respiratory Blind Spots
The most alarming finding comes from the Mount Sinai study published in Nature Medicine in February 2026. Researchers tested ChatGPT Health across 960 interactions spanning 60 clinical scenarios. The headline number: 52% of genuine emergencies were under-triaged — meaning the AI told patients it was safe to wait when they actually needed urgent care.[1]
One respiratory case stands out. The AI was given an asthma scenario with signs of impending respiratory failure. In its own written explanation, the system correctly identified the warning signs. Then it told the patient to "wait and monitor." That disconnect — recognizing danger in its reasoning but failing to act on it in its recommendation — is exactly the kind of error that can cost someone their life.[1]
West Virginia University researchers tested ChatGPT on 30 real emergency room cases and found overall top-3 diagnostic accuracy of 75–80%. But for the 13 most difficult cases, accuracy dropped below 40%. Three of those missed diagnoses were pneumonia — because the patients didn't have fever.[2] That matters, because atypical pneumonia without fever is common, especially in older adults and immunocompromised patients. A trained physician knows to suspect pneumonia based on cough pattern, breathing rate, and chest findings even when the thermometer reads normal. The AI didn't.
The Mount Sinai study also exposed a dangerous pattern called anchoring bias. When the AI received the same clinical scenario but with added context — a family member saying "I don't think it's that serious" — the AI's triage shifted toward less urgent care. The odds ratio was 11.7, meaning the AI was nearly 12 times more likely to downgrade urgency based on a family member's reassurance rather than the clinical facts.[1] In a physician's office, we're trained to recognize when a worried family member is minimizing. An AI chatbot takes the input at face value.
Finnish researchers studying the Omaolo symptom checker found it delivered safe assessments 97.6% of the time — which sounds reassuring until you see the other numbers. Exact triage match with nurse assessments was only 53.7%. And for respiratory tract infections specifically, the system over-triaged 44% of cases, sending patients to medical visits they didn't need.[5] Over-triage isn't dangerous in the same way under-triage is, but it wastes time, costs money, and clogs an already strained healthcare system.
| Study | Key Finding | Respiratory Relevance |
|---|---|---|
| Mount Sinai / Nature Medicine, 2026[1] | 52% of emergencies under-triaged | AI missed impending respiratory failure in asthma |
| WVU / Scientific Reports, 2025[2] | <40% accuracy on difficult cases | 3 pneumonia cases missed due to absent fever |
| UK Pilot / Cureus, 2025[3] | 90% gold-standard match overall | AI correctly caught all 5 emergencies in simple scenarios |
| Finland Omaolo / JMIR, 2024[5] | 97.6% safe but 53.7% exact match | Respiratory infections over-triaged 44% of the time |
| Iceland ML Model / Ann Fam Med, 2023[4] | Effective risk stratification | Low-risk groups had zero pneumonia cases |
| JMIR Primary Care Review, 2026[6] | AUC 0.82–0.94 in controlled settings | No real-world GP data; equity gaps unaddressed |
What This Means for You
If you've typed "shortness of breath" into a chatbot, you're not alone. And I'm not here to tell you never to use AI health tools. They can be useful for looking up general information, preparing questions for a doctor visit, or learning about a condition after you've been diagnosed.
But the data is clear: AI is not safe for making triage decisions about breathing problems. A chatbot that misses half of all emergencies is not a tool you want making the call about whether to go to the ER.
What I tell patients: use AI to prepare for your visit, not to replace it. Look up your symptoms, write down your questions, and then talk to a real physician who can put those symptoms in context. And learn the respiratory red flags that should send you straight to a doctor:
- Difficulty breathing or shortness of breath at rest
- Chest tightness with exertion
- Persistent fever with productive cough
- Worsening asthma despite using your rescue inhaler
- Bluish color around your lips or fingertips
If any of those apply to you, don't wait for a chatbot to tell you what to do. Call your doctor or go to the emergency room.
The Physician's Role in Respiratory Triage
Why does a physician catch what AI misses? Because clinical reasoning involves more than matching symptoms to a list.
When I evaluate a patient with respiratory symptoms, I'm looking at past medical history — asthma, COPD, heart failure. I'm reviewing medications: steroid tapers, recently stopped controller inhalers. I'm watching how they breathe. Are they using accessory muscles? Can they speak in full sentences?
On a video visit, I can see breathing patterns, count respiratory rate, observe skin color, and ask follow-up questions a chatbot wouldn't think to ask: "Is this worse when you lie flat?" or "Did this start after you ran out of your inhaler?" Those details change the triage decision entirely.
A University of Utah analysis of more than 2 million hospital visits found that pneumonia diagnoses are revised more than 50% of the time between admission and discharge.[7] Pneumonia is genuinely hard to diagnose, even for physicians with the patient in front of them. It requires clinical judgment, repeat assessment, and sometimes imaging. An AI chatbot asking yes-or-no questions about fever and cough cannot replicate that process.
A 2026 systematic review in JMIR confirmed what most of us in practice already know: AI triage models show promising accuracy (AUC 0.82–0.94) in controlled research settings, but there is almost no data on how they perform in real-world primary care. The review also flagged a lack of equity-stratified data — we don't know whether these tools perform equally well across different ages, races, and socioeconomic groups.[6]
The difference between "AI says wait" and "your doctor says go to the ER" can be life-saving. Telehealth with a real physician gives you that judgment — delivered through a screen, but grounded in clinical training. That's a world apart from typing symptoms into a chatbot and hoping the algorithm gets it right.
References
1. Ramaswamy A, et al. "ChatGPT Health performance in a structured test of triage recommendations." Nature Medicine. 2026. 960 interactions across 60 vignettes spanning 21 specialties. pubmed.ncbi.nlm.nih.gov
2. West Virginia University. ChatGPT tested on 30 real emergency room cases. Scientific Reports. 2025. 75–80% top-3 accuracy overall, <40% on the most difficult cases; 3 missed pneumonia diagnoses. enews.wvu.edu
3. "Comparison of ChatGPT, Gemini AI, and NHS 111 symptom checker across 10 clinical vignettes." Cureus. 2025. pmc.ncbi.nlm.nih.gov
4. Machine learning model for respiratory symptom triage in Icelandic primary care, trained on 1,500 records. Annals of Family Medicine. 2023. pmc.ncbi.nlm.nih.gov
5. Omaolo electronic symptom checker study (Finland): 97.6% safe assessments, 53.7% exact triage match, 44% over-triage for respiratory tract infections. JMIR Human Factors. 2024. humanfactors.jmir.org
6. "AI Triage in Primary Care" systematic review: AUC 0.82–0.94, limited real-world general practice data, equity gaps unaddressed. JMIR. 2026. pmc.ncbi.nlm.nih.gov
7. University of Utah Health. "Hospital Pneumonia Diagnoses Are Uncertain, Revised More Than Half the Time." 2024. Analysis of >2 million hospital visits. uofuhealth.utah.edu