Code & Cure

#43 - AI Hype vs. Real-World Medicine

Vasanth Sarathy & Laura Hagopian


What if the headline “AI outperformed doctors” is asking the wrong question? When a Harvard emergency triage study makes waves, it’s easy to focus on the most dramatic takeaway. But the real story is more complicated: what did the study actually test, and what parts of emergency medicine did it leave out?

We slow down the hype and take a closer look at what AI can and cannot tell us about clinical decision-making. We unpack how today’s AI excitement fits into a much longer history of bold promises, from the early optimism of the Dartmouth Conference to modern “AI summers” driven by funding, media attention, and novelty. We also explore what an “AI winter” really means, why confidence can collapse quickly, and how today’s ecosystem makes exaggeration easier to spread and harder to correct.

Then we turn to the realities of emergency care. ER triage is not about guessing one diagnosis or producing a neat top-five list. It is about urgency, risk, and judgment under uncertainty: identifying life-threatening possibilities, deciding what tests come next, and determining who needs immediate care, admission, or safe discharge. The conversation also highlights a major limitation of text-only AI evaluations: medical charts are already shaped by human clinicians, meaning the model may be relying on information that required real-world expertise to gather in the first place.

For anyone interested in trustworthy AI in healthcare, medical diagnosis, health misinformation, and the responsible use of large language models in clinical settings, this episode offers a clearer way to think beyond the headline.

References:

Performance of a large language model on the reasoning tasks of a physician
Brodeur et al.
Science (2026)

Did AI really beat ER doctors at ER triage?
Nope. A look at an interesting AI study that has led to some very overhyped headlines.

Kristen Panthagani
You Can Know Things, Substack (2026)

Credits:

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

The Shocking Headline And First Doubts

SPEAKER_00

Can you believe this? AI outperformed doctors in a Harvard trial of emergency triage diagnoses.

SPEAKER_01

Hello, and welcome back to Code and Cure, the podcast where we decode health in the age of AI. My name is Vasanth Sarathy. I'm an AI researcher and cognitive scientist. And I'm here with Laura Hagopian.

SPEAKER_00

I'm an emergency medicine physician. Can you believe it? I can't believe it, actually. I feel like it's maybe a little bit overhyped.

SPEAKER_01

Yeah, I think the point of this podcast was don't believe it.

SPEAKER_00

Be a little bit careful when you see a headline that's like, whoa, oh my gosh.

SPEAKER_01

Yeah. And I think this is an important topic. There's, you know, hype with anything, right? Any new technology, any new product feels magical at the beginning. People are excited. There's a whole novelty aspect to it. Um, but this one has stuck around for a little bit and it's influencing things that are potentially critical and important. And, you know, it's not just a fun little toy that people are excited about.

SPEAKER_00

Um, and for me, the headlines hit kind of close to home, because I'm like, oh, this is my job. Is AI better than me at my job? But I do think there's this piece of, hey, this scientific paper came out. It was really interesting. And I'm glad that we're gonna dig into it later today. But there's also this component of, okay, you know, NPR, Harvard Magazine, all of these places are writing about this article and trying to simplify it, and also trying to create a headline that you're gonna click on. Yeah. And sometimes people just read the headlines. I do that. Sometimes I'm like, oh, that's a cool headline. I don't need any more information. But for this one, I was like, oh my gosh, I'm not sure I believe this headline. Let's read the paper.

SPEAKER_01

Yeah. And you know, I think what's going to be interesting for us as we talk through this is to think about all the players involved, what the incentives are for different people, what drives people to write these different types of explanations of the science that's out there. I think it's fascinating and potentially a little bit scary, but hopefully, you know, we can figure this out.

AI Summers And AI Winters Explained

SPEAKER_00

Um, so when we were prepping for this podcast, you were throwing out terms, Vasanth, like, oh, the AI summer or the winter. And I was like, what are you talking about? But this has happened before, right? Yeah. And so I'd love to hear: what is the summer and winter, other than the regular seasons, and what has happened before?

SPEAKER_01

Yeah. So I want to do a little bit of a history lesson, because AI has a very interesting history. The term AI was coined a long time ago, back in the 50s, and there has always been a sense that we're almost there, that we've all but solved AI. That's always been the case. There was a group of researchers back then, the leading AI researchers of the day, who held a conference at Dartmouth, and what happened there is a great example. This is very famous; most AI people already know it, but for those who are not familiar...

SPEAKER_00

I'm not an AI person. You're gonna have to tell me.

SPEAKER_01

So it's called the Dartmouth Conference, and it's really interesting because during that conference, all the top AI researchers, the best minds thinking about the challenges of AI, decided that computer vision was an intern project for the summer. Solving computer vision as a summer intern project. And when I say computer vision, I mean computers being able to see things and make sense of them. And they really, honest to God, believed that it was a summer intern project.

SPEAKER_00

Like, simple. They thought that was a simple thing.

SPEAKER_01

Simple enough to do quickly. I mean, maybe not literally a summer intern project, but quick enough. Now, that was about 70 years ago, and computer vision is an entire field of research, with tens of thousands of researchers still working hard to figure out how we can get computers to see the way we do. We have achieved great progress and we have all kinds of really cool tools. But back then, that's what those scientists, those people, actually believed was a quickly solvable problem. Around the same era, a similar sort of thing happened when the perceptron was invented. It's the precursor to the neural networks that we all know now, a type of technology that set the stage for what would become future neural networks. In 1958, there was a New York Times article whose headline was "Electronic 'Brain' Teaches Itself," which claimed that the Navy had demonstrated this perceptron and that it was the first non-living mechanism able to perceive, recognize, and identify its surroundings without human training or control. That's from the article; I'm reading this from the article. And even the scientist behind it, Dr. Rosenblatt, came out and said that just one more step of development, a difficult step, was needed for the device to hear speech in one language and instantly translate it to speech or writing in another language.

SPEAKER_00

Okay, that did not, like, happen the next year, for all our listeners.

SPEAKER_01

Right, right, exactly. So this notion of AI hype has always been there, and there have been repeated rounds of it. AI researchers like to talk about this in terms of seasons. An AI summer is when the AI hype is at its peak. All the money and funding for AI research is pouring in, researchers are able to work on things, people are talking about it, there are New York Times articles about it, and so on. And in the past there have been multiple AI summers, and those have been followed by AI winters.

SPEAKER_00

So we're like in an AI summer right now, right? I would think so. Like, oh, everyone's using LLMs, it's in the news all the time, tons of stuff is getting funded, including by the government, etc.

SPEAKER_01

Yes. But those AI summers in the past have been followed by AI winters. And winter is coming, right? An AI winter was basically when the world realized that the AI systems couldn't do what everyone had been promising, and that given the state of the technology at the time, they wouldn't be able to without a big, big change. And so the funding quickly dried up, all those promises went away, and AI research just dropped dramatically. And then, you know, over time it would sort of pick back up again and proceed.

SPEAKER_00

And what do the spring and fall seasons do, exactly?

SPEAKER_01

I suppose, I suppose, but we just went from summer to winter, you know. And it does happen pretty suddenly; people realize those promises can't be fulfilled.

SPEAKER_00

And so we go from, like, hey, this is magic, it solves all the problems, to, hey, it actually didn't do what we thought it could do.

SPEAKER_01

Now, those times it happened before were potentially different, right? There was no internet during a lot of those previous cycles. Information didn't spread the way it does now. So the hype mechanisms are different, the follow-throughs are different, and the funding cycles are all different. So I'm not saying that a winter is going to follow this summer. But I will say one thing, which is that people are arguing about it. And there is an issue: a recent study came out about OpenAI, which said hundreds of billions of dollars have been spent and they haven't yet returned a profit. And this is ChatGPT. Right. It wasn't enough that they have hundreds of millions of users; they hadn't reached a billion users, and that wasn't enough. And AI is still adopted by a very small percentage of the world. Although it's got all this excitement and potential value, people have yet to see full returns in terms of its actual value. We've talked about this on the podcast: there are still questions of trustworthiness, there are questions of consistency, there are hallucinations; all these things that AI systems do are preventing people from fully taking them on and engaging with them. That said, that doesn't mean we're going to have an AI winter. It could mean, this time around, that it's less of a bubble popping and more of a slow transition from seeing AI as this magical silver bullet that's going to solve everything to seeing it as a narrower tool that can do one or two things really well. So that could happen. But the point is there are still hype cycles, and we're in one right now. There's a massive hype cycle happening right now, telling us everything AI can do. And the article we're going to talk about today does that to some degree.

Why Today’s Hype Cycle Persists

What The Harvard Triage Study Tested

SPEAKER_00

Uh, not the article; it's the news articles about the article that did that, right? The article itself is very interesting, and they actually do a very good job in it of calling out some of the limitations. But when you look at the news headlines about it, it's like, oh, in real-world tests, an AI model did better than ER doctors at diagnosing patients. And, AI outperforms doctors; that's what a new Harvard study shows. And so when you just see the headlines, you're like, oh, do I even need to go to the doctor anymore? Can I just go on ChatGPT and ask? And part of this was that they were testing out a new model. And the new model, you know, generally did do better than older models. But I think what the study actually said is a bit different from what the headlines said it said.

SPEAKER_01

Yeah, exactly. So maybe let's start with the actual study, right? What did the actual study do?

SPEAKER_00

Well, in the study, they compared human experts and AI, you know, LLMs, large language models, and basically looked at how each performed. And they used a set of two internal medicine doctors for the comparison. And they were looking at how good a job each did at diagnosing what a patient had at triage, at the level of the ER physician, and at the level of admission. And this was a subset of patients who were on the sicker side and all got admitted to the hospital.

SPEAKER_01

So these are both diagnostic tasks and management tasks that required some sort of thinking and reasoning.

SPEAKER_00

Absolutely, it required clinical reasoning to do. And so they used a couple of different LLM models, and then they had two internal medicine doctors kind of see what they could figure out. And as these cases went along, from triage to seeing the ER doctor to admission to the hospital, more information becomes available, right? More tests are done, et cetera. So the task was to figure out, okay, what is the list of things that could be going on, and they narrowed that down to just five. You're only allowed five, and then, what's your top diagnosis? And in certain cases, they were like, hey, the model does better. And in other cases, the model did not outperform; it did about the same as prior models, especially when it came to diagnoses that you really don't want to miss, or the landmark cases it was given.

Why ER Triage Is Not Diagnosis

SPEAKER_01

So I guess the question I have is just about the process. I mean, you have experience being an ER doctor; is that experience tracking well here? Because it seems like they were having people diagnose and then requiring them to only give back five. Some of those choices seem kind of arbitrary, and I don't know how closely they tie to real practice. Can you talk a little bit about that?

SPEAKER_00

Yeah. I mean, I think that's one of the problems I have with this paper, because as an ER doctor, I wouldn't limit myself to five. We would come up with differential diagnoses, right? And what that means is, hey, here's a list of what things this could be. So say you showed up to my ER with chest pain. Okay. Your chest pain could be a heart attack. It could be a blood clot in your lung; it could be something terrible. Or maybe you worked out too hard and you strained your chest wall muscle, or maybe you have acid reflux. And I could go on and on and make a giant list of things, some of which are life-threatening, like a trauma with a popped lung or something like that, and some of which are much more benign, like acid reflux. And my goal as an ER doctor, especially at triage, was first to figure out: hey, what are the dangerous things this could be? Those would often be higher up on my list, because I want to rule them out. It doesn't mean they're necessarily more likely, but it means, hey, you showed up to the ER, you think you're having an emergency; how can I rule out, or how can I check, whether you have this really bad thing? Yeah, yeah. Like, do you have risk factors for a blood clot? Do you have a history of hypertension, hyperlipidemia, diabetes, a family history of heart disease? That would make me think you could be having a heart attack. So those are the things that end up at the top of my list, because I want to rule those out.

SPEAKER_01

Yeah.

SPEAKER_00

And then I'm figuring out, okay, how sick is this person? If your vital signs are abnormal, I'm gonna want to see you sooner and figure out what's going on. Like, oh, could you be having an aortic dissection? Versus, are things looking more benign? And so part of this is, hey, do we think you need to be admitted? Do you need to be seen quickly? Can you wait in the waiting room for a couple hours? Do you need to go to the ICU? Do you need surgery? So a lot of the ER doctor's job is not necessarily homing in on a single diagnosis, or even five diagnoses, right away, right? It's figuring out, hey, is this person sick or not sick? What tests do we want to run on them to figure out what's going on? Right. What sort of management do we want to do for them in this moment, and what's their disposition going to be? Are they gonna stay? Are they gonna go home? Do we not know yet? Et cetera. And I think that's a really important piece here, because they were saying, oh, well, the AI models are better at ER triage, at coming up with the diagnosis. And my argument would be: that's not the goal of ER triage. You're not supposed to come up with a single diagnosis, or even just five, right there. Yeah. You're supposed to figure out, hey, what are we gonna do next for this patient? Should we be worried about them or not? Where do we think they're going? What tests do we want to run? And oftentimes, and this was part of my training, it will be skewed toward the more dangerous stuff that you want to rule out, even if it's less likely, because you don't want to miss that. You don't want to send a heart attack home.

SPEAKER_01

Right, right. So your top five might be a different top five from what the study would expect. And that makes sense, because you're focused on a different goal. Your goal is ruling out the bad things, but also figuring out what to do next.

Text-Only Data And The Relevance Problem

SPEAKER_00

Yeah, exactly. And they did use information from the health record when they were asking both the internal medicine physicians, and they didn't actually use ER doctors for this piece, and the AI models; they took clinical data from the health record. So there are some interesting pieces of this, because they only took text-based stuff. Yes. Right? So when you're in the ER, you get to see the person. See the person. You talk to them. Yeah, yeah. You get to hear how they're acting, right? If someone's in a ton of pain, they might be groaning, and maybe that's not in the chart. Or you would get visual information, like imaging studies. You know, I joke around about this all the time, but there are certain smells that you get in the ER where you're like, oh, I know that person's having a GI bleed. I can smell it. Yeah. And so there's limited information that's available. And the information that is available, someone had to put into the chart. Right. Right? So, someone had to say, okay, everyone gets vital signs. We all know someone's blood pressure, their heart rate, their temperature. That's done at triage. But someone had to say, oh, I'm gonna order these labs. Or, here's the history of the present illness: a 56-year-old man with a history of diabetes, hypertension, and hyperlipidemia came in with chest pain. It's located on the left side of the chest. It's a pressure-like sensation, it's worse with exertion. It radiates into the right shoulder and is associated with shortness of breath. Yes. That information had to be put in; somebody had to ask the patient those questions. They may or may not have volunteered that on their own, right? And then someone had to say, okay, that is the relevant information to put into the chart.

SPEAKER_01

The problem of relevance is huge, because there's a whole host of other things that are potentially irrelevant that have been smartly excluded by somebody, right? And any intelligent system needs to account for that, needs to account for the fact that things have been left out and only the relevant and important things have been focused on.

SPEAKER_00

Yeah, exactly.

SPEAKER_01

That's actually also a historical AI problem. It's called the relevance problem: deciding what we need to take into account and what we don't. So the reasoning step is almost the easy step, because by then you already have all the relevant information. And now, given that, you do some analysis to figure out how to compose that relevant information into something useful afterwards, right?

SPEAKER_00

And I would argue that your clinician is reasoning as they ask those questions of the patient or decide what tests to order. If I order a troponin or a D-dimer, or decide to order an EKG and a chest X-ray for someone with chest pain, lab tests, whatever, there's reasoning that went into that. There's a decision tree in my head about what I'm going to do next. And so that is a really important piece that the AI did not do as part of the study.

SPEAKER_01

Yeah.

SPEAKER_00

So it's important just to call out that it's taking information that was basically curated already. Yes. And using that.

How Incentives Turn Papers Into Hype

SPEAKER_01

Yeah. And just to be clear, the paper, and this is an interesting piece, right, the paper itself has a whole section on limitations. They talk about it being text-only. They talk about what models they used, and how the tasks were limited in whatever fashion. So they list out a bunch of limitations.

SPEAKER_00

Um, and they actually say, hey, this is just a proof of concept. Probably in part because they only compared it to two internal medicine physicians. Right, right. So it's like a proof of concept, yeah. This worked, and it was pretty limited. Yes. And I agree with that assessment. I think it's a really cool thing that the AI models have improved, and what they found was very interesting. Yeah. And it was very limited at the same time.

SPEAKER_01

And I want to say the motivation for the paper was to contrast with the whole body of work right now, and the news cycles, about how AI systems are passing medical exams. And their point was to say, that's all well and good, but we have this approach, going back to the 50s, to doing evaluations of doctors and so on. Why not use that with LLMs? So that's kind of where the framing of the paper started, right?

SPEAKER_00

Which totally makes sense. Your ability to score well on a licensing exam and your ability to do things in the patient setting are two very different things. Yes. And so it's a very appropriate thing to have studied. The output of it was very interesting and limited, which they do call out. I think the main issue is that when it went into news headlines, that's where the hype came in. And it was like, oh my gosh, AI outperforms doctors in Harvard trial of emergency triage diagnoses. In fact, NPR, I think, actually ended up changing their headline after getting some feedback from physicians, like, hey, they didn't actually compare it to ER doctors. Yeah. They compared it to two internal medicine physicians, who do different things.

SPEAKER_01

Yeah. And what was the original? Yeah, the original said it did better than ER doctors.

SPEAKER_00

So they had to correct it, because the comparison was actually to internal medicine physicians, who have a different job than the ER, right? Yeah. And they're not, for example, trying to diagnose people with the limited information that you get at triage. That's not their job. Right. And so you need to make sure that you're comparing the right job.

SPEAKER_01

Right. Right. But yeah, this is an issue across the board in science. You have a bunch of researchers doing their best to do the right research. So the study is good, maybe the results are solid, maybe statistically everything works out, the math works out, maybe the experimental design is good, and they lay out all the limitations. They're being fully honest. We saw in the example of the perceptron that even scientists can sometimes be so excited about their work that they're seeing the future, but they don't have a sense for how long things will take. So there are some exaggerations of claims made even at that level. But nowadays, it's much more common to see papers that are more realistic about what they are actually proposing and have a whole bunch of caveats in them. But what ends up happening is that researchers are incentivized to publish and to get more citations, which means promoting their publications. So they are going to simplify it themselves. And you'll see this a lot on Twitter, for example, or X, where people are posting their own articles, or articles by their students, and summarizing the main findings in two or three bullets, which often ignore all the limitations, because that's not part of the messaging; they need people to read these papers and cite them and so on. So that's one incentive that's driving them. Then you have the university, or the institution, that the researchers are working with. Now they have these researchers with a paper at a top journal or a top conference, and they want to say to the world, hey, we are people who do high-quality research. And they have a bunch of science communicators who write the official press releases from the university.
And those, oftentimes, are again simplified for a larger audience, often for donors, often for governmental institutions that are going to provide funding in the future. They'll talk about the article, and they might list some caveats, but now it's already shortened and simplified. So it's already going to have some aspects of hype in there, because they want people to actually get excited by this. And then you have news outlets, NPR, the New York Times, whatever else, that are picking up on those press releases. And they also have science communicators who are trying to do their job, but are now twice removed from the original research itself. And they may or may not have had conversations with the researchers. I don't want to suggest that their journalism is lacking or anything. But I want to say that, in the effort of getting their articles out, they're going to talk about the promise and the excitement of this technology. Or, on the flip side, talk about all the bad things that can happen, right? So there's also that. There are papers written about all the bad things, sort of doomsday papers, and they have the same problem in reverse: they're saying all the negative things and not identifying what actually was promising. So you have this chain of exaggerations that happens. And by the time it gets to, you know, a Facebook feed, you're reading a very different version, without any of the caveats, of highly overgeneralized scientific research.

SPEAKER_00

And I do think there's a component of, hey, this headline is something we want people to click on. So even if the limitations, for example, are called out in the article itself, your headline is meant to get you to read the story. Now, some people just read the headline and move on, right? But it's meant to be a little bit of clickbait. And when you think about things being posted on social media, no one wants to read a whole long paper. They want to know, hey, what did this paper say? And the paper did say some of the things that are in the articles, or in the little bites of information, but it's not the whole story. And it's hard, because you and I went back and read the whole article, and we were like, hey, this is where it got overhyped, and how, and why. But I think it's really hard, when you're looking at headlines like this, to ask: what does it actually say versus what is the hype? You may need experts to actually interpret that for you; I probably would in other realms. Yep. And so I think that makes it really difficult, because it's like, hey, this is trusted information, this is science, this is a scientific reporter who's putting it out there. And it is exciting, but that excitement has to be tamed a little bit when you understand what the limitations are.

Misinformation Risks And Smarter Sharing

SPEAKER_01

And I'll take it one step further. All of this information, for example, the original paper, the press release from the university, and the other news sources, is public now and being used to train our LLMs. So going forward, if you have a chatbot and you start asking questions about this, the answers are already diluted with not just the original paper but all the other exaggerations that come with it. So are you getting the right answer when you ask it that sort of question? Are you getting the appropriate caveats? I don't know. I think those are open questions. And so you now have a data set that is diluted, at some level, with exaggerations. Again, even if none of the actors along the way are trying to make things worse or be intentionally bad in any way, you still have this problem because of the existing incentives of the system, right? The way the system is set up.

SPEAKER_00

Yeah, we all want something that's exciting and fun and interesting and new. And here's a great use case, and that's very AI summer of us, I guess. But I think the limitations in this study are really important, and so is the idea that it was a proof of concept more than anything else. That has to be taken into account at the same time, or else we're gonna have that AI winter coming, right?

SPEAKER_01

Yeah. So, you know, when you see a news story on Facebook or something, you want to ask yourself for a second: is this really true? Especially before sharing it, right? It's worth thinking about for a second, especially for the ones that seem too good to be true.

SPEAKER_00

Yeah, I think this speaks to, I mean, we've talked about misinformation on the podcast before, and this is that same sort of concept. It's like, okay, before you share something, what are experts saying about this? And how can we do some fact-checking?

SPEAKER_01

Yeah.

SPEAKER_00

Um, and this is an example where the story has truth to it, but there are limitations that really did need to be called out, because it wasn't meant to prove that AI is better than ER doctors. That's not what it was setting out to do. But the headlines make you think otherwise. I do also want to call out that some of the headlines have been changed since doctors wrote in and said, hey, that's not what the study says; you've got to change the headline. And some of the news outlets have. I think that's a good thing; that's a move in the right direction. But I think we all have to be careful about this hype cycle, and we're gonna see it continuing on. Yep. All right. Well, thank you for joining us. We will see you next time on Code and Cure.

unknown

Thank you.