Code & Cure
Decoding health in the age of AI
Hosted by an AI researcher and a medical doctor, this podcast unpacks how artificial intelligence and emerging technologies are transforming how we understand, measure, and care for our bodies and minds.
Each episode digs into a real-world topic to ask not just what’s new, but what’s true—and what’s at stake as healthcare becomes increasingly data-driven.
If you're curious about how health tech really works—and what it means for your body, your choices, and your future—this podcast is for you.
We’re here to explore ideas—not to diagnose or treat. This podcast doesn’t provide medical advice.
#33 - Patients Don’t Talk Like Textbooks
What if the most confident answer in the room is also the most misleading?
Large language models can ace medical exams, yet falter when faced with a real person’s messy, incomplete story. In this episode, we explore how that gap plays out in one of medicine’s highest-stakes decisions: triage. Drawing on Laura’s experience in emergency medicine and Vasanth’s background in AI research, we unpack a new study where laypeople role-played both routine and high-risk conditions and turned to leading LLMs for advice. The surprising twist? Tiny shifts in phrasing produced opposite recommendations—“rest at home” versus “go to the ER”—revealing how sensitive these systems are to prompts, and how an agreeable tone can drown out critical clinical signals.
We take you inside the exam room to contrast what clinicians actually do. Real diagnosis isn’t a single question and answer—it’s an evolving process. Doctors gather a history that unfolds with each response, test competing hypotheses, and scan for subtle red flags and nonverbal cues that never show up in a chat window. From the ominous “worst headache of my life” to abdominal pain that could signal gallstones—or a heart attack—Laura explains how risk-first thinking and strategic follow-ups shape safe decisions. Meanwhile, Vasanth breaks down how preference-tuned models are trained to satisfy users, not challenge them—and why linguistic confidence can increase even as clinical accuracy declines. The study’s findings are sobering: models struggled to identify key conditions, and their triage decisions were no better than basic symptom checkers.
But this isn’t a story of hype or doom—it’s about design. Reliable medical AI must interrogate before it interprets. That means structured red-flag checks, resistance to user-led anchors like “maybe it’s just stress,” and clear, actionable next steps instead of overwhelming option lists. Calibrated uncertainty, transparent reasoning, and human oversight can transform AI from a risky decider into a valuable assistant.
If you care about digital health, safe triage, and the future of human-AI collaboration in medicine, this conversation offers a grounded look at both the limits—and the real promise—of these tools.
If this episode resonated, follow the show, share it with a colleague, and leave a quick review to help more listeners discover Code and Cure.
Reference:
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Andrew M. Bean et al.
Nature Medicine (2026)
Credits:
Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/
Welcome To Code And Cure
SPEAKER_02: Patients don't present like vignettes. They ramble, hedge, contradict themselves, and that breaks a lot of these so-called smart systems.
SPEAKER_00: Hello, and welcome to Code and Cure, a podcast about decoding health in the age of AI. My name is Vasanth Sarathy. I'm an AI researcher and cognitive scientist, and I'm here with Laura Hagopian.
SPEAKER_02: I'm an emergency medicine physician, and I work in digital health.
SPEAKER_00: So today's topic is one I'm very excited about, because it overlaps with some of my own research: human-LLM, or human-AI, interaction. When we hear about AI in the news, with all its amazing capabilities, we all get pretty excited. It passes licensing exams, legal exams, medical exams; it can do complex math problems. You see all these stories about what an AI system can do, and part of you thinks: whoa, this thing is really smart. It can reason, it can do math, it can learn whatever it needs to learn about a topic and become an expert in medicine, in law, in science, and so on. And while that may be partially true, there's a piece here that we haven't really hit upon, and the paper we're discussing today dives right into it: sure, the model might have the knowledge, but is it using it appropriately? What happens when it's actually engaging with a human being about those topics? In this paper, that question is asked in the medical space, and I'm super excited about it.
SPEAKER_02: Yeah, and from my own perspective, I can tell you: I've taken all those licensing exams, and clinical practice is completely different. I'm not saying I didn't need the medical background or the tests. I did. They're an important piece of training; they ensure you have the appropriate background knowledge. But actually seeing patients is a whole different thing.
SPEAKER_00: That's a whole different beast. Why is that? Can you tell us a little bit more about that?
How Clinicians Build A Differential
SPEAKER_02: Well, people are unpredictable, right? When you get a case on a licensing exam, it provides you with all the details: the age and sex of the person, what they came in with. It will say they presented with right upper quadrant pain in their abdomen, they've had nausea and vomiting, it happens after greasy foods, they've noticed it happening more frequently, this episode didn't go away, and now they have a fever. So you think: oh, they had gallstones, and now maybe their gallbladder is infected. That's all fine and good. But a patient who comes in to see you isn't giving you all that detail. They might say: I think dinner didn't agree with me tonight, I threw up, my stomach's hurting, and that's why I'm here. Then you take their temperature and it's high, and you start to ask more questions. Where is the pain? Have you noticed this kind of pain before? How often have you been getting it? Et cetera. You're piecing together a story. A patient doesn't necessarily come in knowing what all the relevant details are.
SPEAKER_00: Is that something you learn during residency, where you're actually seeing patients and people more experienced than you teach you how to ask the right questions? Where do you learn that skill of inquiry, of figuring out what the issue is?
SPEAKER_02: You start to learn that skill in medical school; you're on the wards in medical school, too. You learn how to take a history, and in your mind you're developing the differential diagnosis: what are the things this could be? If someone comes in with right upper quadrant pain related to meals, could it be acid reflux? Could it be dyspepsia, acid in the stomach? Could it be gallstones? Could it be something else entirely? You're starting to think: here's what this could be. And then you hone in: it's going to be more consistent with acid reflux if they say X, and more consistent with gallstones if they say Y. You're putting that information together and synthesizing it, and you're expanding the history you're taking at the same time.
SPEAKER_00: So when they come in and say they have stomach pain, and then they tell you, hey, maybe it's something I ate last night, maybe it's not that, right? When you hear that, do you say: that's a possibility, but it could also be a red herring?
Sensing The Patient Beyond Words
SPEAKER_02: Absolutely. And guess what? Everyone comes in with stomach pain related to a meal, because you eat multiple times a day. A lot of times people think it's related to a meal, and it might be; in the case of gallstones, it tends to happen after eating greasy foods. Or it could totally not be. Maybe someone just got a stomach bug and it happened to show up after they ate lunch. Because you eat three meals a day plus snacks, they say, oh, it started after lunch. That doesn't mean it was related to lunch. So part of the job is figuring out which details are relevant and which are irrelevant. If someone comes in with pain: How often do you get it? How long does it last? How severe is it? Where is it? You're trying to get all this information so you can figure out what this could be, and therefore what tests you want to order and what to do next. You could have someone presenting with right upper quadrant abdominal pain, and they could be having a heart attack.
SPEAKER_00: So you need to think through those things. But the space of those things is so large, and at the beginning all they give you is a one-liner: I have stomach pain in the right upper quadrant. Which they might not even have said.
SPEAKER_02: They probably don't say it. They probably say: my stomach hurts after the takeout I got tonight.
SPEAKER_00: So then you have to do things like physical exams.
SPEAKER_02: Well, that's the other thing, right? You're looking at the patient and examining the patient. I gave you the example of a heart attack presenting with right upper quadrant pain. But I actually had someone who was put into the hallway as maybe being intoxicated, and she was throwing up.
SPEAKER_01: Yeah.
SPEAKER_02: And I looked at her and said: she's throwing up like she's having a heart attack. I can't even tell you why I said that. I just looked at her and thought, she doesn't look good. She looks gray. I don't know if she's intoxicated or not; they just brought her back, but she looks off. The vomiting sounded wrong. I'm telling you, it's like a sixth sense. I'm not sure what made me say it, except that I've seen a lot of cases.
SPEAKER_01: Yeah.
SPEAKER_02: So I said, let's get an EKG on her. And she was in fact having a heart attack. I can't even pat myself on the back for that one, because I don't know what made me say it. But that's the thing: you see things, you examine people, you look around, and you use your other senses. I used my sense of smell a lot, too. You can tell how bad a diabetic wound is; sometimes I can smell strep throat. There are other pieces of information you take in as a clinician that a patient may not notice or may not find relevant, but that are absolutely important. Say, for example, I had two patients with a sore throat. A sore throat, for the most part, is pharyngitis, often a viral infection or strep.
SPEAKER_00: In my head, if somebody says sore throat, I'm like: oh, you have a cold.
SPEAKER_02: Yes, right. But now I have a patient with a sore throat who's spitting out their own saliva because they can't swallow it. I see that. Does the patient know that's a relevant detail? They might not say it. But I see it, and I see that patient, and I think: that's not good. I've seen a lot of sore throats, and they shouldn't be spitting out their saliva. Something's going on if they can't even swallow. That's not a good sign.
SPEAKER_00: If that patient had called you on the phone, so they weren't present and you couldn't see it, would you know to ask the question? That would be way harder over the phone, I would imagine.
From WebMD To LLMs
SPEAKER_02: It would be harder over the phone. But that's the thing: when people are trained in triage, say you're calling a 24-7 nurse line or the on-call physician, they're going to ask about red flag symptoms. Are any of these things happening? Because if any of them are, you need to go to the ER; otherwise, you can wait until morning to be seen. The provider is asking the questions, and in person the provider is using multiple senses. That doesn't always translate to an untrained chatbot, right?
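As a rough, purely illustrative sketch of what that kind of red-flag gate amounts to, here is a deterministic check in Python. The symptom phrases and dispositions are made up for illustration; they are not from the study or any real triage protocol.

import re

# Purely illustrative red-flag gate: a fixed checklist that runs before
# any further advice. Phrases and rules are made up for illustration.
RED_FLAGS = [
    (r"worst headache (of my life|ever)", "ER now: rule out subarachnoid hemorrhage"),
    (r"headache.*stiff neck|stiff neck.*headache", "ER now: rule out meningitis or brain bleed"),
    (r"chest (pain|pressure)", "ER now: rule out heart attack"),
    (r"(can'?t|cannot) swallow.*saliva", "ER now: possible airway emergency"),
]

def triage_gate(message: str) -> str:
    """Return a forced disposition if any red flag matches, else defer."""
    text = message.lower()
    for pattern, disposition in RED_FLAGS:
        if re.search(pattern, text):
            return disposition
    return "no red flag matched; keep asking questions before advising"

print(triage_gate("I have suddenly developed the worst headache ever"))
# -> ER now: rule out subarachnoid hemorrhage

The point of a hard gate like this is that it fires on known emergencies no matter how casually the symptom is phrased or how agreeable the rest of the conversation has been.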
SPEAKER_00: Well, that's it. And right now we're living in a world where we have access to LLMs. Before LLMs, people would go on WebMD, right? They would type their symptoms into Google, Google would give them links to WebMD or wherever else, and they would read up and try to make their own assessment, not medically informed at all, unless they were doctors. And then they would come to you and say, I think it's this or that; they'd come in and maybe give you more information. They were trying to figure out what was wrong with them. Now you have LLMs, this seemingly smart thing, like I said before, that companies tout as passing medical exams and being knowledgeable about everything. And now you think: wait a minute, maybe I can just ask it, and it will help me figure out what's wrong with me.
The Study Setup And Scenarios
SPEAKER_02: And not only what's wrong with you. In this study, they looked at what's wrong with you and also at your disposition: where should you go? Because if something is an emergency, you want the person sent to the ER. Do they need an ambulance? Do they need to go to the emergency department? To urgent care? Can they go to their regular primary care provider, or is self-care okay?
SPEAKER_01: Yeah.
SPEAKER_02: And I would argue disposition may be the even more important distinction, because once you get to the right place, they can help figure out what's going on. But here's an example from this study, and we can get into the details in a second. Someone said: oh, I have a terrible headache, my neck is stiff. And the model came back and said: it might be a migraine; try resting in a dark, quiet room. When in fact, the diagnosis was a subarachnoid hemorrhage, a bleed in the brain.
SPEAKER_00: It was in no position to make that conclusion.
SPEAKER_02: Especially because if this person stays home and the bleeding worsens, they have a poor outcome: they die, or they have a worsening bleed in their brain. This is a person who needs an intervention. They need a neurosurgeon to go in and try to stop the bleeding; they need to be in the neurologic ICU. So if they're told to stay home because they have a migraine, that is not a good thing.
SPEAKER_00: That's right. So let's get to this paper. What exactly did they do?
SPEAKER_02: So first they created these medical scenarios, because they didn't want actual patients with actual symptoms doing this. They had doctors create very typical medical scenarios. One of them was actually gallstones, like I mentioned earlier.
SPEAKER_00: And so these would be medical scenarios like the ones that might be found on a board exam, right? A medical board exam.
SPEAKER_02: Or just in general life. These were not uncommon things, and they were things you wouldn't necessarily want to miss, either.
SPEAKER_00: That's important, because if it's a common, routine thing, that's something you would expect the LLM to know, right?
SPEAKER_02: Yeah, or a red flag thing, like a subarachnoid hemorrhage, the bleed in the brain causing a headache. That's not super common, but it's something that in emergency medicine you need to know about and rule out.
SPEAKER_01: Yeah.
Same Case, Different Prompts, Opposite Advice
SPEAKER_02: And they would give very specific details to a layperson, say to you. Let's run with the example of the subarachnoid hemorrhage. The scenario gives very specific details of the case: Vasanth, you are playing the part of a 20-year-old male patient who is suddenly experiencing a very severe headache. It developed on Friday night while out at the cinema with friends. And it keeps going: it gives general details about your life, like that you're studying for a degree in electrical engineering, plus additional medical history, like that you don't have any health conditions, no family history, et cetera. It's actually giving you a lot of detail.
SPEAKER_00: So it's a character sheet that you now have, as a person participating in this study.
SPEAKER_02: Exactly. And now you're free to go chat with GPT-4o, Llama, or Command R and tell it what your symptoms are. And, as they found, you may not spout everything off when you type, first of all.
SPEAKER_00: And why would you repeat it all? They're asking you specifically in the study to take that persona on, internalize it, and then just have a conversation about it, right?
SPEAKER_02: And in this example, one participant said to the LLM: I have a terrible headache and my neck is stiff, so I can't look down. In addition to that, the light hurts my eyes. And that person was told: try resting at home for a migraine. Whereas somebody else typed into another LLM: I have suddenly developed the worst headache ever. It came on suddenly. I can't concentrate, and my neck is stiff. Also, I cannot stand the light. What action should I take? It's a pretty similar history; again, no full details in either of them. But the answer came back: seek immediate medical attention. Sudden severe headache with neck stiffness and light sensitivity could indicate meningitis or a brain hemorrhage. Go to the ER or call emergency services. So it's interesting, because it's the same scenario. The way the human presented it to the LLM was slightly different, and they got very different answers back.
SPEAKER_00: Yeah, there might have been slightly more urgency in the second prompt. And bear in mind, these LLMs also have all kinds of guardrails, and I'm putting that in air quotes: mechanisms that artificially flag certain inputs, decline to answer, and redirect you. Maybe that kind of language from the human triggered one of those flags. But the first one didn't, and it was the same exact case, right?
Why LLMs Don’t Ask Follow-Ups
SPEAKER_02: Right. And the way the participants presented it to the LLM was also different: similar, but different. I think that piece is really important, because you can't expect a history to be presented to the LLM the way a physician would write it out. This is a 20-year-old male who presented with sudden onset headache, it was a thunderclap, he is unable to concentrate, et cetera. That's something I might write in my clinical note as the history of present illness. That is not what a layperson is going to come in saying.
SPEAKER_00: Yeah. So first, let's be clear that the subject in this experiment is doing exactly what you would do in a realistic setting: you say what you're experiencing at this moment and leave everything else completely underspecified, because you don't know what's relevant; you're not the doctor. And then the LLM comes back with an answer. I find it very interesting that the LLM didn't follow up with questions. That's the natural thing an expert would do: this is highly underspecified, I need more details, and I'm the expert, so I know what questions to ask to help me diagnose this. The LLM doesn't seem to naturally do that, and to me that's the first issue. It doesn't matter what knowledge it has; it doesn't matter that it passed all the medical exams. If it can't apply that knowledge to a particular situation, then it's useless from a triage or medical-analysis standpoint.
Prompt Sensitivity And Sycophancy
SPEAKER_02: And it's interesting, because sometimes the participants asked pointed questions. That second participant, who said, I have developed the worst headache of my life, I cannot stand the light, et cetera: the last thing they typed was, what action should I take? And that is what the model answered: seek immediate medical attention. It didn't follow up with more questions, but there is this sycophantic piece where it's trying to do exactly what you've asked it to do. It might have needed more information to answer that question well, but it's still doing what you asked it to do.
SPEAKER_00: There might also have been a guardrail, like I said, that says you can't give medical advice and must redirect, and that's what got triggered.
SPEAKER_02: It definitely did give advice, though. And across the study, participants using the models identified relevant conditions in less than 35% of cases. As for disposition, where should you go, there are five options, so there's a 20% chance of getting it right just by guessing, and they got it right in less than 44% of cases. That was no better than people just going online and trying to figure it out with Google or WebMD.
SPEAKER_00: Wow. And it was a pretty big study, right? They had over a thousand participants trying this, the vignettes were all vetted by doctors, and they had a separate set of doctors doing differential diagnoses over them as well.
SPEAKER_02: And there wasn't anything weird in there. When I was reading through some of the cases, it was like: oh, gallstones, okay; a subarachnoid hemorrhage, okay; a kidney stone. When I looked at them, I thought, these are very straightforward. There's nothing particularly weird about them. They were typical.
SPEAKER_00: Textbook cases, right? So what would you have done with that first question, if somebody came in with the presentation we just talked about, where they just said they had a headache? I'd be curious. I don't know that they did this in the study; I don't think they ran users chatting with doctors.
Confidence Isn’t Accuracy
SPEAKER_02: No, they didn't run that. But if someone came in and said, I have suddenly the worst headache ever, my mind is already going to subarachnoid hemorrhage. So now I'm going to ask more questions: Do you get headaches often? Five minutes before the headache, did you have any symptoms at all? I'm trying to understand: was it really sudden onset?
SPEAKER_01: Yeah.
SPEAKER_02: Do you have a family history of any brain bleeds? Because there's actually a familial component to it. So there are a lot of questions you're asking to figure out: am I going down the right route or not?
SPEAKER_00: Just from hearing you, you're also doing something very interesting: you're trying to model the user a little bit. You're trying to figure out how dramatic the person is, what they mean when they say sudden, because those are relative terms and they differ from person to person. You have to figure out: for this person, what do they mean by that? Because that's relevant, right? Whether it's really sudden or not is a question of being precise about this particular person. It's not a numeric value; they didn't come in and say, I had this, and then five seconds later I had that. It's not precise. So once you get into that realm, you have to start thinking: is this person exaggerating a little because that's their nature? Or are they understating things, which is a real possibility too: no, no, I'm fine, I'm totally fine, when they're not. You're actively modeling and trying to figure out who this person is as they communicate.
SPEAKER_02: Mm-hmm. Absolutely. And then you're thinking it through: you want to rule out the bad things. So, let's talk about this neck stiffness. Have you had any fevers? I'm worried about meningitis. You're trying to rule out the bad stuff and understand what's currently going on so you can decide on the next steps.
SPEAKER_01: Yeah.
SPEAKER_02: And that back-and-forth is really important.
Real World Beats Board Exams
SPEAKER_00: Yeah. The other piece I wanted to hit on, beyond the fact that the LLM should be asking questions and figuring out the relevant facts before reaching a conclusion, was the sensitivity of the LLMs to the user's text. We saw that already: in the two examples you gave, the users said pretty much the same thing, slightly different in language and tone, and that made a huge difference. These LLMs are very sensitive to the actual words coming through and what they indicate. And there was one example in the paper I thought was very interesting: users might ask a question of their own, like, hey, is this stress? Am I stressed? It turns out that actually made the LLMs worse. It caused them to go down a path they maybe shouldn't have gone down. As a doctor, just because a patient suggests something doesn't mean you immediately, and this goes back to the sycophantic piece, assume they're right and start thinking about stress too. You might have already considered it, first of all. And second, it's part of the equation, but it isn't going to dramatically change how you view the case.
SPEAKER_02: That's absolutely true. And by the way, I've had people come in with chest pain who ask, but could this just be stress? Because they're hoping it's just stress, and it's actually a heart attack.
SPEAKER_00: Right. And apparently that kind of suggestion steered the LLMs down the wrong path, which is hugely problematic. If the LLM is in fact the expert, able to provide expert knowledge, then this piece where it tries to please you works against that. In this capacity, the LLM should not be set up to please you. It should be set up to get at the facts, figure out what's relevant, and then arrive at the right answer.
Human Cues And Decision Support
SPEAKER_02: And that's true of physicians, too. As a provider, you're in a customer service field, in a way, but not the same way as when I was a waitress. It's not that the customer is always right, because you have to balance that with: I need to medically do the right thing for this person.
SPEAKER_01: Yeah.
SPEAKER_02: And that might not be what they want to hear. They might not want to hear they're having a heart attack, or that they need surgery. They might want to hear that they can just go home and they're fine. And that's not how it works. You can't be sycophantic in medicine. You have to be methodical and figure out what's happening so you can figure out how to treat the patient.
Limits Today, Promise Tomorrow
SPEAKER_00: Yeah. And for those wondering why we keep throwing the word sycophantic around, I think we've talked about it in prior episodes. These LLMs are trained on lots of text to predict the next word: you give them a sequence of words and they continue it. They can do that because they've internalized a huge amount of the language humans have used across the internet. But to keep LLMs from going down crazy paths, a lot of techniques add what's called preference learning; the technical term is reinforcement learning from human feedback, or RLHF. The idea is that some answers the LLM gives are slightly better for us than others, based purely on human preference rather than on the factual quality of the answers. It's usually useful when the task is subjective and you want the more preferred answer or the more preferred line of reasoning. A lot of the big LLMs are trained this way, and the effect is that they tend to be sycophantic, which is a fancy way of saying they try to please you all the time. You'll see this when you use an LLM: you might say something wrong, and it will just agree with you; then you correct it, no, that's wrong, and it says, oh yes, you're right, it is wrong. It will contradict itself back and forth just to please you. That behavior is a consequence of this preference learning. And what we're saying is that in this particular instance, in the medical setting, it just doesn't work.
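To make that concrete: at the heart of RLHF is a reward model trained on pairs of answers that human raters compared. Here is a minimal sketch of that pairwise (Bradley-Terry) preference loss; the names are illustrative, and reward_model stands in for any network that maps a candidate response to a scalar score.

import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry loss: push the preferred answer's score above the rejected one's."""
    r_w = reward_model(preferred)  # score of the answer human raters liked
    r_l = reward_model(rejected)   # score of the answer they liked less
    # -log sigmoid(r_w - r_l) is minimized when the preferred answer outscores the other
    return -F.logsigmoid(r_w - r_l).mean()

Notice what the objective never sees: whether either answer is factually or clinically correct. The gradient only rewards sounding like the response people preferred, which is exactly the opening sycophancy walks through.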
SPEAKER_02: It's not good, because pleasing someone is not the goal here. The goal in this case was to figure out the relevant conditions, the differential diagnosis, what could be going on, and then get the right disposition plan for this patient.
SPEAKER_01: Yeah, yeah.
SPEAKER_02: And that's a totally different ballgame.
SPEAKER_00: Yeah. There's another angle I wanted to take as well. We've talked about the LLM pushing back, asking questions, understanding the facts. We've talked about prompt sensitivity: depending on how the user asks, it provides completely different answers, and that's a problem. A couple of other things: I noticed that in this paper they talked about people anthropomorphizing LLMs, treating them like human specialists.
SPEAKER_02: Like, hey, it sounded really confident, so I believe what it's saying.
SPEAKER_00: But it's not a human. That's the linguistic fluency of these LLMs: from the human-preference standpoint, again, they're trained to sound very confident. And we've seen, I think in another episode we recorded, that confidence can be inversely proportional to accuracy: the cases where the LLM is least accurate can be the ones where it sounds most confident. Which is again problematic, because it goes back to the interaction piece: if the model sounds confident, that sways the user. And this is completely separate from it passing all those medical exams.
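One way to picture that confidence-accuracy gap is a toy calibration check; every number below is made up purely for illustration.

# Toy calibration check with made-up numbers. Bucket answers by the
# confidence the model expressed, then compare with how often those
# answers were actually correct.
answers = [
    # (stated_confidence, was_correct)
    (0.95, False), (0.92, False), (0.90, True),
    (0.60, True),  (0.55, True),  (0.52, False),
]

def bucket_accuracy(samples, lo, hi):
    hits = [ok for conf, ok in samples if lo <= conf < hi]
    return sum(hits) / len(hits) if hits else None

# A calibrated model would be right ~90% of the time in its 0.9+ bucket;
# in this made-up data, the most confident bucket is right only a third of the time.
print("confidence >= 0.9:", bucket_accuracy(answers, 0.90, 1.01))
print("confidence 0.5-0.9:", bucket_accuracy(answers, 0.50, 0.90))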
SPEAKER_02: Well, that's the thing, right? It's so exciting to see these LLMs score really well on medical exams.
SPEAKER_01: Yeah.
SPEAKER_02: It's exciting to see that they can, quote unquote, solve problems. When they're given all the information, it's so much easier. But what we're seeing here is that when you ask a layperson to interact with an LLM, it's not the same. It does not do even close to as well. And that is the real-world scenario, not the licensing exam.
SPEAKER_00: Yeah. And there's a piece here about understanding human intent, which is very implicit in our communication. A lot of things go unsaid between humans and need to be picked up on. We pick up on them when we communicate with each other, and effective chatbot communication needs to pick up on these cues too. In addition to being a good doctor, seeking out the right answers and asking all the questions, it needs to pick up on the cues. One of those cues is that humans are not always great at deciding among options. We saw that here too: the LLM would sometimes come back with a whole host of options for moving forward, and humans were bad decision-makers in the face of that. As a doctor, you might realize that and present just enough information, framed so the person can make a good decision. We keep circling back, but the idea is that the user interaction is very important.
SPEAKER_02: I think that's it. At the end of the day, it's not about how well these LLMs do on a licensing exam. It's about how well they do when they're interacting with real humans, who are unpredictable, who might not provide all the information, who may throw in red herrings. That's what real clinical medicine includes.
SPEAKER_01: Yeah.
SPEAKER_02: And there's a big gap between how well it does on the licensing exam and how well it does interacting with a regular user who may not share all the relevant details. Because that's what medicine is really about: you're trying to piece together what's happening, and nobody's handing it to you on a silver platter. So I think we can end here. I'm glad to report that I will still have a job, especially after this paper. Not that LLMs can't assist us, but there are significant limitations in the human-LLM interaction.
SPEAKER_00: Thank you for joining us.
SPEAKER_02: We'll see you next time on Code and Cure.