Code & Cure

#28 - How AI Confidence Masks Medical Uncertainty

Vasanth Sarathy & Laura Hagopian

Can you trust a confident answer, especially when your health is on the line?

This episode explores the uneasy relationship between language fluency and medical truth in the age of large language models (LLMs). New research asks these models to rate their own certainty, but the results reveal a troubling mismatch: high confidence doesn’t always mean high accuracy, and in some cases, the least reliable models sound the most sure.

Drawing on her ER experience, Laura illustrates how real clinical care embraces uncertainty—listening, testing, adjusting. Meanwhile, Vasanth breaks down how LLMs generate their fluent responses by predicting the next word, and why their self-reported “confidence” is just more language, not actual evidence.

We contrast AI use in medicine with more structured domains like programming, where feedback is immediate and unambiguous. In healthcare, missing data, patient preferences, and shifting guidelines mean there's rarely a single “right” answer. That’s why fluency can mislead, and why understanding what a model doesn’t know may matter just as much as what it claims.

If you're navigating AI in healthcare, this episode will sharpen your eye for nuance and help you build stronger safeguards. 

Reference: 


Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study
Mahmud Omar et al.
JMIR (2025)

Credits: 

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/


SPEAKER_01:

Imagine asking an AI a medical question and getting a clear, confident answer. No hesitation, no caveats. Today we're unpacking new research that asks: should that confidence make us feel safer or more concerned?

SPEAKER_00:

Hello and welcome to Code and Cure, where we discuss decoding health in the age of AI. My name is Vasanth Sarathy, and I'm an AI researcher and cognitive scientist, and I'm here with Laura Hagopian.

SPEAKER_01:

I'm an emergency medicine physician and I work in digital health. Uncertainty. That's the topic for today. And confidence, right? Yeah, exactly. And I have to tell you, working in the ER, I worked in a field full of uncertainty. The undifferentiated patient came in all the time. Yes. You didn't know what was going on. We had to figure it out.

SPEAKER_00:

Right.

SPEAKER_01:

So it's very interesting when I look at uh LLM outputs now because they all seem so certain. They know the answer. And it's really convincing when you read them.

SPEAKER_00:

Well, that's the thing, right? When you go and ask an expert for something, you want them to give you an answer and to have a certain degree of confidence about it, and you kind of trust them based on that, right? So if you have an expert who is willing to be confident about what they know, and who also knows what they don't know and says so, then you're gonna trust them more and believe their answers more, right? And that's a good thing. That's kind of what we want.

SPEAKER_01:

Well, I'm kind of latching on to that end piece of it, which is like, yeah, you have to know what you don't know. And in the ER, it was still like, hey, we need to figure out a path forward. Say someone came into the ER with abdominal pain. I don't look at them and say, with my x-ray vision, hey, you've got appendicitis.

SPEAKER_00:

Right.

SPEAKER_01:

Right. It's like, okay, let me ask you a bunch of questions. Let's hear your story, let's see what's going on, let's examine you.

SPEAKER_00:

It's an information gathering process. Yeah. Yeah.

SPEAKER_01:

And then I might say, hey, you know, this could be a number of things. Uh, you know, I'm worried about your appendix. Um, it could also be your ovary, this, that, or the other thing, right? And we're gonna do some tests to figure out what's going on. Maybe that includes a pelvic exam, some lab work, some imaging, yeah, whatever it is, so that they know what's coming, even though the diagnosis is uncertain at that time, right?

SPEAKER_00:

Yes, yes. And that's that builds trust and also gets you to the right answer.

SPEAKER_01:

Right. And sometimes there's never an answer, right? Sometimes you have someone who comes in with abdominal pain and all the tests are normal, and then they feel better and they eat some food, and then you send them home and tell them to follow up with their regular doctor.

SPEAKER_00:

Right, right.

SPEAKER_01:

And so we live like medicine lives with uncertainty all the time. And you can be confident about that uncertainty sometimes too. Like it's okay. It's not, it's not a bad thing, it's not a wrong thing that we're uncertain. It has to be communicated.

SPEAKER_00:

Yes. But just to be clear, we're not talking about self-confidence. We're talking about confidence in a particular decision that was made or a particular answer that was given. Yeah. A diagnosis, right? Confidence as in: to what degree do you think that answer is correct, or that's the right way to do things, or that's how it should be done, right?

SPEAKER_01:

And when I read LLM outputs, like the answer is always very confident sounding, right? It's like, oh, here's your answer.

SPEAKER_00:

Yeah.

SPEAKER_01:

Um, you know, you could type something in there, like a 69-year-old man with a history of heart disease came in with abdominal pain and vomiting and a slight fever, and then it's gonna say, I think this person has appendicitis.

SPEAKER_00:

Not only that, it does this thing where you can sort of see at the top, it says "thinking," and then you can expand it and see what it's thinking about, and it writes out all the different steps that it's taking, giving you the impression that it's actually reasoning through all of those things. And I'd take it even further. Sometimes the LLMs will ask you questions, right? Follow-up questions. Well, in order to really understand this, I need to understand this and this. And it comes back and asks you questions, much like what you described before, in terms of getting at more facts, more data, so that you can make a more informed decision.

SPEAKER_01:

And it may come up with a list too, like I was naming off. Sure. Um, you know, it could be these potential diagnoses. But whatever it comes back with, it sounds very confident in that answer. Oh, yeah, yeah. Which makes me feel confident in the answer too. Like if I'm looking something up, and this could be on a regular LLM, or I use a clinical one called OpenEvidence, yeah, and I read the answer, it's like, hey, which approach is better for spinal surgery, anterior plus posterior or posterior alone? And it'll answer it. It'll say which one is better. But there could be nuance there, right? Yes. There could be nuance about, okay, this one might be better in this situation, or here's why you might choose that one. This one has more complications but a higher success rate. So that nuance is almost removed, because you get this binary answer.

SPEAKER_00:

Yeah, you almost want a curious, inquisitive mind at that point to ask, okay, let me really understand whether I'm capturing all the different dimensions here, to give you the best answer. That's the kind of thought process that other humans, informed humans, would have, right? You'd be like... Well, yeah, I'll often click in.

SPEAKER_01:

I'll click into the paper in OpenEvidence and be like, actually, I want to see what it says. Yeah. Because the binary answer is nice, yeah, you should choose answer A over answer B, but in reality, there's nuance there. There always is. It's not an easy decision. These things are complex.

SPEAKER_00:

Yeah.

SPEAKER_01:

And so I want to understand the nuance before I move forward. And I want to be able to explain that to a patient, too, right? Let's say there are two options and neither of them is wrong. Either of them could be fine, but we have to choose between them. I want them to be able to weigh the pros and cons, not hear "yes, this one is better," because that's paternalistic. Yeah. If something is clearly better, then I'd be happy to recommend it. But sometimes there's not an easy answer.

SPEAKER_00:

Yeah, exactly. And this paper took that observation, that AI is confident in the way it says things, and went a step further by actually asking it for confidence numbers.

SPEAKER_01:

Yeah, I thought that was very interesting. So they basically went to a bunch of licensing exams and took questions from there, and they rephrased them using LLMs, obviously. It's like AI on top of AI. Right. And then they had these models answer those questions. And alongside the answer, they said, okay, how confident are you in that answer?

SPEAKER_00:

Oh, and by the way, I believe they also rephrased each question in different ways and then averaged out the confidence scores it gave. Oh, with the rephrasing. With the rephrasing. Yeah, yeah.
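
A minimal Python sketch of what that kind of elicitation protocol could look like, purely for illustration. The ask_model helper is a hypothetical stand-in for whatever chat interface you actually use, and the paper's real prompts and scoring will differ:

from statistics import mean

def elicit_answer_and_confidence(ask_model, paraphrases, correct_answer):
    # For each paraphrase of the same exam question: get an answer, then ask
    # the model to rate its own confidence in that answer (0 to 100).
    records = []
    for question in paraphrases:
        answer = ask_model(
            f"{question}\nAnswer with a single option letter (A-D)."
        ).strip()
        stated = ask_model(
            f"Question: {question}\nYour answer: {answer}\n"
            "On a scale of 0 to 100, how confident are you in that answer? "
            "Reply with a number only."
        )
        try:
            confidence = float(stated.strip())
        except ValueError:
            confidence = None  # the model didn't give a usable number
        records.append((answer == correct_answer, confidence))
    return records

def confidence_gap(records):
    # The kind of gap discussed in the episode: mean confidence on correct
    # answers minus mean confidence on incorrect ones. A small gap means the
    # stated confidence barely separates right from wrong.
    right = [c for ok, c in records if ok and c is not None]
    wrong = [c for ok, c in records if not ok and c is not None]
    return mean(right) - mean(wrong) if right and wrong else None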

SPEAKER_01:

So what was interesting... it's almost like, take a guess. Do you think it was more confident in the correct answers or the incorrect answers?

SPEAKER_00:

I mean, presumably it should be confident in the correct answers, right? The things you believe, you'd be really confident about. I don't know, it could go either way, but I suppose I'd be inclined to think so. I'm pretending to play devil's advocate here. But in reality, that's not what happened, right?

SPEAKER_01:

In reality, even the most accurate models didn't have much variation in confidence between the right and wrong answers. They basically didn't know if they were right or wrong. Or, hold on, what does confidence even mean for an LLM?

SPEAKER_00:

Okay, so we're we're going down that path now, I guess.

SPEAKER_01:

I think we are. Wait, wait, I want to drive home the main point of the paper first, which is that the models had some level of confidence no matter what: whether the answer was right or wrong, their confidence was about the same. And there were actually some models that were less accurate and more confident.

SPEAKER_00:

Wow.

SPEAKER_01:

Which is crazy.

SPEAKER_00:

Yeah.

SPEAKER_01:

But now we can go down the rabbit hole of what it even means to ask that. Like, if you asked me how confident I am that you have appendicitis, I'd probably make up a number. It's hard to put a number on it, but I'd say, oh, you know, maybe more than 50%. But once we do the scan, we'll be able to be more confident about whether that's your diagnosis or not. Whatever. Yes. So here, what does that mean? Does an LLM even understand "how confident are you"?

SPEAKER_00:

So let's do the thing I do often on this podcast, which is to explain an LLM real quick, just so we get everybody on the same page. An LLM essentially completes sentences. Large language model. Yes, large language models are neural network systems that have been trained on all of the internet's text, and they are trained for the specific task of saying what the next word is, given a sequence of words. So what they have learned is: given this sequence of words, what pattern exists that suggests the next word? And don't take the word "word" too seriously here. It's just the next thing, but you can think of it as words, that's fine. The idea is that it's predicting the next word, and then it takes that word, slaps it onto the input, and predicts the next word, and then the next word. And that's how it builds up sentences and builds up everything else. So you could say "she was very hungry, so she really wanted something to..." and then "eat." Eat is the most likely next word, right? But if it said something else, "she wanted to dance," that's unusual and weird, but not crazy. It's not grammatically wrong, and it's something a human could write. A human might be setting up a sentence where she was very hungry, so she wanted to dance; maybe that took her mind off her hunger, or whatever. It could be part of a bigger story. So the point I'm trying to make here is that LLMs are trained to predict the next most likely word across all of human text. It's remarkable that they can do what they're doing right now, and all the different use cases we have for LLMs suggest to me that our human language somehow encodes all of that useful information. Yeah. But they've also been used for questions, medical questions, because they seem to have certain factual knowledge. You ask it what's the capital of France, it's likely going to give you the right answer. It's likely going to say Paris, right? Right. It might even give you some nuance. Maybe it'll say, you know, Paris is the capital now, but some other place was the capital 300 years ago, or it was part of a different empire or something. But the point is, it has some knowledge in there as well. So there's knowledge, and surprising reasoning capabilities to some degree, which we'll talk about in a second. And so it's not surprising that you could ask it diagnosis-type questions, and assuming those patterns exist, it's going to whip out a series of words that look like what one would say when asked a question about medical diagnosis. So, like, should I take Tylenol for something-something? It might say the next thing that someone would say given that question. Now, that "someone" is a weird collection of all of human knowledge. It's not a doctor, it's not an expert or a non-expert. And that's the issue. So now the question is, how is it coming to that conclusion? Well, during training, you create this giant neural network with a bunch of knobs, and the settings of all those knobs are adjusted. When you ask it the medical question, those knobs are not changed.
It just goes through and does a bunch of math and produces the next word, which represents the probability of the next word given the sequence of prior words. That is not reasoning, necessarily. That is not taking into account what you just described when you do a diagnosis, where you go out there, ask a couple of questions, do a couple of labs, think about what that meant, and so on. And even when this LLM is in thinking mode and tells you its own reasoning, it's just spitting out more words. So I've always had this position, and it's a controversial position, not everybody agrees with me on this, but the position is that LLM reasoning is not real reasoning like humans do. Their reasoning has the appearance of it: this is the type of reasoning one would produce given a question about this and this. So it's just producing more words, which then lead to more words, which then lead to more words. So in this context, think about the question of how confident are you?
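
To make that loop concrete, here is a toy Python sketch of next-word prediction. The probability table is invented purely for illustration and stands in for what a real neural network would compute:

TOY_NEXT_WORD_PROBS = {
    "she was very hungry, so she wanted something to": {"eat": 0.92, "drink": 0.05, "dance": 0.01},
    "she was very hungry, so she wanted something to eat": {".": 0.85, "right": 0.05},
}

def next_word(context: str) -> str:
    # Look up the distribution over possible next words and greedily pick the
    # highest-probability one (a real model scores its whole vocabulary here).
    probs = TOY_NEXT_WORD_PROBS.get(context, {"<end>": 1.0})
    return max(probs, key=probs.get)

def generate(prompt: str, max_words: int = 5) -> str:
    # The loop described above: predict a word, slap it onto the input, repeat.
    text = prompt
    for _ in range(max_words):
        word = next_word(text)
        if word == "<end>":
            break
        text = f"{text} {word}"
    return text

print(generate("she was very hungry, so she wanted something to"))
# -> "she was very hungry, so she wanted something to eat ."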

SPEAKER_01:

Is it actually like the confidence in the answer, or is it just like answering a question? Because there's a question in front of it.

SPEAKER_00:

Well, it's not just that, right? It is taking into account what it might have said before. But again, those are all words. They don't represent the thought process that went into figuring out the diagnosis. Right. If you had to receive a lab report, and maybe do something else, and ask the patient something, those are all little pieces of evidence toward a larger hypothesis you might have for a particular disease, right? That hypothesis is in your mind, and you're asking all these different questions to get at different pieces of evidence for and against it, so that you can then say, okay, this suggests it might be the case that you have this disease, or whatever. Right. That kind of process, and the uncertainty and confidence levels that come out of it, is not what the LLM is doing. That's not how it's doing confidence measures. You're saying, look, I got this measurement from the MRI machine, and the MRI machine gets it right most of the time when I've used it, so I can trust it, right? You're taking into account things like trust and all of that other stuff, which it's not doing when it computes its confidence measure. And not only that, that confidence measure is what they call a post hoc confidence measure, which means it has already done whatever it did to arrive at the answer, and then you're asking it for its confidence. So it is not retrieving why it arrived at that answer in the first place. It's just saying more things. Uh-huh. It doesn't have a memory of that. So it's not looking back and saying, oh, that is how I computed this, and therefore my confidence should be this. And even if that whole stack of reasoning was on the page, if it wrote out every single step it took, those are just words once again. When it reports its confidence, it's looking at those words, not repeating the original act of generating those reasoning steps. Does that make sense?

SPEAKER_01:

Yeah, it totally makes sense, but it's wild, because of the question you're asking. You ask it to answer a medical question, yeah, and then you're like, how confident are you in that answer? And it spits out something, but it's not fully related to that first question it answered.

SPEAKER_00:

That's right.

SPEAKER_01:

It's just like, here, here's a number. It's like you can't trust the confidence in the answer at all. And so you can't trust the answer.

SPEAKER_00:

At all. That's correct.

SPEAKER_01:

But the answer sounds good. This is what gets me. The answer is so fluent. And I think one of the things they pointed out in this paper is, hey, if doctors are using these and something sounds really good, then you're just gonna go with it. Because you don't have time. If you're like, I don't have time to do all this research, I don't have time to look it up, and it sounds right, it sounds good, I'm gonna keep going.

SPEAKER_00:

Yes, and there's a serious concern about degradation of expertise, medical expertise. Because the moment you start relying on these more and more, you're not gonna use your own thinking, your own training, you're gonna rely on that less and less. And over time, that can impact the quality of care you provide.

SPEAKER_01:

Yeah, it's interesting, because when I practiced, it was like, you need to know what you don't know. That was a very important line. Yeah. You need to know when you need to call a consult. Yeah. Um, you know, you have a patient with a sore throat in the hallway who's not even able to swallow their saliva. Let's do some imaging. And we need to call the ENT person to take a look down that person's throat and see what's going on. I don't do that. That's something I need a specialist to do. Yeah. Right. Or I'm worried about this ingestion, I'm gonna call poison control. You need to know when to call for backup or when to call a consultation.

SPEAKER_00:

Yeah.

SPEAKER_01:

When something is like beyond your level of expertise. But if you think that you are an expert because you've looked it up and the answer is very clear, yeah. Even if it's wrong, it makes you feel like it's right.

SPEAKER_00:

Yeah. And another way to frame this, I think, is by thinking about metacognition. This is the idea that you can think about your own thinking. And we're able to do that. We're able to think about how we thought about something and then improve it, change it. Maybe we're still learning. Maybe our thought process for arriving at a diagnosis was not quite right, so next time we learn how to do it correctly. But there's a metacognition aspect, which is what you need to derive a confidence value. And there have been studies on LLMs on this point. In fact, there's one that's pretty recent that talked about how LLMs don't know how to say "I don't know." They don't have a good measure of their own level of uncertainty, like we just talked about, but they don't even admit it. Well, they're trying to be helpful, right?

SPEAKER_01:

And so they think they're not helping if they don't give you an answer.

SPEAKER_00:

Yeah.

SPEAKER_01:

And like a clear, distinct answer. But the problem is that in medicine, there aren't always clear, distinct answers.

SPEAKER_00:

That's right.

SPEAKER_01:

And that nuance, being able to understand that nuance and communicate that nuance, is important. There are times when there is a very clear answer, like, oh, you're having a heart attack, we need to take you to the cath lab and put a stent in, right? Like, okay, you know, case closed, done. But there are lots of times where there's nuance: we don't know the answer, so what's the next step? Or there could be different ways to treat something. And there needs to be a discussion around that. We need to be okay with that uncertainty. And it's like LLMs are not okay with that. They need to have that confidence, they need to display it. And then when you read something that feels so, I don't know, binary, confident, decided on, you feel like you want to follow it. This is, by the way, the same tactic we've talked about with dis- and misinformation before, right? A lot of times they use these same tactics, where it's like, oh, we have the answer. This is the answer. And it makes it sound really good. Yeah. But that doesn't mean it is good.

SPEAKER_00:

Yes, exactly. Yeah, it's really a hard problem. I also think it's a problem of what LLMs should and shouldn't be used for. I mean, in domains like coding, programming, it's very precise, because your code runs or it doesn't. And if you create test cases, then your code passes those tests or not. It's very precise. So you can have some degree of confidence or overconfidence and it doesn't matter, because at the end of the day the LLM has to produce code that works, right? And so it works really well in that domain. But there are grayer domains like medicine and law and these other places where there's a human element, there are unknown factors, there is information that isn't available yet unless it's asked for. And so the person doing the analysis should think about what they need to make a better determination. That's research, at some level. If you can do that effectively, and then assign confidence because you can think about how you did it, and you know where the weakest links are and can identify and quantify them, then that's fine. But that's not always the case, and asking an LLM for confidence numbers in those gray areas doesn't get you that. There's some work on quantifying certain aspects of uncertainty as well, which goes back to the fact that the LLM, like I said, produces one word at a time, right? What it actually produces is not the next word; it produces what they call a probability distribution over all possible next words. So there's a vocabulary it can draw from, and it creates a ranking over it based on how likely each word is to come next. So, "she was hungry, so she will eat," right? That's the example we had. "Eat" would have, like, a 99% chance of being the next word, but all the other words have some probability too. And people have used that number as a way to say, I'm more confident that that's the answer. But again, you can see the problem with that, right? That is purely based on the pattern of language used before, suggesting that this type of language is going to follow next. That's got nothing to do with the content of what's being said.
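
Here is a small Python sketch, with made-up numbers, of that token-level probability: raw scores get turned into a distribution over possible next words, and the top word's probability is sometimes read as "confidence":

import math

def softmax(scores: dict) -> dict:
    # Turn raw scores (logits) into a probability distribution that sums to 1.
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Fictional logits over a tiny "vocabulary" for the next word after
# "she was hungry, so she will ...":
logits = {"eat": 7.2, "drink": 4.1, "sleep": 2.0, "dance": 1.3}
probs = softmax(logits)
top_word = max(probs, key=probs.get)
print(top_word, round(probs[top_word], 3))  # "eat" with roughly 0.95

# That 0.95 reflects how sentences like this usually continue in the training
# text. It says nothing about whether an underlying medical claim was checked
# against evidence, which is why reading it as clinical confidence is shaky.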

SPEAKER_01:

The accuracy.

SPEAKER_00:

Yes, or the content of what's being said. Yeah. Right? I mean, again, this is why it's controversial, because some might argue that the LLM is in fact thinking about the content, otherwise it wouldn't be able to predict the next word correctly in the first place. But, you know, it's still unknown what the LLM is actually "thinking," because no one can right now effectively scrutinize the numbers inside. There is some work on that too, but no one really understands where ideas, concepts, knowledge, all of that, lives inside the LLM's brain. And just asking it for confidence numbers, I think, is really not going to do you any good.

SPEAKER_01:

So going back to your initial comment, which is that we need to make sure we use these things judiciously and understand what the boundaries are. It's very interesting, because we're now at the point where the companies behind these LLMs are releasing health models, right? Like ChatGPT Health is coming out. I'm on the wait list. Are you on the wait list? I am on the wait list. Okay, we don't have access yet. But now you're talking about getting the same information in front of patients. Yes. The same very confident information that may or may not be true.

SPEAKER_00:

Yeah, yeah. And the claim is there are guardrails they're setting up for this sort of thing, or questions that it won't be answering and questions that it will be answering. I don't know. It's going to be interesting to see what happens with that. But just don't ask it how confident it is about something. Because you will get an answer, for sure. But there's no reason to think that answer actually corresponds to what you're looking for, in terms of how confident it should be in how it arrived at that answer.

SPEAKER_01:

Yeah. Wild. It's wild. Um, all right. Well, I mean, I think this idea of confidence and this concept of uncertainty are really important. Yes, both in terms of LLMs, but in health and medicine in general. Yeah. I think we need to be okay with some degree of uncertainty.

SPEAKER_00:

Yeah.

SPEAKER_01:

And whenever something feels like, oh, it's certain, it's definite, we need to take a step back and be like, but is it?

SPEAKER_00:

Yeah.

SPEAKER_01:

Are we sure about that? How certain are we? How confident are we in that answer? Yes. Because there's usually a lot more when you start to dig underneath. Yes. And of course, that makes it more work. It makes it more complicated. But these are complicated topics. That's why people go to school for years and years for these things. Yes. Sometimes there's an easy answer. I love those cases. Like you send me a case of nursemaid's elbow, and I pop it back in and you're on your way, and everything is hunky-dory. But on the flip side, more often there are cases that are complex. Yeah. And the answers require nuance and uncertainty. And it doesn't mean you don't make decisions or push forward.

SPEAKER_00:

No, no, but you want to have explanation, and you want to have accountability, most importantly. After the fact, if something goes wrong, at least you have a trace of how you arrived at that decision, so you can change it the next time, or be accountable for the decision you made.

SPEAKER_01:

Yeah. I mean, I think back to some of the case studies I read in the New England Journal of Medicine. It'll be like, oh, it was the patient's third presentation, because they had something weird going on. And the first two times they had a small workup, and then the workup got bigger and bigger. That was common in the ER, where you say, oh, if someone's bounced back a second time or a third time, now we need to do more, because something's not clicking, something didn't get figured out that first time. That was part of my training, and it always sticks in my brain. Yeah. Because it's not always an easy and straightforward answer, as much as we might want it to be.

SPEAKER_00:

Yes. Yes.

SPEAKER_01:

As much as LLMs tell us it is. Yes. Yeah. All right. Well, we can wrap there. I can stay confident that I will have a job, but I am very curious to see how confident I will feel after we get to try ChatGPT Health. Thank you for joining us. We'll see you next time on Code and Cure.