Code & Cure

#31 - How Retrieval-Augmented AI Can Verify Clinical Summaries

Vasanth Sarathy & Laura Hagopian

Fluent summaries that cannot prove their claims are a hidden liability in healthcare, quietly eroding clinician trust and wasting time. In this episode, we walk through a practical system that replaces “sounds right” narratives with evidence-backed summaries by pairing retrieval augmented generation with a large language model that serves as a judge. Instead of asking one AI to write and police itself, the work is divided. One model drafts the summary, while another breaks it into atomic claims, retrieves supporting chart excerpts, and issues clear verdicts of supported, not supported, or insufficient, with explanations clinicians can review.

We explain why generic summarization often breaks down in clinical settings and how retrieval augmented generation keeps the model grounded in the patient’s actual record. The conversation digs into subtle but common failure modes, including when a model ignores retrieved evidence, when a sentence mixes correct and incorrect facts, and when wording implies causation that the record does not support. A concrete example brings this to life: a claim that a patient was intubated for septic shock is overturned by operative notes showing intubation for a procedure, with the system flagging the discrepancy and guiding a precise correction. That is not just higher accuracy; it is accountability you can audit later.

We also explore a deeper layer of the problem: argumentation. Clinical care is not just a list of facts, but the relationships between them. By evaluating claims alongside their evidence, surfacing contradictions, and pushing for precise language, the system helps generate summaries that reflect real clinical reasoning rather than confident guessing. The payoff is less time spent chasing errors, more time with patients, and a defensible trail for quality review and compliance.

If you care about chart review, clinical documentation, retrieval augmented generation, and building AI systems clinicians can trust, this episode offers practical takeaways. 

Reference:

Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records
Philip Chung et al. 
NEJM AI (2025)

Credits:

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

SPEAKER_01:

Trust is everything in healthcare. So, how do we build systems that don't just generate clinical text, but actually prove it's true?

SPEAKER_00:

Hello and welcome to Code & Cure, the podcast where we discuss decoding health in the age of AI. My name is Vasanth Sarathy. I'm an AI researcher and cognitive scientist, and I'm here with Laura Hagopian.

SPEAKER_01:

I'm an emergency medicine physician and I work in digital health. Very cool. We're gonna talk about RAG. We're gonna talk about chart review. Chart review. Isn't that so fun?

SPEAKER_00:

Yes. But we'll also talk about RAG.

SPEAKER_01:

I know, I know. Um, well, chart review is maybe not the most fun part of a clinician's job, but it's definitely necessary.

SPEAKER_00:

What's chart review?

SPEAKER_01:

It's like looking back into a patient's chart history, like all their information in the computer and just trying to like get to know them a little bit.

SPEAKER_00:

So this is what a doctor would do at the beginning of their appointment with you. Kind of look back and say, okay, here's where we were. Here's where we left off.

SPEAKER_01:

Yeah. I mean, you could review a chart in a lot of different situations. Like, oh, you're a covering physician and someone called you and they want a refill of their medication. And so you need to understand, hey, what's this person's background? What are their health conditions? What medications are they taking? Or maybe, if you're a primary care provider, this person was in the hospital and they had to have a surgery done, and now they're here for their follow-up appointment. And so I want to understand what happened during the course of their hospital stay. So there's a lot of really important information in there that you need to know. Yeah.

SPEAKER_00:

Yeah.

SPEAKER_01:

And at the same time, you're like, well, I don't want to spend 20 minutes doing it. I want to go and see the patient. And so we've talked about summarization before, but wouldn't it be nice if there was some way to automate this chart review so that you could get a good summary of what happened with the patient, for example, during a hospital stay? Or, if you're a covering physician, to know, hey, what are the basics about this patient? What's the information I need to know? But of course, you don't want false information there.

SPEAKER_00:

Well, that's it, right? So you have AI tools now that can read the charts and produce a summary for you, right? There's all these summarization tools that we talked about before. But then the question is, how do you know what's right and what's not? Because we know for a fact that they make mistakes, right? They hallucinate, which is just a fancy way of saying they make up stuff. They leave things incomplete, they add things that were not there. There are errors, and they change things, right?

SPEAKER_01:

And then that becomes like a part of the person's chart. So then that gets carried forward forever, right? Which is a huge problem.

SPEAKER_00:

Because even if it's not part of the person's chart, it's still providing the wrong summary, and that's being used by the doctor to act on, you know, in whatever capacity afterwards, right?

SPEAKER_01:

So you need the right context in order to craft the right decision for that patient. Right. And there are times where it may not matter as much, but there are times where it could matter a lot more. Yeah. And I think I've said this before, but if I had one false summary, I would just be like, I can't use the summary tool anymore. It's no good, because you can't trust it. Yeah. And so if we're gonna have a summarization tool, which would be super useful, we need to be able to trust the output of that tool. Yeah.

SPEAKER_00:

And the way people tried to do that before this paper was to add prompts to the AI tool to ensure that it only pulled stuff from the charts, that it didn't make up anything else along the way, and that everything it did, it revisited and double-checked before producing the summary. So there are techniques that people apply to make sure that whatever it's producing is not just within reason, but accurate. And so those tools exist, but there are still issues, right?

SPEAKER_01:

Yeah, they're still not perfect. And so they say, hey you, Laura, you're the human in the loop, which I think is not great in this situation. Because you're trying to find the needle in the haystack, and you're not necessarily able to find it every time. You have to be on all the time, on alert, trying to figure out, oh, is this subtle thing wrong? And having to go back and reference it in the chart, which takes more time.

SPEAKER_00:

Yeah, yeah. I was gonna say that's actually more effort than writing the summary yourself. Looking at the charts and writing the summary firsthand is potentially easier than taking an LLM output and then checking whether it's actually correct or wrong, right? Yeah, exactly.

SPEAKER_01:

And I mean, you could also have wrong things and right things together in the same sentence. So now you have to, what, analyze every single word? There's no way. There's no way that a human in the loop is the right way to solve this problem. Yeah.

SPEAKER_00:

And in fact, I just want to take a pause for a second, because people might object to this idea that evaluating an LLM answer is somehow easier or harder than doing all the summarization yourself. And I'm saying that only because I don't know if it's actually easier or not; I don't know what the actual data is on that. But evaluating the quality of an LLM answer requires not only checking all the facts, but also not being swayed by the natural linguistic fluency of the LLM itself. It's very easy, when you read something that looks very convincing, to buy into it first, and then challenging that becomes really hard for any human to do. And so that's why it's a hard task, right? It sounds good, so it must be right. Yes. And then you have to really overcome that and say, okay, what could be wrong here? Because anything could be wrong here. It's not like you're training another human who has never done summarization before, where you might be able to guess, okay, an inexperienced human might make these sorts of mistakes, might miss these things. You can teach them that, but that's not how AI systems work. So now you have an AI system that's producing an output, and there could be an error anywhere. You have no idea. It could be in the smallest of places, but it could be really meaningful.

SPEAKER_01:

Well, I'm thinking about an example in my head. Like, oh, you have a 49-year-old male with a history of hypertension and hyperlipidemia who presented with shortness of breath, dizziness, nausea, and vomiting. Now you have to go and check each of those things, and that's a common sentence, right? Did they have nausea? Did they have vomiting? Did they have shortness of breath? Are we missing a complaint of tingling? I don't know. All of those things you have to go and verify. And that's not tenable as far as I'm concerned. That's not a job that I would be able to do all the time, nor would I want to. I would say, oh, it would just be easier for me to read the chart and summarize myself, thank you.

SPEAKER_00:

So it's not solving the original task it was supposed to do, which is to make summarization easier, right?

SPEAKER_01:

Yeah, exactly. I want my life made easier, not harder.

SPEAKER_00:

Yeah, I mean, the whole point of giving you that tool was to make things easier, not to give you additional issues that you have to deal with, right?

SPEAKER_01:

Or make me have trust issues with the LLM.

SPEAKER_00:

Right, right. Which you will, as soon as it gets something wrong, it's done, right? Because there's no way for you to fix it.

SPEAKER_01:

Yeah, exactly.

SPEAKER_00:

Right. So this paper came out and said, okay, let's find a different way of fixing it. And they used a technique called RAG. RAG is an acronym for retrieval-augmented generation, which is a technology that's been around for a little bit now, an extension of an LLM. And people have been using it quite a bit for reducing hallucinations, improving factuality, and so on. There are still some open questions about it, it's not perfect, but in this particular application, they were able to use it effectively. And they used a couple of other cool ideas here too. But I thought it was really well done, just as a system that works better now than it did before.

SPEAKER_01:

So tell me, what does RAG mean? What does it actually mean?

SPEAKER_00:

So an LLM, again, going back to the world of LLMs, is an AI tool that produces text. You type in a prompt and it produces more text out. In other words, it completes your sentences, it predicts what you're gonna say next, it predicts the answer to questions. But the problem is, in order to answer a factual question, if you have just the LLM, then it's relying on the knowledge that it already has. And if you think about an LLM and its training, all the knowledge it has is based on the training that happened before the LLM came to your use. Companies like OpenAI and all these other companies pass the LLM through trillions of words of text from the internet and train it to predict the next word. And that's the knowledge it has. It has generalized world knowledge, right? It knows nothing about the specific patient on hand at all, because there's no reason it would. That's not how it's trained. It's very good at reasoning about things and completing sentences, and it's good at answering generic questions like, what is the capital of France? But it might not be as good at answering very specific things. And in fact, it's dangerous, because if you give it a specific piece of information and ask it a question, it might give you the generic answer, which may not be true for that person, right? For that patient. So that's the world of LLMs: you can't completely rely on one for specific factual information. So what people did was develop a technique called retrieval-augmented generation, which is a way of saying, look, I have this set of documents that has all the correct answers for this specific patient, right?

SPEAKER_01:

Okay, so that's their chart. Like their chart has all the correct answers. That's your source of truth.

SPEAKER_00:

I have this thing. Now, is there a way for me to encode this thing so that when a question is asked, I'm able to pull out relevant excerpts from it, put those in my prompt, and then have the LLM do its normal thing?

SPEAKER_01:

Okay, so the summary that we're getting is is being checked against the source of truth, which is the chart.

SPEAKER_00:

It's not being checked. The retrieval-augmented piece of it is not the checking; there's no checking. All it's doing is, let's say I ask a question about something very specific. Like, why was the patient intubated? That's an example that's in here. Oh, why was the patient intubated? Yeah, right. So it would have to go to the patient's records and find all of the snippets of facts that are present in the charts. Remember, we're assuming the charts are facts, right? There's nothing wrong with them. They're gold-standard facts, right? They're ground truth. And so the retrieval piece will go and find everything it deems relevant to this question, and it pulls out all those facts. And in a traditional retrieval-augmented generation process, you would take all of those facts and dump them into the prompt. Now you have the question of why the patient was intubated, colon, and then all these different facts that you just found.

SPEAKER_01:

Okay, so like they were intubated because they had a procedure done.

SPEAKER_00:

Well, maybe there's a whole bunch of things that were retrieved from the actual facts that are now explicitly present in the context, and then the LLM is allowed to answer based on that. That's typically how retrieval-augmented generation works. Okay. So now the LLM is not relying purely on its previous knowledge; it's given specific information and it's told to answer the question in this context. So its context has been narrowed a little bit.
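
That flow (retrieve relevant chart snippets, then insert them into the prompt so the model answers in context) can be sketched in a few lines. This is a minimal illustration only: the word-overlap scoring stands in for a real embedding-based retriever, and the chart snippets are invented for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphabetic tokens, for crude relevance scoring."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, chart_snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets sharing the most words with the query.
    A real RAG system would use embeddings and a vector index instead."""
    ranked = sorted(chart_snippets,
                    key=lambda s: len(tokens(query) & tokens(s)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, evidence: list[str]) -> str:
    """Narrow the LLM's context: answer only from the retrieved excerpts."""
    lines = "\n".join(f"- {e}" for e in evidence)
    return ("Answer using ONLY the chart excerpts below.\n"
            f"Excerpts:\n{lines}\n"
            f"Question: {query}")

chart = [
    "Operative note: patient intubated for elective ERCP procedure.",
    "Vitals stable overnight; no vasopressor requirement.",
    "Allergies: penicillin (rash).",
]
question = "Why was the patient intubated?"
print(build_prompt(question, retrieve(question, chart)))
```

In the full pipeline this prompt would then be sent to the LLM; the sketch stops at prompt construction, since the retrieval step is the part RAG adds.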

SPEAKER_01:

Yeah, well, that makes sense. You don't want it like pulling from all of the internet, you literally want it pulling just from that individual patient's chart to answer the question. Because the reason like you might get intubated is very different from the reason someone else might get intubated. You don't want it, you know, conflating those two things.

SPEAKER_00:

Exactly. And even this approach has issues, because the LLM still has a choice of whether to use that information or not. It can still ignore what you just gave it and answer the question more generally, right? And when I say it's making a choice, it's not actually making a choice. It's still just predicting the next word, right?

SPEAKER_01:

And maybe it's 99% that they got intubated for a procedure, but then what if there's like a 1% that it's something else?

SPEAKER_00:

Yeah, or it just ignores it. There have been some studies done in the past where, depending on how confident the LLM was before the extra facts were inserted, it would either ignore or consider the extra facts. So if it knew an answer to a question confidently, it would just ignore the extra facts when answering the question.

SPEAKER_01:

But what is confidence, anyways, for an LLM? Oh, I could go down a big rabbit hole.

SPEAKER_00:

But the point is, there exists the possibility that it ignores the facts.

SPEAKER_01:

Okay, great. That still exists. Right. But it's better, because it's pulled out the information that it should be following.

SPEAKER_00:

And there are prompting techniques to really force it to use that information, right? So then you get more accurate pieces of information. In this case, where you want to draft a nice summary, what you would do is have a retrieval-augmented system check against whatever LLM generated that summary in the first place. So the system they propose in this paper is a verifier of sorts.

SPEAKER_01:

VeriFact.

SPEAKER_00:

VeriFact, yeah.

SPEAKER_01:

I like the name. It verifies the facts.

SPEAKER_00:

Yeah. Regardless of what your summarizing LLM is, you're given some summary, whether it's human-generated or LLM-generated; it doesn't matter. What the system does is say, okay, let me take that summary and break it down into all the facts you're claiming in it. All the things you're saying are fact, right? I'm gonna just list them out. Then I'm gonna use my retrieval engine to go and look for relevant facts in the patient's charts. And then I'm gonna bring those two pieces together and rank them. I'm gonna decide which ones are important, and then I'm gonna present all of them to another LLM, which is gonna act as a judge and evaluate: are these facts that you state in your summary consistent with the facts that I'm showing you from the charts?
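
A toy version of that verification loop makes the moving parts concrete. Here the summary has already been decomposed into claims by hand, and both `retrieve` and `judge` are rule-based stand-ins for what are LLM and embedding-based components in the actual system; the chart entries, stopword list, and overlap threshold are all invented for illustration.

```python
import re

STOPWORDS = {"the", "a", "is", "was", "to", "due", "in", "both", "for", "and", "not"}

def content_words(text: str) -> set[str]:
    """Lowercase alphabetic tokens minus stopwords (digits are dropped)."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

# Invented chart entries standing in for the patient's record.
CHART = [
    "Operative note: patient intubated for the procedure.",
    "Demographics: patient is a 77-year-old male.",
]

def retrieve(claim: str, min_overlap: int = 2) -> list[str]:
    """Toy retriever: chart entries sharing >= min_overlap content words
    with the claim."""
    cw = content_words(claim)
    return [e for e in CHART if len(cw & content_words(e)) >= min_overlap]

def judge(claim: str, evidence: list[str]) -> str:
    """Toy judge: a rule-based stand-in for the judge LLM, issuing the three
    verdicts discussed: supported / not supported / insufficient."""
    if not evidence:
        return "insufficient"  # nothing relevant was retrieved
    found = content_words(" ".join(evidence))
    return "supported" if content_words(claim) <= found else "not supported"

# Claims a decomposition step (an LLM call in the real system) would extract.
claims = [
    "The patient was intubated due to septic shock.",  # contradicted by the note
    "The patient is a 77-year-old male.",              # matches the chart
    "The patient reported tingling in both hands.",    # no evidence either way
]
for claim in claims:
    print(judge(claim, retrieve(claim)))  # not supported / supported / insufficient
```

The point of the structure, not the toy rules, is what carries over: each claim is checked independently against retrieved evidence, so a sentence that mixes right and wrong facts gets partly flagged rather than waved through.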

SPEAKER_01:

And these are two separate LLMs?

SPEAKER_00:

Yeah, there are potentially more, right? Doing different things. The retrieval engine might be using LLMs along the way. But the key piece, yes, the key piece is that it's a separate LLM that is a judge. The paradigm is called LLM-as-a-judge, in which an LLM is used for evaluating something, so you don't have to have the clinician doing it.

SPEAKER_01:

So the LLM is trying to find the needle in the haystack.

SPEAKER_00:

Yes, yes. And all the LLM is doing in this particular case is taking the summary, breaking it down into all the facts, checking if the facts are correct, and if not, flagging them, saying they're not supported or they are supported or whatever, right?

SPEAKER_01:

I think one of the things that's interesting to me about this is that it's not like, oh, this sentence is true or false. It's breaking the sentence down. Oh, you know, you have a 77-year-old male who presented with nausea and vomiting and was found to have acute cholangitis. There's so much information there. So now it's verifying: did they have nausea, did they have vomiting, is the age correct? You know, what was the diagnosis?

SPEAKER_00:

Is the gender correct? Yeah, everything.

SPEAKER_01:

And I thought the output of it was interesting too. They gave some examples where they said, okay, here was the proposition from the initial LLM, the summary. And then, what were the facts in the reference document? Yeah. And then, what's the verdict? Was it supported or not supported, and for what reason? So now you can bring me in to decide. One example they give in the paper: the proposition was that the patient was intubated due to worsening septic shock. But when they went back to the electronic health record, it said, hey, the patient was intubated for the procedure. And so the verdict in this case was that the original claim was not supported. They were not intubated for septic shock; they were in fact intubated for the procedure. And so you can come back to me, the clinician, and be like, hey, what's truly the right answer here?
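
The reviewer-facing output described here (proposition, reference evidence, verdict, reason) might look something like the record below. The field names and wording are illustrative guesses at that kind of audit trail, not the paper's exact schema.

```python
import json

# One hypothetical verdict record for the intubation example; the schema is
# a sketch of the audit trail described, not the paper's actual format.
verdict = {
    "proposition": "The patient was intubated due to worsening septic shock.",
    "evidence": ["Operative note: patient intubated for the procedure."],
    "verdict": "not supported",
    "reason": "The record attributes intubation to a procedure, not to septic shock.",
}
print(json.dumps(verdict, indent=2))
```

A clinician reviewing such a record can see at a glance what was claimed, what the chart actually says, and why the claim was rejected.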

SPEAKER_00:

Yeah. And honestly, as a clinician, you might not care about any of this. All you want is the summary, right? At the end of the day. At the end of the day, yes. But I want it to be an accurate summary. Right, right. So ideally, you'd want the LLM judge to also fix the summary so that it states the correct facts, as opposed to just telling you that there are mistakes here, right? Right now, all it's doing is telling you that there are mistakes it found in this other tool. What you want it to do is say, okay, just tell me the answer. What's the final verdict? Yes. Yeah. But at least you have some degree of confidence, because the reason this is the correct summary is that I have specific excerpts from the patient's records to show that these things are all supported.

SPEAKER_01:

And each of them are verified. Because I think it's very easy as a clinician to look at a sentence that sounds reasonable and be like, yeah, that sounds reasonable. They had nausea, vomiting, and abdominal pain. Yeah. Whereas what if they just had nausea and vomiting and not abdominal pain? Yeah, I I you know, I I don't know, like it's totally possible that that is how someone presented, but if I read that sentence, I'd be like, well, that sentence sounds good.

SPEAKER_00:

Yeah. For me, this paper actually hits on something that is a core research element of my own work, which is argumentation: the idea that when humans communicate with each other, we're essentially making arguments. We're making claims, we're providing evidence; whatever we're saying, we're building these argumentative structures. And that's kind of what's happening here. You're making all these claims in the summary that are facts, and you're stringing them together and making an argument for why this is the summary, right? Why the patient is this way. And so to me, this is a very interesting starting point, because there's so much more here, right? The summary still might be wrong even if every single fact in it is correct, if the facts are written in a way that suggests one fact is supporting another when they might be unrelated, right? Or if it's written in a way where you're piecing together an overall argument that has implicit assumptions that are not true. And it's even more nuanced than that, because you might have a claim that is supported by four facts, but there are a couple of facts that could contradict it. And you as a clinician might say, you know what, it really is supported; these other things don't really matter. You're hoping that the LLM judge is making that call correctly, right? It might not be. Yeah. So there's nuance there. And that, I think, from a research standpoint, is very interesting, because humans make those kinds of uncertain judgments all the time. We have to.
And we evaluate not just the veracity of a claim, but the structure of the argument being made. Just because you say two true things, it doesn't mean that one thing caused the other. Right. Right. That might be implied in the way you said it. I could say, I don't know, the sky is blue, therefore my dog is barking. Those two things might both be true, right? And I said "therefore" explicitly in this case, but even if I hadn't, I would have implied the existence of an argument there, of a support claim, which is not true. So things like that, I think, are the next thing. To me, that's very interesting. I get very excited about this.

SPEAKER_01:

But I want it to be perfect, yeah.

SPEAKER_00:

So, I mean, again, you want accountability. What you want is not just correctness, but some kind of auditability and accountability, so that if something goes wrong, you have at least a trail to describe, okay, here is the summary that was provided, and here is the reasoning trail for how we arrived at it. And whether that's correct or wrong, you can evaluate, but at least you understand why it got there, right? And I think that's important for all of our AI tools going forward. We don't have enough of that. There's too much black-box AI jumping in and just doing everything for you right now, without providing any of this kind of accountability. So to me, that's why this work is also very exciting: it's moving in the right direction on that front.

SPEAKER_01:

Yeah, and I completely agree, because automating some of the more mundane tasks in a way that has accountability, so that you can actually spend more time with your patients, that's huge. That's gonna make people more efficient. And being able to trust that output is something that we need: not only to trust the output, but to know where the output is coming from and to understand, hey, this was verified. And for something that was not supported, there was a reason or a verdict that came out of it. And I think this paper helps us move in that direction. It's something with very high utility for a clinician, because anytime you do something that improves operational efficiency and lets people spend more time with their patients, they're going to be happy. Your providers are going to be very, very happy. And honestly, chart review can be very mundane. You're clicking on a million different things, trying to get through all these past notes, and maybe you're seeing multiple different summaries throughout the chart. Having a really good source of truth that you can peruse and use to understand what just happened with that patient makes your job easier. So you can go ahead and talk to the patient and make the decisions that need to happen.

SPEAKER_00:

Yeah, exactly. Exactly.

SPEAKER_01:

So thank you for teaching us about RAG and LLM as judge.

SPEAKER_00:

Yeah. And, you know, people also wonder how you can use an LLM as a judge, right? As long as you can break it down into these different functionalities, at least you have some degree of understanding and transparency as to what's happening under the hood.

SPEAKER_01:

You're like training it. Yeah. You're telling it what to do and how to judge, and you're telling it to judge the summary against the verifiable facts that are in that electronic health record for an individual.

SPEAKER_00:

That's correct. Yeah, yeah.

SPEAKER_01:

Amazing. Amazing. Can't wait to see this uh put into action in more places and um future iterations of it as well.

SPEAKER_00:

Yeah, me too.

SPEAKER_01:

All right, we will see you next time on Code and Cure. Thank you for joining us.