
Code & Cure
Decoding health in the age of AI
Hosted by an AI researcher and a medical doctor, this podcast unpacks how artificial intelligence and emerging technologies are transforming how we understand, measure, and care for our bodies and minds.
Each episode digs into a real-world topic to ask not just what’s new, but what’s true—and what’s at stake as healthcare becomes increasingly data-driven.
If you're curious about how health tech really works—and what it means for your body, your choices, and your future—this podcast is for you.
We’re here to explore ideas—not to diagnose or treat. This podcast doesn’t provide medical advice.
Code & Cure
#13 - Can Machines Choose Our Diagnoses?
What if AI could turn chaotic clinical notes into clean, billable codes—without sacrificing accuracy or trust?
Every shift, emergency physicians face the same grind: time-crunched documentation, symptom-first note-taking, and the constant lure of the “unspecified” box just to move on. But what if a system could read between the lines—and suggest precise, payer-accepted codes grounded in real guidelines?
In this episode, we explore how retrieval-augmented generation (RAG) is reshaping medical coding. Laura, an emergency physician, shares what it’s really like to code in the middle of clinical chaos. Vasanth, an AI engineer, explains why standard large language models often hallucinate ICD-10 and CPT codes—and how RAG brings the conversation back to solid ground with verifiable sources, official codebooks, and audit-ready citations.
We unpack a recent study comparing clinician-assigned codes to RAG-augmented outputs on actual emergency department charts. The results? When reviewers didn’t know which was which, they often chose the AI-generated codes—ones that captured true clinical meaning, like “alcoholic gastritis without bleeding” instead of the vague “epigastric pain.”
Beyond accuracy, we dive into the ripple effects: cleaner claims, fewer denials, stronger datasets for research—and the essential guardrails that keep things safe and ethical, from privacy safeguards to human review and confidence scoring.
If documentation has ever pulled you away from patient care, this episode offers a hopeful shift. Learn where retrieval-based coding tools fit into your EHR workflow, how clinicians can stay in the loop, and which high-volume complaints to tackle first for maximum impact.
Subscribe for more deep dives into clinician-centered AI, share this with the colleague who always codes “unspecified,” and leave us your biggest documentation headache—we’ll decode it next.
Reference:
Assessing Retrieval-Augmented Large Language Models for Medical Coding
Eyal Klang et al.
New England Journal of Medicine (NEJM) AI, 2025
Credits:
Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/
From scribbles on a chart to numbers on a bill, coding keeps healthcare running. And now large language models want in on the action.
SPEAKER_00:Hello and welcome to Code & Cure. My name is Vasanth Sarathy, and I'm an AI engineer. And I'm with Laura Hagopian.
SPEAKER_01:I'm an emergency medicine physician.
SPEAKER_00:Today we're gonna talk about coding. And when I think about coding, I think about writing code, programming, Python, all the computer languages, but that is not what we're talking about here. We're talking about medical coding.
SPEAKER_01:And not like the kind of coding you see on TV shows, like code blue, code black, whatever. Code brown. No, I'm just kidding. I think you can guess what that one is. We're not talking about resuscitations or, you know, a child lost in the hospital. We're talking about the coding that we use to bill. I know, it's like a super sexy topic.
SPEAKER_00:I know. It's interesting, because the word code has so many other meanings for people, right? I mean, there's cryptographic code, the Da Vinci Code. It has all these meanings, and oftentimes it's just associated with some set of letters and numbers that signify something, which is kind of what this is as well.
SPEAKER_01:Yeah, that's what this is as well. Although it's not a super-sleuth kind of thing. This is basically about being able to communicate with insurance companies. It's like, oh, someone came to the emergency department with chest pain, here's what happened with them. The whole idea is that you want to translate the medical diagnosis into some sort of standard code or standard language that lets us bill for it, that lets the insurance company know what happened, so you can process a claim and get reimbursed. And you want everyone around the world to be able to use that same system. Then in addition to that, it actually helps from a research standpoint. So: I want to look at all the people who had chest pain in the last 30 days in this region of the United States, or I want to look at all the people who had Lyme disease and see where it was concentrated. So it can be used for research as well. It's sort of a method of communication, right?
SPEAKER_00:Yeah, that's right. And because it's standardized, it's useful in so many different contexts.
SPEAKER_01:Exactly.
SPEAKER_00:I remember when I was a patent lawyer, we had a similar thing with patents, what are called classification codes. It was a similar idea: a patent had a certain topic, maybe a medical device or some kind of mechanical device, or maybe a kind of screw. And there were all these different categories and classification codes for each of those areas. The purpose was to quickly assign the right examiner, a person with the right qualifications, to examine it. Here, of course, it serves the purpose of helping with reimbursement. That seems to be the number one issue.
SPEAKER_01:Right. But I think the concept is the same. It's sort of like, okay, here's this parent tree: these are all the codes about X, all the codes about chest pain. And then you can go in with greater degrees of specificity if you want. And the more specific you get, the easier it is to bill for that kind of thing, right? So there could be chest pain unspecified, but then, more specific, you might have precordial chest pain or pleuritic chest pain, or maybe it's some sort of heart attack. The more specific you get, the more information you're communicating. But I will also say that in the emergency department, it was often the last thing on my list of things to do.
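To make that "parent tree" concrete, here is a minimal sketch of the hierarchy as a data structure. The codes and descriptions shown are real ICD-10-CM entries for chest pain, but the nested-dict representation is purely illustrative, not how any coding system stores them:

```python
# Illustrative sketch: ICD-10-CM codes fan out from a general
# "parent" category to more specific children.
chest_pain_family = {
    "R07": {
        "description": "Pain in throat and chest",
        "children": {
            "R07.1": "Chest pain on breathing",  # i.e., pleuritic chest pain
            "R07.2": "Precordial pain",
            "R07.89": "Other chest pain",
            "R07.9": "Chest pain, unspecified",  # the easy default
        },
    }
}

def list_codes(family):
    """Print each specific code under its parent category."""
    for parent, info in family.items():
        print(f"{parent}: {info['description']}")
        for code, description in info["children"].items():
            print(f"  {code}: {description}")

list_codes(chest_pain_family)
```

The deeper you go in the tree, the more clinical information a single code carries, which is exactly the specificity-versus-speed tradeoff discussed next.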
SPEAKER_00:So let's talk about that. What type of note-taking is happening, and when does this come in?
SPEAKER_01:Yeah. So this would be integrated into our notes in the electronic health record. I might have someone, I'll just run with this chest pain thing, right? I might have someone come in with chest pain, and I'm recording everything: the history of their current illness; their examination, right, what did their heart and lungs sound like; the tests that I ordered, an EKG, labs, a chest X-ray; and what the results of those are when they do come in. And then I want to put in some sort of diagnosis or code to bill them for. And I might be doing this before I even have all the information back. So in this case, if I don't have all the information, I'm at least putting in an unspecified chest pain ICD-10 code to show why they came in. If I went back later, I might want to add a greater degree of specificity if I get it. Or sometimes I wouldn't have that degree of specificity. They came in with nebulous chest pain, I made a decision based on the risk stratification I did, their labs were normal, and I'm not really sure what the cause was, but we felt they were safe to go home.
SPEAKER_00:Now, about those ICD-10 codes. The example you gave of chest pain was a symptom, so there seems to be a list of symptoms that are ICD-10 codes, but there are also other codes for, say, procedures and other things. It seems like everything you bill for should have a code with it, right?
SPEAKER_01:Yeah, everything that you bill for does have a code with it. And just to get into some nuance: the ICD-10 codes are more about what the diagnosis is, and as you're correctly pointing out, some of them are very nonspecific and relate to symptoms, because sometimes you don't fully know. If you're doing a procedure, that's usually a different kind of code, a CPT code. But the concept is the same: you're trying to communicate what you think someone had and what you did for them, and be able to bill for it. And by the way, in a fast-paced environment it was the last thing on my list to do, which is why we had billers and coders looking at our charts afterwards and often editing things. It wasn't something I was necessarily an expert in, and the hospital or the system I worked for would want to maximize the reimbursement and the communication. So they would have someone look it over to make it as specific and accurate as possible.
SPEAKER_00:Yeah. See, that's super interesting to me. It shows you how important this is, that they have a whole system in place that not only lets you record the codes, but also has people verifying them to make sure the right values are entered. So when you're actually putting it in, I'm imagining some kind of digital interface, like you talked about, where you write some notes in just normal language. And then at the bottom, I'm imagining some kind of drop-down that has the codes in it, where you just choose one. Is that kind of how it works?
SPEAKER_01:Yeah, I mean, there are so many codes, right? So it's actually more like when you're buying something online and you're typing your address in. You type 17 Warren Ave, and it's like, oh, do you want the 17 Warren Ave in Nevada or in Texas or in Massachusetts? It's kind of like that. You start to type in chest pain and it brings up all the different things that start with chest pain, so you can choose from that menu. I would honestly choose chest pain unspecified most of the time, which speaks to the issues you have, especially in a fast-paced environment, of not getting specific or accurate enough. Manual entry of ICD-10 codes is just not that great.
SPEAKER_00:Yeah. And when you enter the code, you start getting some information from the patient and you put in a code. And maybe later, when you get more information from the labs or whatever, do you go back and change it, or is it set once you've written it in?
SPEAKER_01:Oh, it's not set, but I'm probably guilty of not going back and changing it. Honestly, a lot of times I would do the charts after my shift or later on. I often didn't have time to do them on the fly, just because you're bopping between so many patient rooms. It's definitely like a few episodes of The Pitt.
SPEAKER_00:Yeah, right.
SPEAKER_01:You know what it was like.
SPEAKER_00:Yeah.
SPEAKER_01:But yeah, it's just not something I would prioritize, getting the billing exactly correct. And it's not something that I learned in medical school either. It was on-the-job learning: oh, if you want to get reimbursed for the procedures and the diagnoses and for seeing this patient, you want to make sure you fill out this, that, and the other thing, and make sure your code is as specific as possible. And that's all good and fine, but it's not the priority when you've got sick patients.
SPEAKER_00:That's right, it's not at the top of your list. And I was going to ask you about the training you receive for it, but I'm assuming there's not much, because it's another skill that you only apply on the job in the hospital setting. It's not like a piece of medical knowledge, right? It's got nothing to do with it.
SPEAKER_01:Well, it's not a piece of medical knowledge, but it is tied to reimbursement. So like I said, nothing in medical school, but later on you could take courses, and you were trained on the job to make sure you tried to maximize the reimbursement. It's a necessary piece of administrative paperwork that you have to do.
SPEAKER_00:Yeah, and that's a great way to segue into the point we want to make here, which is that this paper we looked at was about maybe having an AI system do some of that heavy lifting for people. It seems completely reasonable, right? Given all our discussions, especially in our past episodes, you would think that an AI system could just pick up these medical notes, read them, and assign them a code, because that's what you're doing. And these systems are supposedly intelligent and capable of doing it, right?
SPEAKER_01:Yeah, and it's something that honestly would go to the wayside. I just didn't have time for it. It's a great application of AI. It's boring, it's not exciting, and paperwork's just not my thing.
SPEAKER_00:Yeah, and that's what automated tools are meant to help us with, right? Getting rid of all the boring stuff so you can spend more time with the patient, actually pay attention to what's going on, and use your brain capacity for things that are related to the patient, not to reimbursement.
SPEAKER_01:Right. So, okay, in this paper, what they talk about first is that when they tried to do this with plain LLMs, the models would just hallucinate. They would make up codes that didn't exist, or they would assign the wrong codes.
SPEAKER_00:Yeah. I think that's an interesting question, and hallucination is the biggest problem with plain LLMs. So it's worthwhile thinking about how LLMs work at a very high level and understanding why they hallucinate at all. At a very high level, LLMs are neural networks that are very good at figuring out the next thing in a sequence of things. Imagine a sequence of words that forms a sentence: they're very good at predicting the next word in that sequence. And the way they've been trained, especially today's ChatGPT and so on, is by reading trillions of words on the internet. So they roughly know how language works, because humans are the ones writing those words, and they're able to pick out the patterns of how we use our language, how we talk to each other. Some of that language on the internet is, I don't know, Reddit posts, but some of it is more technical knowledge: scientific papers, potentially doctors talking to each other about things. ICD-10 codes? I'm sure anything that's publicly available on the internet is available as input to these systems for training. So yes, ICD-10 codes would be in their body of knowledge. If you ask ChatGPT what ICD-10 codes are, it would give you an answer. And if you ask it for a few examples, it would do that too. So that's part of the knowledge these systems have in them.
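A minimal sketch of what "predicting the next word" looks like in practice, using the small open GPT-2 model via the Hugging Face transformers library. Both the model choice and the prompt are illustrative; the episode doesn't name a specific system:

```python
# Ask a small open language model for its most likely next tokens
# after a prompt. This is the core operation every LLM repeats.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Scores over the whole vocabulary for the *next* token only.
    logits = model(**inputs).logits[0, -1]

top = torch.topk(logits.softmax(dim=-1), k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>10s}  p={prob:.2f}")
```

For a heavily memorized fact like this one, "Paris" dominates the distribution; for a rare, highly specific string like an ICD-10 code, the distribution is much flatter, which is where hallucination creeps in.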
SPEAKER_01:So I could say, like, I want the ICD-10 code for contusion of the right hand, initial encounter.
SPEAKER_00:Yes. And again, these LLMs are very good at coming up with the next word. Somebody figured out along the way that instead of just having them predict the next word, you could teach them how to answer our questions. Because if I asked you a question, what's the capital of France, my intention is to get the answer to that question, right? My intention isn't for you to ask another question back at me. Just because I asked a question doesn't mean the next thing should also be a question. But the LLMs as they were initially made would just ask you another question. Then somebody figured out that you can do something called instruction tuning, which is teaching it how to answer questions, how to take instruction, basically. And by doing that, now it actually begins to answer questions. Now, how does it answer those questions? By predicting the next word. It knows that when people ask what the capital of France is, the next few words should involve Paris somehow. And that's just from patterns it's seen before and the types of knowledge it already has. So the degree to which it's thinking about this and reasoning is not very high. More modern LLMs, the ChatGPTs of the world and the newer versions, are doing a little bit more quote-unquote thinking. That is, they put an answer out, then they put multiple versions of that answer out, reflect on them, and then give you the answer. But every one of those little reflective loops is still predicting the next word. The power of these LLMs that people have come to appreciate is that when you gave them trillions of words from the internet, they showed what appears to be reasoning, what appears to be thinking. That's a very powerful thing, but we have to remember it's not actually reasoning, not actually thinking. At the core of it, it is still predicting the next word. And that's at the core of why hallucination happens. Hallucination happens because it is predicting the next word. It is not thinking about whether that next word makes sense in the context of the particular question you're asking, or all the other pieces of information you gave it. Even if you give it a lot of context, even if you tell it everything it needs to know, it might not know that line three of everything you gave it is the most important piece, or line 12. Humans have a better ability to understand not just what's being said, but to build a model of a world in which what's being said is actually playing out. If I told you there's a pink elephant dancing on the table, your thought is a visual one: a pink elephant dancing on the table, a thought that captures the state of the world. And if I told you the pink elephant was dancing on the edge of the table, your immediate thought might be, oh my god, it's going to fall off the table; what else is on the table? You have a world you've built in your mind, and that world keeps you from hallucinating things that don't make sense in it. That is not how LLMs work; these systems don't necessarily have that.
So even in this pink elephant world, which is ridiculous, there are still things you have decided are not ridiculous in that world. And if things are said to you that fall outside of that, you're going to question them. You're going to be like, wait a minute, that doesn't make any sense, where did that come from? But here, there's no system in place that actually checks for that. There's no way for the LLM to know that the knowledge it's pulling is actually the relevant, correct knowledge for this moment. So you could ask it a question about ICD-10 codes, and it could give you an answer that looks reasonable, that looks like one of the ICD-10 codes, but it might be hallucinating, because it hasn't actually verified that it corresponds to what you really want.
SPEAKER_01:It's interesting, because in this particular paper there's a difference, from what I'm understanding from you, between saying, oh, what's the ICD-10 code for acute cystitis with hematuria, versus, hey, read this whole chart and tell me what the ICD-10 codes should be. Those are two very different things. And I kind of want the latter as a physician. I don't even want to have to code at all. I want to be able to say, here's the chart, give me all the codes that should go with it. And maybe the codes that could go with it include procedural codes. Maybe there's a primary ICD-10 code and a secondary ICD-10 code. If someone had a seizure and then broke their arm because of it, I want it to get both of those ICD-10 codes in there.
SPEAKER_00:That's right, yeah. And for this sort of system, the LLM by itself, although it has a large body of knowledge, the more specific the knowledge or the request becomes, the harder it is for it to get it right. If I asked it the question, what is the capital of France, it's not going to hallucinate that, because that's everywhere. It's seen it all the time; it knows the answer inside out. Of course, if the capital of France changed tomorrow, it would keep getting it wrong, because that's what it's memorized, right? But this is a more specialized thing, where you gave it medical notes and context, this patient having this experience, and it gets more and more specific. Then relying on the patterns it found in human language use might not be enough for it to figure out for sure whether it's picking the right answer.
SPEAKER_01:Okay. So I read this paper with you, and they used something called RAG. What does that stand for?
SPEAKER_00:Oh, RAG. Yeah, not the stuff in your kitchen. RAG is an acronym for retrieval-augmented generation. We'll break it down. Retrieval-augmented generation is a technique that was invented a couple of years after LLMs came around, specifically to address hallucination problems in these specialized cases where you don't want to rely on the LLM's default knowledge, where maybe you have specific answers, or a manual, or some sort of book it should read from, as opposed to just making things up on its own. And the way it works is actually pretty straightforward, so I think I can explain it. LLMs work when you provide them a prompt: you give them a question, maybe some context, some text, and that entire context is put into the system, which then produces the answer for you. But that context potentially needs to be augmented with the correct information, not just your question. So instead of asking it about ICD-10 codes out of the blue and hoping it gets the correct description, let's say I had a chart that explained every single ICD-10 code, with specific language describing it nicely. And I said: hey, here's a patient's medical note, and here's a set of documents, say a thousand of them, where each document describes one ICD-10 code. Tell me which of the codes best applies. What you would want it to do is figure out from the descriptions which one is the closest, most relevant, most specific to the current note you have.
SPEAKER_01:So you're giving it a limited menu to choose from, and you're giving it specialized context, right?
SPEAKER_00:Right. So here's how the retrieval part works. I've talked about how default LLMs generate text; that's the generation part. We want to give the model the correct extra piece of information when it's relevant, to augment that process; that's the augmented part. Well, how is it going to get that extra piece of information? It has to retrieve it; that's the retrieval part, because we don't know upfront which ICD-10 descriptions we need to be giving it. So what you do is involve a different model, a neural network, that searches through the entire set of ICD-10 descriptions and chooses the ones that are closest, closest in language, to the note you already have. Maybe your note describes a patient's situation and some description matches it; the model finds the closest matches and suggests those codes from the documents you gave it. So when you give it a note and say, hey, find me the ICD-10 codes, a model takes that note, finds the nearest descriptions from the set of documents covering all the ICD-10 codes, and, if you said give me the nearest 10, it returns the nearest 10 and puts those descriptions back into the context. Then the LLM uses this augmented context to produce a better answer.
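Here is a toy end-to-end sketch of the retrieve-augment-generate loop just described. It assumes a tiny in-memory codebook, and TF-IDF cosine similarity stands in for the neural retriever; the ICD-10-CM codes and descriptions are real, but everything else, including the note, is invented for illustration and is not the paper's actual pipeline:

```python
# Toy RAG loop: retrieve candidate ICD-10 descriptions for a note,
# then build an augmented prompt constraining the LLM's answer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

codebook = {
    "R10.13": "Epigastric pain",
    "K29.20": "Alcoholic gastritis without bleeding",
    "K29.00": "Acute gastritis without bleeding",
    "R07.9":  "Chest pain, unspecified",
    "N30.01": "Acute cystitis with hematuria",
}

note = ("Burning epigastric pain after heavy alcohol use; "
        "no melena or hematemesis; exam consistent with gastritis.")

# 1. Retrieval: rank every code description against the note.
codes, descriptions = zip(*codebook.items())
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(descriptions + (note,))
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
top_k = sorted(zip(scores, codes, descriptions), reverse=True)[:3]

# 2. Augmentation: put only the retrieved candidates into the prompt.
context = "\n".join(f"{code}: {desc}" for _, code, desc in top_k)
prompt = (f"Candidate ICD-10 codes:\n{context}\n\n"
          f"Clinical note:\n{note}\n\n"
          "Answer with the single best code from the list above.")

# 3. Generation: `prompt` would now go to the LLM, which can only
#    choose among (and cite) the retrieved, verifiable candidates.
print(prompt)
```

The key design point: the model is no longer asked to recall a code from memory; it is asked to select from retrieved, checkable text, which is why the outputs become auditable.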
SPEAKER_01:So the whole idea is: let's eliminate, or significantly reduce, the hallucinations, so we can get a response that's more accurate.
SPEAKER_00:Yeah, it's more accurate, and it's more accurate because now it's actually pulling from the documents. More modern systems can even cite things: where they found it in the document, and so on. It's a pretty standard practice now to use a RAG system for many things. If you use Perplexity, or ChatGPT, these systems already have versions of RAG built in, exactly to bring in specialized knowledge. When you upload a PDF and ask ChatGPT to answer a question from it, it's reading the PDF. It's not quite doing full RAG, but it's doing something similar: reading the PDF, adding it to the context, and then answering the question based on that context.
SPEAKER_01:Okay. That makes a lot of sense. And it makes sense why it worked better than just using a straight-up LLM.
SPEAKER_00:That's right. And sometimes the term zero-shot is used: zero-shot prompting. It's funny, because the term goes back to the days of traditional machine learning, where you had to give the system lots of examples with correct answers. Then somebody said, no, we can't give it so many examples; we have to give it fewer. So we do few-shot learning: give it a few shots, a few examples, and it knows how to do the task. And then zero-shot is when you don't give it any examples at all; you just ask it to do the task, which is where we are with LLMs. When you ask ChatGPT a question, you don't give it five examples of good answers that you want; you just ask the question. So what I've described so far is just zero-shot. And then we add RAG on top of that, and that's when it gets much better. What these folks found was that you could take a model that does very poorly zero-shot and substantially improve it for this ICD-10 coding task using RAG.
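To pin down the terminology, here is a hedged sketch of what zero-shot versus few-shot prompting could look like for this coding task. The prompt wording, the notes, and the example pairings are invented for illustration:

```python
# Zero-shot: no worked examples. Few-shot: a couple of worked
# examples precede the real task, so the model can imitate them.
note = "Dysphagia, unable to swallow solids; possible esophageal obstruction."

zero_shot = (
    "Assign the best ICD-10 code to this note:\n"
    f"{note}"
)

few_shot = (
    "Assign the best ICD-10 code to each note.\n\n"
    "Note: Burning epigastric pain after alcohol use, no bleeding.\n"
    "Code: K29.20 (Alcoholic gastritis without bleeding)\n\n"
    "Note: Central chest pain, cause unclear after workup.\n"
    "Code: R07.9 (Chest pain, unspecified)\n\n"
    f"Note: {note}\n"
    "Code:"
)

print(zero_shot)
print("---")
print(few_shot)
```

RAG is orthogonal to both: it changes what evidence sits in the prompt, not how many worked examples you show.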
SPEAKER_01:Yeah, and it did better than the physicians' coding in a lot of cases, which I'm not surprised about, to be honest with you. And I think they give some really good examples here, where I would probably have used the same kind of code as a provider. There's one example where the provider gave a code for epigastric pain. That's pain in the center upper part of the stomach. That might even have been the complaint of the patient who came in, right? It's not very specific; it doesn't tell you why they had that pain; we don't know what caused it. It's just something that checks the box so you can bill. But when they used the RAG-enhanced GPT-4, what they got was something both more accurate and more specific: alcoholic gastritis without bleeding. So there was inflammation in the stomach due to alcohol consumption, and it did not result in bleeding. And it caused that same epigastric pain that was in the provider's description, but it's actually showing you, okay, here's why it happened, here's the underlying cause. Now, I have to say, I opted for general codes oftentimes because, like I said, sometimes I was coding and I didn't have a diagnosis yet. Or sometimes it's to avoid a liability risk, right? Epigastric pain can be from lots of things. It could be from an H. pylori infection, it could be from gallstones, it could be from alcoholic gastritis. And just because someone has alcoholic gastritis doesn't 100% mean their pain was from that. So sometimes I would back off and use a more general or nonspecific code, because you can't always be 100% sure. But I think from a billing and claims perspective, and from a research perspective, it is nice to have that more specific and more accurate code rather than plain old epigastric pain.
SPEAKER_00:Yeah. And I think it's worthwhile spending a couple of minutes talking about how they evaluated the system, because we keep saying it's better or worse than humans. What does that actually mean?
SPEAKER_01:Yeah.
SPEAKER_00:So one thing they did first was check whether what was being produced was actually valid codes, because it's very possible the model was just producing random digits and letters that don't meet any requirements. That's part of the hallucination problem, right? Hallucinating codes completely. With RAG, they got very good results in terms of the validity of the codes; some of these models were perfect, producing only valid codes 100% of the time. They also created a data set in which they took real patient notes along with the actual codes assigned by humans, I think from Mount Sinai Hospital. They put those notes into the system and compared its output against the codes the human doctors originally assigned: how close were they, were they an exact match to what the LLM system produced? Now, there wasn't that close a match; there was actually quite a bit of discrepancy. But what's interesting is that, like you mentioned, doctors make mistakes too. So they said, okay, maybe that first comparison isn't the whole story; we have to actually look at this again. They had other physicians look at both the LLM-generated codes and the original human-generated codes and decide which they preferred, which they thought better captured the notes. Blinded, they were blinded, right. And the LLM codes were picked substantially more often than the human choices, by human physicians.
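A rough sketch of those two checks, code validity and exact match against the physician-assigned codes. The regex below is a simplified approximation of the ICD-10-CM format, not the official grammar, and the code lists are made up for illustration:

```python
# Two evaluation checks: (1) is each generated code syntactically
# valid, and (2) does it exactly match the physician's original code.
import re

# Letter (ICD-10-CM reserves 'U'), a digit, an alphanumeric,
# then an optional decimal part of up to 4 alphanumerics.
ICD10_PATTERN = re.compile(r"^[A-TV-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")

physician_codes = ["R10.13", "R07.9", "R13.10"]
model_codes     = ["K29.20", "R07.9", "ZZ9.99"]  # last one is malformed

valid = [bool(ICD10_PATTERN.match(code)) for code in model_codes]
exact = [m == p for m, p in zip(model_codes, physician_codes)]

print(f"validity rate:    {sum(valid) / len(valid):.0%}")
print(f"exact-match rate: {sum(exact) / len(exact):.0%}")
```

As the study's blinded review showed, a low exact-match rate isn't necessarily a failure: the model and the physician can disagree while the model's code is the better one.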
SPEAKER_01:Yeah. I mean, I'm not surprised. I'm looking at this example of alcoholic gastritis without bleeding and thinking, yeah, that's a better code than epigastric pain for sure. Or there were a lot of them where the ER doctor just coded unspecified, which, again, is exactly what I would do. The patient came in with dysphagia, unspecified: they had some issue with swallowing. But the LLM gave it esophageal obstruction: something was actually blocking their esophagus, so they were unable to swallow. It just gives you more information, right?
SPEAKER_00:Yeah.
SPEAKER_01:So it's easy to say, as the reader, I want to see that. But when I was coding, I was using the nonspecific codes myself. I was checking the box, getting the job done, not an A-plus job. Maybe the LLM could do an A-plus job for us, though.
SPEAKER_00:Yeah, and they even used an LLM as a judge. Just like I described physicians looking at both the human and the LLM outputs, they also had an LLM look at both and say which it preferred. And the LLM was consistent with what the human judges were doing as well. Interesting, right? So you can automate some of this process. Maybe some of the work that folks do reviewing and checking these things can also be assisted with these sorts of tools, not just the doctors' work. I think overall what I find exciting about this work is that it's an example of a task that really gets in the way, but is really important and potentially automatable. And what we've seen so far is that yes, it is actually quite automatable, and maybe even better when automated.
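Here is a small sketch of what a blinded judging setup can look like, whether the judge is a physician or an LLM. The shuffle is the blinding step; the judge call itself is left abstract, and all names and data are illustrative rather than drawn from the paper:

```python
# Blinded comparison: randomly assign the physician's codes and the
# model's codes to slots A/B so the judge can't tell which is which.
import random

note = "Epigastric pain after heavy alcohol use, no GI bleeding."
physician_codes = ["R10.13 Epigastric pain"]
model_codes = ["K29.20 Alcoholic gastritis without bleeding"]

slots = [("physician", physician_codes), ("model", model_codes)]
random.shuffle(slots)  # the blinding: hide which source is A and which is B

judge_prompt = (
    f"Clinical note:\n{note}\n\n"
    f"Code set A: {slots[0][1]}\n"
    f"Code set B: {slots[1][1]}\n\n"
    "Which code set better captures the note? Answer A or B."
)

# Send judge_prompt to a human reviewer or an LLM, then unblind.
answer = "B"  # placeholder for the judge's reply
winner = slots[0 if answer == "A" else 1][0]
print(f"Judge preferred the {winner} codes.")
```

Running many such blinded trials and tallying the winners is what lets you say one coder was "preferred substantially more often" than the other.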
SPEAKER_01:Yeah. What's interesting to me too is that I think this is a great use case. It's not that you don't need human oversight, but it's a task that is definitely not the most exciting, not the biggest part of a provider's day. And when you have an LLM that comes in and doesn't hallucinate and is able to give you that nice, accurate, specific code, that's actually helpful for everyone in the process. It's helpful for the people who review the chart, it's helpful for your billers and coders, and it's helpful for researchers as well. So I think this is a great application of LLMs, especially because they were able to really get rid of the hallucinations with RAG.
SPEAKER_00:Yeah.
SPEAKER_01:Awesome. Well, thanks for joining us. We'll see you next time on Code & Cure.