Code & Cure

#5 - Doctor's Notes: When AI Writes Your Medical History

Vasanth Sarathy & Laura Hagopian Season 1 Episode 5

What if an AI could write your medical chart—and what happens when it gets it wrong? Doctors have long lamented the paperwork that comes with every patient encounter. “Charting was the bane of my existence,” admits Dr. Laura Hagopian, an emergency physician who’s spent countless hours piecing together fragmented notes and outdated records. Could artificial intelligence finally lift this administrative weight?

Recent advances in large language models promise to generate discharge summaries as accurately as seasoned clinicians, potentially returning precious time to the bedside. By training on thousands of patient encounters and lab reports, these systems can stitch together coherent narratives of care—diagnoses, treatment plans, and follow-up recommendations—at a speed no human chart-writer can match.

Yet with speed comes risk. When an AI hallucination slips into a diagnosis and becomes enshrined in a patient’s record, who is accountable? Dr. Hagopian highlights the stark difference between human and machine error: “I feel very different about a human making a mistake compared to an AI making a mistake.” As trust in automated documentation grows, so too do questions about responsibility, oversight, and patient safety.

In this episode, AI researcher Vasanth Sarathy and Dr. Hagopian peel back the layers of these complex issues. They explore the nuts and bolts of AI summarization algorithms, discuss promising clinical trials, and weigh the ethical dilemmas of delegating clinical judgment to code. How do we ensure that efficiency doesn’t override accuracy when every data point can mean life or death?

Whether you’re a clinician craving relief from chart fatigue, an AI developer pushing the boundaries of what’s possible, or a patient curious about who’s really recording your health story, this conversation offers a vital look at the future of medical documentation. Join us as we navigate the promise—and the pitfalls—of letting machines tell our most critical health narratives.

References:


Physician- and Large Language Model–Generated Hospital Discharge Summaries
Christopher Y. K. Williams, et al.
JAMA Internal Medicine, 2025


Credits: 

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

Speaker 2:

A language model just drafted a patient's entire clinical summary in seconds, but can we trust an AI to get it right when the stakes are so high? We're diving into the promise and pitfalls of AI-powered medical documentation.

Speaker 1:

Hello and welcome. My name is Vasanth Sarathy and I am an AI researcher, and today I'm with Dr. Laura Hagopian, an ER doctor, and we are going to talk a lot today about summarization.

Speaker 2:

Hi, Laura. Hi.

Speaker 1:

How's it going?

Speaker 1:

I'm doing well. How are you? Good, good. Yeah, I think one of the topics that we wanted to really touch on and discuss was this notion of AI being used in a clinical setting, particularly with respect to the fact that AI can generate lots and lots of text, as we all know from ChatGPT and so on. Maybe it can summarize documents, and maybe it can summarize medical things. So we thought that medical summarization could be a good topic today.

Speaker 2:

Yeah, and I have to tell you that charting was like the bane of my existence. It takes so much time. I mean, I remember back in the day when there were paper charts and they were definitely not efficient. But when we transitioned to electronic health records it was even slower and we spent a lot more time documenting. And the other thing with the templates in a lot of these charts is that all this stuff gets transferred forwards but no one's looking at it, and so the charts are just full, full of so much information and it's hard to piece together like what's important.

Speaker 2:

So sometimes, when I go and look at a patient's chart and try to figure out who is this person or what are their conditions, you know, there may be some medications on there that are really old, no longer relevant. Or I might see, okay, someone has this diagnosis of congestive heart failure; now let me try to find the imaging study, the echo, that shows exactly what their EF, their ejection fraction, is. And so I'd be piecing through different parts of the chart to try to understand or get a picture of who a patient is. But that's time-consuming. And then if you try to generate one on your own, oh, what's a summary of this patient, or what happened to them during this hospitalization, that's time-consuming too, because now you're looking through the daily notes, the specialist notes, all the labs, all the imaging studies, all the medications, and trying to create a coherent whole. And this is a great place where I think LLMs could step in, if we feel like we could trust them to do so.

Speaker 1:

So, before we get into the specifics of the LLMs themselves: I'm not a doctor, but I know that there is some summarization already happening just among humans, doctors summarizing, like you just mentioned. Can you give us a little bit of the lay of the land of the types of summaries that you would do in the hospital, for instance?

Speaker 2:

Yeah, I mean, well, so summaries could apply outside of the hospital too, right? Like, any type of summary could be useful.

Speaker 2:

It could be something small, like oh, someone went for an ER visit, right, and we want to know what happened there, and that seems like it's got a nice little box around it. It could be you know, someone had a radiology study and what did the x-ray show? What's a summary of that? That also seems like a little bit on the simpler side, like you could put a box around it and then you could say okay, well, I want a summary of everything that happened during someone's hospitalization. What if they were in the hospital for 37 days and half of those were in the intensive care unit? Now you're taking tons of notes and different types of studies and trying to put them together.

Speaker 2:

Right, they may have gone for 17 x-rays, they may have had multiple medication changes. Not only are they getting daily notes from the ICU itself, but there's notes from nurses, there's notes from specialists, and so there's all these different sources coming together, and labs too. And so all of that feels like a bigger chunk to summarize, at least to me, like when I had to summarize these things as a physician. And then, even broader than that, it's like, well, what if I want the basics of who this patient is when they're coming in to see me for the first time? It would be great to have an overarching summary: okay, this is this person, they have this past medical history, these are their medications, these are their allergies, these are the surgeries they've had, this is their social history. And that's kind of how our brains think. But it would be nice to have it all updated and all in one spot.

Speaker 1:

It seems like there's sort of a couple of dimensions, at least, for summaries. One is the types of information that is needed to prepare the summary, which could be, like you said, something kind of boxed in, something small, or it could be something that involves lots and lots of notes over a long period of time. So that's, in my head, one dimension of what makes a summary. The other dimension, it seems like, is who is it being written for? You know, we talk a lot about writing the summaries and how difficult it is to write the summaries, but at the end of the day, someone's using the summary for something. It might be you yourself using somebody else's summary. But can you talk to us a little bit more about who's reading these summaries and what they're doing with them?

Speaker 2:

Yeah, I mean, this is a great example. Like, when you say you went to the emergency department, that summary, your primary care provider would want to see it. It's like, what happened to this person when they went in for, say, an asthma attack? Okay, what medications were they discharged on? Now the PCP is going to know: hey, I'm going to call them and schedule a follow-up appointment. Anytime that you're transferring care between providers, it's important for each provider to understand what happened along the way. Say you're getting discharged from the hospital to a nursing home: the nursing home needs to know what happened and what the plan is moving forward. And so all of these touch points and all of these summaries are ways to basically communicate between providers. And a lot of these summaries will also end up in someone's personal medical chart and in MyChart, so someone can read them about themselves as well.

Speaker 1:

That's interesting. And going back to the issue of the writing part itself and the data: what are the types of things you're looking at to prepare one of these summaries? I mean, let's just take an example of something that involves multiple things that you have to look at. What are those things? Are they just other people's notes, or are they lab results? Are they images? Are they all of the above? I mean, what are we looking at?

Speaker 2:

Yeah, I guess it depends on, like, how far you want to push this summarization, and I kind of push that question back to you: what can the AI do? Because it could be just a simple single visit or, you know, a single x-ray. But ideally, what I'd want to see as a clinician is something more complex, where it takes, you know, notes from different people, lab testing, medication changes, radiology results, and puts them all together. In my mind that sounds like it's probably harder to do. It's harder for me physically to do that, to basically synthesize everything that's happened, and of course I'm not perfect at it either. I don't know if AI would be; I guess that's what we'll be talking about. But you want to take all that information and get it into a really usable fashion for the next person who's taking care of that patient.

Speaker 1:

Yeah, I mean, I think the reason I'm asking all these questions is sort of to set the stage, to set the domain in which the AI could be useful, because before we jump in and start throwing the AI around in these cases, I always believe, at least, that it's best to understand what the problem is, or at least what the domain is.

Speaker 1:

What are the constraints in the domain? What are the expectations? I mean, presumably one subclass of medical summaries is discharge summaries, where you have lots of data across long periods of time, and in that setting you have multiple types of data which the physician is looking at and piecing all together. And you're right, that makes it a much more difficult AI problem than something that's more compact. But even the more compact ones, even if you're writing a radiology report, you're still looking at images and interpreting them, and an AI system, in theory, is more constrained in that way. But there are other challenges that come up even in that setting. So it's always worthwhile to really dig in and understand what the actual world, the domain, is.

Speaker 2:

What the ask is. What the ask is, right. Right, exactly. I think the ask is that you want something that, well, first of all, now you're not taking provider time, so you're going to free up time, you're going to decrease burnout with a solution like this. But then what do we actually want out of the summary, right? Out of the summary, you want something that is easy to read, it's concise, but at the same time it has all the information, right? You want it to be comprehensive and you want it to be accurate. You don't want it to hallucinate and say, oh, this patient has a condition that they don't have, and you want it to be correct in terms of, say, someone started a new medication: you want that medication, the duration of time, all of that to be very clear. So I think there's probably a balance to be struck here and, like I said, humans are not perfect at this either, but I would want to see something that is both concise and comprehensive and, in addition to that, is highly accurate. Yeah, what are the chances?

Speaker 1:

Right, right, right, all of the above, right. I mean, I think that's a great segue to discussing specifics about LLMs. One of the benefits of a lot of the natural language processing work that people have done is, in fact, summarization, and LLMs specifically, across other spheres, not necessarily the medical sphere, are very popular for that, and for simplifying text they're particularly known for doing pretty well. They're capable of, you know, sort of digesting large amounts of text, and they're able to produce these summaries, and so it seems like a perfectly useful tool, a perfect tool, for this purpose.

Speaker 2:

So let's do it.

Speaker 1:

Right, and in fact some recent studies have done it, and we'll link those papers in the show notes. There's been some recent work, this year and this past year, showing that, for instance, LLMs are pretty good at generating discharge summaries. And what does pretty good mean? You know, as researchers in computer science you have to develop metrics, and I think that's a very critical question that we'll have to address later, I'm sure. But in some of these studies, you know, they use notions of conciseness, how comprehensive these summaries are, and how coherent these summaries are, and these are ways to kind of measure the quality of the summary. And these studies have suggested that LLM summaries are actually quite comparable to human expert summaries across many different metrics like this, which is promising, which is wonderful.

Speaker 1:

But these studies also took into account errors and issues that come up, and broadly it seems like you can sort of categorize them in different ways. So one is omission, which is the idea that you've left out a piece of information that shouldn't have been left out, which could be crucial data, or some kind of ambiguity where a human expert might need experience to be able to interpret a particular situation that could be read one way or the other, and if it's incorrectly interpreted, that's an issue, right. The other issue is modifying facts to make them inaccurate, and that can happen quite easily. I mean, you see a piece of data that said this took three days, but when you summarize it, you say five days.
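To make those error categories concrete, here is a minimal, hypothetical Python sketch of how an evaluation might tally reviewer-labeled errors across a set of AI-generated summaries. The label names and data shapes are illustrative assumptions, not the scheme used in the studies discussed here.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical error labels a clinician reviewer might assign to one summary.
ERROR_TYPES = ("omission", "inaccuracy", "hallucination")

@dataclass
class ReviewedSummary:
    summary_id: str
    errors: list[str] = field(default_factory=list)  # reviewer-assigned labels

def tally_errors(reviews: list[ReviewedSummary]) -> Counter:
    """Count how often each error type appears across all reviewed summaries."""
    counts = Counter({error_type: 0 for error_type in ERROR_TYPES})
    for review in reviews:
        counts.update(e for e in review.errors if e in ERROR_TYPES)
    return counts

# Toy data: three reviewed summaries with illustrative labels.
reviews = [
    ReviewedSummary("pt-001", ["omission"]),
    ReviewedSummary("pt-002"),
    ReviewedSummary("pt-003", ["inaccuracy", "hallucination"]),
]

for error_type, count in tally_errors(reviews).items():
    print(f"{error_type}: {count}")
```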

Speaker 2:

Okay, so this I need to add to my list of what I want in this, because I want it to be trustworthy. I want to be able to read it and say, okay, I trust that what this says is correct. I guess that goes hand in hand with accuracy. But, like, I'm sure you have more to list off. But it makes me take a step back right away.

Speaker 1:

Yeah, yeah, I mean, that's a consideration, right. I mean, we're not placing any judgment right now quite yet. We're just sort of listing the possible issues that could arise when you summarize. So, hallucinate: I'm placing a little judgment, I am. Yeah, well, you know, the other big category is hallucinations, where it's not about inaccuracy; you're just inventing new pieces of information.

Speaker 2:

So, like, you could just suddenly have a diagnosis of COPD in someone's chart that they actually don't have? Yes, yes, that's an example of a hallucination, which then can get carried forward to future charts, that's for sure.

Speaker 1:

Yeah, okay, it's a problem, and a lot of the studies that have been done so far have been primarily focused on just using data in its textual form, like other notes or other tables.

Speaker 1:

It's much more difficult, like I mentioned before, from an AI standpoint, to do multimodal information, which is both clinical notes but also images and such. And of course, there's also other issues about privacy. So, you know, some of these include patient data, some of them don't, and maybe the summaries don't need to include all of that; maybe the summaries only need to include certain types, removing others, redacting others. So that's another type of issue. And some of the AI researchers who have worked on this and on building LLMs for this have looked at these issues and, in fact, developed metrics for them and evaluated how well LLMs can do summaries. The recent paper in JAMA Internal Medicine that we'll link in the show notes compared human experts and LLMs on discharge summaries, and the error rates were comparable. I mean, it was not like the LLMs' omissions and misinterpretations were that much worse than the humans'.

Speaker 2:

Okay, but I have a question for you on this one, because you're a researcher, right? So, I mean, you have to compare to something. But is that the gold standard, or is that what we should be comparing to? Because, like, humans have accountability, right?

Speaker 1:

Well, actually, that's a great point. It might actually transition into our next point, which I think is, to me, the most interesting thing that people talk about when they talk about LLMs replacing a human task like this, especially because there's all this, you know, hey, it works, right? What's the problem? And oftentimes, when somebody says, wait, LLMs get these things wrong, there are hallucinations and this and that, the natural response for a lot of people is: well, humans make mistakes too. We're not great, we do this stuff all the time. And in fact, some of these studies have suggested that humans hallucinate, humans modify information, humans omit things. So I wanted to take this opportunity, actually, this might be a great chance, to talk about why, at some level, to me at least, it does feel like, yeah, you can make the argument that humans are making mistakes too, but it doesn't quite feel right. Humans are different, right?

Speaker 2:

Well, like we're responsible for it.

Speaker 1:

Yeah.

Speaker 2:

Right, like, I own it. I can own it if I made the mistake, or I can work to fix it for next time, or... I don't know, it just feels weird. Like, I expect, maybe because I'm not as trained as you, but like, I expect my AI to be perfect, whereas I don't necessarily expect a human to be perfect. But I do expect a human to be accountable. But what happens if the AI hallucinates and you get this diagnosis of COPD that gets perpetuated in someone's chart, and now they're prescribed a medication for it that they don't need, or something like that? I'm like, whose fault is that?

Speaker 1:

Right, and to some degree, having a clinician in the loop could be helpful, right. In that sense, the clinician who is asking the AI to perform the summary is the one who's responsible and accountable for that summary.

Speaker 2:

Okay, but, like at the beginning, we talked about how I don't have time for that. The whole point of the summarization is to like relieve that stressor of not having time to do this complex documentation. Is this wrong? Like it says, the EF was 30%, but now I need to go check and see if it's 50%, and you have to do that for every single piece of it. Then I'm like, well, how useful is this?

Speaker 1:

Yeah, I mean, I think the accountability and the trust building that happens from it is so central, and, especially with these automating tools, people don't talk about it enough. I think there's another angle too, from a technical standpoint, which is that with humans, you can have a dialogue. You know, say they're a new person and they're doing these summaries for the first time. They're going to leave things out. So, first of all, you have some teaching that's involved.

Speaker 2:

Yeah, we all did this during residency, right? Someone would read your notes, even as a medical student. You would get critiques on what was there, what was not there, how to improve them. And they got better over time, hopefully at least.

Speaker 1:

Yeah. Well, they got better over time and there was accountability, right. The ones that didn't get better over time were reprimanded or were put on a different kind of... Such a harsh term.

Speaker 2:

But yeah, yeah, right, right, right, there was room for improvement and improvement occurred. There you go.

Speaker 1:

But you could have a dialogue with a human and understand where they're coming from, and maybe they have an assumption that you can easily lift or fix or whatever. And so that teaching process, I'm not sure it completely exists in the LLM world. In theory you could prompt it and give it examples and fix it. But I think there's a key point here, which is that there's a distinction between behavior and mechanism. And what I mean by that is: the LLM is producing words as an output. That's its behavior. There's an internal mechanism, the neural network, that generates those words. And with humans it's a little different.

Speaker 1:

So we are also producing words and we have an internal mechanism, but we are able to instruct each other in a way that actually modifies that internal mechanism, and that doesn't happen with an LLM. So when you teach the LLM, you are giving it more words, which run through the same internal mechanism as before, and so, you know, fundamentally, I think you're not actually changing what it's doing. So it's not necessarily learning in the same way that we humans learn, and I think that's relevant and important here. And there's an ethical standpoint, which is that, because they're not learning the same way, there's no guarantee that they're going to be able to do the thing the next time in a predictable way.

Speaker 2:

Or any better. Or any better.

Speaker 1:

Correct. Yeah. So how do you build trust? Well, you build trust when you know that the new behavior can be predicted and has changed from what it was before, and that's not necessarily possible with an LLM, and so that's one thing that's very challenging. The other thing that is also very challenging, which is an open problem with AI right now, is this notion of world models, which is that when we talk to each other and we imagine these patients or read a summary or whatever, there is an imagination process that happens that tracks the patient and imagines all the details. There's a model that you build in your head of what the situation is, and then you run the model and you say, okay, this needs to happen, this needs to happen, and then you're able to generate new questions. Oh, did the patient get this medication? Let me go look at the result.

Speaker 2:

Let me go look at the... oh yeah, I would be doing that all the time. There's a back and forth, like: I saw they had this new diagnosis, and was a medication started for that new diagnosis?

Speaker 1:

Yeah, okay, that makes sense. So that's because you have a world model that you're trying to make sense of (I'm, like, reasoning), and you're reasoning, right. And that's not necessarily true with most LLM systems. I mean, there's new technology coming out that claims to have some degree of reflection and reasoning, but that sort of metacognition that you're doing is not necessarily present in an LLM system, and it's definitely not changing anything internal to that system. So I think that's what makes it different from a technical standpoint as well. So humans making mistakes is not the same as LLMs making mistakes, is really what I'm getting at here.

Speaker 2:

Yeah, no, I feel that, and, like, you know, at a gut level, I feel that. You're coming at it from the technical standpoint, but I feel very different about a human making a mistake compared to an AI or LLM making a mistake, just personally. So I'm curious: obviously there's ways to help humans improve, like through training, et cetera. Are there ways to make LLMs better at summarization?

Speaker 1:

Yes, so that's a great question, and people have done some work in adapting LLMs, going from the core LLMs to something more. So, just as a quick refresher about LLMs: they're basically a neural network that's been trained on the internet, and by trained what we mean is they're given a lot of text and they're trained to predict the next word, which means they've discovered patterns across all of language and they know what next word is most likely to follow, given the set of words that came before. Now, that seems kind of simplistic in some ways, right, and you're left wondering: how is it that it's generating these intelligent answers?

Speaker 2:

So often, too, right?

Speaker 1:

Well, but what's interesting about this is that I think that's one of the most remarkable discoveries of this whole LLM revolution. It's not that language modeling is new; that's been around forever. People have been developing systems to figure out the next word for a long time. I think what's new is that we have realized that when you give it a lot of data, and especially all of human language, it's somehow been able to internalize and pick up patterns in that, and some of those patterns help it with reasoning. Some of those patterns are encoded in the way we speak, right? So to some degree, human intelligence is encoded in the way we speak, at least the way people on Reddit, or whatever other data sources were used for the LLMs, speak, and encoded in that human speech is all of this other reasoning that happens, and that's what's reflected in the LLMs. Now, that's at the very core of what the LLM does: it just predicts the next word.
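As a toy illustration of what "predicting the next word" means, here is a minimal bigram sketch in Python. It is a drastic simplification: real LLMs use large neural networks over subword tokens, not word-pair counts, and the tiny corpus below is purely illustrative.

```python
from collections import Counter, defaultdict

# A tiny toy corpus standing in for "the internet".
corpus = "the patient was discharged home . the patient was admitted overnight ."
words = corpus.split()

# Count how often each word follows each other word (a bigram model).
following: defaultdict[str, Counter] = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str | None:
    """Return the word most often seen after `word` in the toy corpus."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("patient"))  # -> "was"
print(predict_next("was"))      # -> "discharged" (tie broken by first occurrence)
```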

Speaker 1:

But people have developed techniques to enhance that, and enhance it in a way that allows it to answer questions. If it was just completing words, if you ask it a question, it might just ask another question, because that's what you're doing; but because it knows that when you ask a question you want an answer, it gives you an answer. So there's a degree of what they call fine-tuning, which is turning the internal knobs of the LLM automatically through training to do a specific task, and those sorts of techniques are what people do on top of LLMs. There's all kinds of other stuff that people do now to make the quality of the LLMs better. One simple, straightforward thing is what's called prompt engineering, or in-context learning, where basically you tell it in different ways what you need, and if it doesn't give you what you need, you tell it differently.

Speaker 2:

So I'm saying, like, hey, I'm a doctor, this patient was just hospitalized, I need a summary of everything that happened over the last 17 days, including their notes, specialty notes, imaging, labs, and medications, something like that. And I could keep working at it until I get the output I want. Or I think I could even give it an example: here's a discharge summary of a different patient and it has all the components I want, can you produce something like that for me?
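Here is a minimal sketch of what a prompt along those lines might look like in code. The `call_llm` function is a hypothetical placeholder for whatever model API is actually in use, and the instructions and one-shot example are illustrative, not a validated clinical prompt.

```python
# Hypothetical sketch of prompt engineering / in-context learning for a
# discharge summary. `call_llm` is a placeholder, not a real API.

EXAMPLE_SUMMARY = """\
Example discharge summary (one-shot example for the model):
Hospital course: ...
Discharge medications: ...
Follow-up plan: ...
"""

def build_prompt(notes: str, labs: str, medications: str) -> str:
    """Assemble the task, an example, and the source documents into one prompt."""
    return (
        "You are assisting a physician. Summarize this hospitalization.\n"
        "Be concise, comprehensive, and accurate. Do not invent diagnoses,\n"
        "medications, or results that are not in the source documents.\n\n"
        f"{EXAMPLE_SUMMARY}\n"
        f"Daily and specialist notes:\n{notes}\n\n"
        f"Labs and imaging:\n{labs}\n\n"
        f"Medications:\n{medications}\n\n"
        "Now write the discharge summary:"
    )

def call_llm(prompt: str) -> str:
    """Placeholder: swap in the actual model client being used."""
    raise NotImplementedError

if __name__ == "__main__":
    # Inspect the assembled prompt before sending it to a model.
    print(build_prompt(notes="...", labs="...", medications="..."))
```

Iterating on the instruction wording, or swapping in a different one-shot example, is exactly the "keep working at it until I get the output I want" loop described above.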

Speaker 1:

Yeah, exactly, and so that's prompt engineering. And prompt engineering tends to be kind of sensitive, which is that if I had said to it you're the greatest summarizer ever, it might do a better job than if I hadn't said that. And it's reactive because it's taking in words and producing more words, and it's sensitive to lots of things with a prompt and, as a result, just because your prompt works for one or two examples that you've been playing with, there's no guarantee that it's going to work all the time.

Speaker 2:

Is this one of the ones where you can say, oh, I'm going to tip you $10, and it will do a better job?

Speaker 1:

Yes, those are the kinds of things that people have observed. Again, people have observed, because we don't understand the internal mechanism fully; we have to take its behavior and do experiments with that. (I got it.) But prompt engineering is one world. Related to prompt engineering is the context, and what people mean by context in the AI space, at least with LLMs, is the things you give it to help it answer your question.

Speaker 2:

So you could... you know, you don't have to give it Reddit, basically, is what you're saying.

Speaker 1:

Well, it's already internalized Reddit. You're not giving it more Reddit.

Speaker 2:

Okay.

Speaker 1:

What you're giving it is in the prompt. It's part of prompt engineering at some level, because in the prompt you're giving it a task (and other things, about how great it is), but you're also providing all the resources it needs to do the thing.

Speaker 2:

So you're saying like, okay, you should be looking at these notes, but you shouldn't be like going on the internet and looking up what an EF means, for example.

Speaker 1:

Yes, and so that direction sort of led people to this technology called retrieval-augmented generation, or RAG, which is the idea that the LLM should not just be answering questions out of its head, but instead actually looking at specific documents that you have, to answer questions about those documents. So what they do is: it has a whole database of documents, and when you ask a question, it does a separate search to figure out which documents are most closely related to your question, takes the contents of all those documents, dumps them into the prompt, and then runs the AI system. (Oh, interesting.) That way the AI system is not relying purely on what it knows from before; it's just been given this prompt with this recent data, which it then uses to answer the question.
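Here is a minimal sketch of that retrieve-then-prompt loop, using TF-IDF similarity as a stand-in retriever; production RAG systems typically use neural embeddings and a vector database. The chart snippets, question, and prompt wording are illustrative assumptions.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the chart
# documents most similar to the question, then stuff them into the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chart_documents = [  # illustrative stand-ins for chart notes
    "Echo 03/2024: ejection fraction (EF) 30 percent, moderate mitral regurgitation.",
    "Discharge medications: furosemide 40 mg daily, lisinopril 10 mg daily.",
    "Nursing note: patient ambulating independently, tolerating a regular diet.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the question (TF-IDF cosine)."""
    matrix = TfidfVectorizer().fit_transform(docs + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:top_k]]

question = "What is this patient's ejection fraction?"
context = "\n".join(retrieve(question, chart_documents))
prompt = f"Answer using only the documents below.\n\n{context}\n\nQuestion: {question}"
print(prompt)  # this assembled prompt would then be passed to the LLM
```

In a real pipeline, the assembled prompt would be sent to the model, ideally with citations back to the retrieved documents, as described next.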

Speaker 1:

And I'm not typing that into the prompt myself, I'm just... You're not doing it, you're just asking the question, but you're telling it: look at this database of documents and answer it from there. I got it. And it can even cite documents; it can even point to which ones it got the answer from, and so on, pretty easily. And that's what a lot of tools are using right now, if you look at the various research tools that people use with AI systems: Elicit for research, Harvey for legal, Open Evidence...

Speaker 2:

Open Evidence, my favorite. Right, right.

Speaker 1:

Right, that's kind of what they're doing there. There's also another line of work, which is something one could do to an LLM: fine-tune it for specific tasks. That is, give it lots of examples of the specific things you want and then actually change the internals of the system to produce those outputs. And there are ways to do that where you do it at a limited level versus over the whole network. They have mixed results.

Speaker 1:

Some things work, some of those don't work, but there are technical issues with some of that. When you fine-tune something extensively, you can have it forget things that it knew before, which could affect its reasoning. Right. The reason LLMs do so well with reasoning in general is that, because they've looked at all this data, they've internalized what it means to reason. Now, if you go and change all that because you have a fine-tuning task, it's possible that it messes with that original reasoning too (interesting), and that's called catastrophic forgetting. And on a related note, about the RAG thing: if you give it a lot of context, there's also something called a recency bias, where if you give it a thousand new words of context in the prompt, it's not necessarily going to be coherent with what it said before, because of the length of words that it has to go through first before it gets to your output.

Speaker 2:

Yeah, I feel like that one could apply to humans too: all this new, recent information and data came in, so we'd probably be paying more attention to it. But again, I think it's interesting because of this kind of balance, where you feel like, okay, a human can truly reason and can be held accountable, whereas if I'm looking at an AI summary, I just expect it to be correct. Yeah. And if it's not, or if I find that it's not trustworthy in one or two situations, I'm going to be like, okay, I'm going to put this aside. Right, right, and that's it, right.

Speaker 1:

I mean, you lose trust immediately, and that's not great. And ultimately, I think the issue here is about LLMs making judgments, which is sort of a meta issue in some ways, because it's deciding what to put in and what to leave out.

Speaker 2:

It has to, right? Because it's a summary. Like, when I'm writing a summary, I'm deciding what to put in and what to leave out. But how does it decide what's the most important?

Speaker 1:

Right. And not only that, it's also deciding what to stress and what not to stress, what to place in front of you and what not to place in front of you. And there is this notion... well, LLMs are very linguistically competent. They are very capable of speaking to you in a way that is very clear, and it sounds like they, you know... I call it the illusion of competence, because it seems like they're speaking fluently, but that doesn't necessarily mean they understand everything.

Speaker 2:

Oh, I've definitely... I feel like I've experienced that before. I've read something where I was like, oh yeah, that sounds correct. And then you take a step back and you're like, wait a minute, just because it sounds good doesn't mean that it's actually correct. But I feel like it would be very easy to fall into that trap if you read a summary and you were like, oh yeah, sounds good; you're not necessarily critically thinking about each component of it. That takes a lot of effort and energy, exactly.

Speaker 1:

And that, I think, is one of the biggest problems, which is that you have, potentially, the risk of over-reliance that can happen as a result. People are reading these summaries, the summaries read well, studies have shown that they work in general, so, because I'm tired and I've got lots of things going on, I'm just going to sort of sign off on the next summary. And if the physician is accountable, well, fine, that's one thing, but then over time you're going to begin to lose some expertise, because people are going to rely on this, and then produce the next thing out of that, and the next thing out of that, and there's going to be sort of a snowball effect. And I think that's one of the biggest risks here, at least from my perspective.

Speaker 2:

Yeah, I feel like I'm leaving this being like, I'm not sure this is a great idea, I'm not sure this is ready for prime time. I want it. I think it would make our lives so much better and easier, but there are some big gaps that still exist.

Speaker 1:

Yeah. Personally, I think that physicians should work closely with AI people to identify how to measure how good a summary is. I know that they have some metrics of conciseness and how comprehensive it is and so on, but I think we need to address directly what makes a human mistake different, from a systemic standpoint, not just from a "here is a specific machine learning experiment that we did" standpoint.

Speaker 2:

Yeah, we took a hundred examples and this is what happened. But I agree, it is more of a systematic problem, because if you're seeing tons of hallucinations or tons of omissions or whatever it is, how can we fix that? Because that's how you get it ready for prime time: you find those gaps and you try to plug the holes, right?

Speaker 1:

Yeah. So I think there's lots of stakeholders here, and I think all the stakeholders need to be involved, not just the person writing and reading the summaries. There are accountability aspects, there's licensure, there's the whole system that's in play.

Speaker 2:

Yeah, I completely agree.

Speaker 1:

Well, yeah, it's not all gloom and doom. I mean, these results are great and they're interesting, so I think there's more to talk about is, I guess, what we're getting at. Yeah, so I think that's a great place to stop here. Do you have anything final to add? Final thoughts, notes?

Speaker 2:

I have to say, I came into this really excited for this topic, and I am still excited about this topic. I think that it could be really helpful, but I want to see more progress made. I want to see the LLMs do better than they're doing right now. And I understand that a lot of the studies show that they're as good as humans, but I don't know that that's good enough for me, because humans make a lot of mistakes, and the mistakes that the AI makes, in some cases, especially if they're carried forward, could cause a lot of problems.

Speaker 1:

That's a great place to end. Thank you for joining us.
