Code & Cure

#29 - AI Hype Meets Hospital Reality

Vasanth Sarathy & Laura Hagopian


What really happens when a “smart” system steps into the operating room and collides with the messy, time-pressured reality of clinical care?

In this episode, we unpack a multi-center pilot that streamed audio and video from live surgeries to fuel safety checklists, flag cases for review, and promise rapid, actionable insight. What emerged instead was a clear-eyed lesson in the gap between aspiration and execution. Across four fault lines, the story shows where clinicians’ expectations of AI ran ahead of what today’s systems can reliably deliver, and what that means for patient safety.

We begin with the promise. Surgeons and care teams envisioned near-instant post-case summaries: what went well, what raised concern, and which patients might be at risk. The reality looked different. Training demands, configuration work, and brittle workflows made it clear that AI is anything but plug-and-play. We explore why polished language can be mistaken for intelligence, why models need the right tools to reason effectively, and why moving AI from one hospital to another is closer to a redesign than a simple deployment.

Then we follow the data. When it takes six to eight weeks to turn raw footage into usable insight, the value of learning forums like morbidity and mortality conferences quickly erodes. Privacy protections, de-identification, and quality control matter—but without pipelines built for speed and trust, insights arrive too late to change practice. We contrast where the system delivered real value, such as checklists and procedural signals, with where it fell short: predicting post-operative complications and producing research-ready datasets.

Throughout the conversation, we argue for a minimum clinically viable product: tightly scoped use cases, early and deep involvement from surgeons and nurses, and data flows that respect governance without stalling learning. AI can strengthen patient safety and team performance—but only when expectations align with capability and operations are designed for real clinical tempo.

If this resonates, follow the show, share it with a colleague, and leave a review with one takeaway you’d apply in your own clinical setting. 

Reference:

Expectations vs Reality of an Intraoperative Artificial Intelligence Intervention
Melissa Thornton et al. 
JAMA Surgery (2026)

Credits:

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/


SPEAKER_01:

When it comes to AI and healthcare, are we overestimating what AI can do and underestimating what humans still have to do?

SPEAKER_00:

Hello and welcome to Code and Cure, where we discuss decoding health in the age of AI. My name is Vasanth Sarathy, and I'm an AI researcher and cognitive scientist, and I'm here with Laura Hagopian.

SPEAKER_01:

I'm an emergency medicine physician and I work in digital health.

SPEAKER_00:

I love this topic about expectation, because with any device, any piece of technology, we start with some expectation. Maybe it's marketed to us in a particular way. Maybe we've seen a commercial on TV about all of its promises or all of its capabilities, and we build in our heads some model of what it can and cannot do, right? And for AI, it's no different. Um, when we hear about it on TV, when we hear about it from friends, when we hear about it on our social media feeds, we immediately think it can do certain things. Sometimes we even see it in action. And, you know, it can create these amazing magical videos and things like that. And you immediately begin to believe all the other things that it can also do, right? And I think that, to me, is one of the biggest challenges right now: the mismatch between what it can actually do and what people believe it already does. And that's what today's topic is all about. And I'm really excited about it.

SPEAKER_01:

Yeah, I mean, I think it's interesting, because when I see a new tool, especially when it's marketed well, I'm like, oh, it's magic. Like it just does it for me. But then when you go to use it, sometimes you see, hey, it's not quite so magical, or there are issues. Like my Fitbit is a good example. Okay, it's tracking my sleep and my steps and my activity and all of that. Um, but one, I have to tell it when I'm working out. And two, for some reason, my Fitbit disconnects from the phone app every time I charge it. So I have to reconnect it with the Bluetooth every single time. So I'm like, okay, well, it was supposed to be this seamless, easy thing, and now it requires all of this user intervention where I have to relink it every time I charge it.

SPEAKER_00:

Yeah, and that's not even speaking to its actual functionality. That's more of its operational challenges, right? But that's always half, or maybe three quarters, of the battle: operational challenges. Absolutely, sure. And what I'm saying is, AI has this additional piece, this sort of ethereal or mysterious mind aspect to it, which opens up the possibility that people begin to think it's like a human mind, right? It can do all the things. Or even better.

SPEAKER_01:

It's even better because it's magic.

SPEAKER_00:

That's right. And that's, I mean, that in some sense is exciting, but in some sense, it can be very dangerous if the expectations are mismatched.

SPEAKER_01:

And I think this article pointed that out, essentially: we expect it to do X, and in reality it does Y. So, what did the article talk about? The authors went into a couple of different academic centers and they basically put video and microphones into surgeries, and they allowed the system to flag cases for review at morbidity and mortality conferences, to run automatic safety checklists for safety protocols in the operating room, et cetera. And then what they did was they said, okay, let's take the output from the AI, and also interview the surgeons who were using it, and see whether the output they got matched what they thought would happen. Like, did the reality of the output they received actually match their expectations?

SPEAKER_00:

And this device was literally like a listening device, right? And video too.

SPEAKER_01:

There were audio feeds, there were video feeds.

SPEAKER_00:

It was just sitting there listening, watching the surgeons perform their surgeries.

SPEAKER_01:

A little big brother-ish, not gonna lie.

SPEAKER_00:

Yeah, that's kind of what I was getting at: they're aware that this thing is there, observing them. And maybe in the moment they're not gonna remember that specifically as they're focusing and concentrating on the surgery, but it's still there, listening to everything they're saying and watching everything they're doing, and somehow bringing together all the different channels and providing some insights, like you said, either for morbidity and mortality discussions, or for safety protocols, or just for general training, right?

SPEAKER_01:

Yeah, and I think what you're pointing out, and this wasn't really the point of the study, is that like people are gonna act differently when they know they're being watched than when they're not being watched. For example, I might swear in an operating room, but I probably wouldn't if I knew I was being like recorded.

SPEAKER_00:

Yeah, I'd be curious if they just forgot, you know. I bet that would happen. Like once you're in the surgery for so long, at some point, if it's not invasive in any way, if it's not in your face, you might just forget that it's there. Um, but that's an interesting question too. Like, how do they act differently?

SPEAKER_01:

But I do think what the study then got into is like, hey, let's talk to the surgeons, let's talk to the people who implemented this whole audio-video system, and let's figure out whether their perception of what the system would do was different from the reality of what the system actually did. Yes. And they identified four major areas where the expectations and reality really did not line up. Um, and I think we can just walk through those four areas, because it's interesting to think through. And there are some cool quotes too. Um, but one of the top things that came out was that the AI needs more training. And of course, training requires human oversight, right? There's even a quote related to what we were talking about earlier, about magic: they expected the technology to, quote, magically know what we needed. Which, I mean, I think is a common expectation of AI. But the thing is, in order to get the data that you want, you kind of have to tell it. You have to tailor it. And that tailoring part, having done some prompting myself, I know takes a lot of energy and effort. Over time that can help you scale, but you're putting in a lot of upfront work to get it to do what you want it to do. And I think that was a key theme: the AI really did need more training, it wasn't automated enough, they had to tell it what to do more.

SPEAKER_00:

Yeah, that's interesting, because presumably they did do some training, right? It wasn't completely untrained. And presumably the people who put together the AI device did, in fact, prepare it for doing the things it's supposed to do, right? But maybe it wasn't consistent, or it was missing details that the surgeons thought it would not miss.

SPEAKER_01:

Or, like we've talked about in other episodes, it might have been trained at one hospital, and when you bring it to a different hospital that has different processes and procedures, et cetera, there's gonna be some overlap, but it's not exactly the same. You've trained it on data set A, and now you're using it on data set B, and it doesn't work exactly the same.

SPEAKER_00:

Right, right. And in general, and we might get to this in one of the other themes as well, I think there's an expectation of what it can do at a very fundamental level, in addition to the training, the capabilities it can have, right? Again, the fluency of the language and all of these things suggest that it has a level of intelligence higher than what it might actually have. So even when trained, it might not be able to do the things you think it can do, right? Just because it's talking about certain concepts doesn't mean it understands those concepts as well as you do. And that's where I think the training piece can be really challenging as well.

SPEAKER_01:

Yeah. And there's an interesting quote from one of the surgeons, who said they wanted to get an email right after the surgery that said, hey, this is what happened during the surgery. Here are some of the issues, this is what went well, this is what you could improve on, watch out for this post-operative complication. They had an expectation that the system could do more than it actually could, right? They wanted X information out of the system. And instead, they had to wait a long time, which we'll get to in a second, and then they got Y information out of it. And so it's like, well, what was the system designed to do? Because it's not going to do the extra stuff the surgeon wanted unless it's trained to do that.

SPEAKER_00:

Yes. And it's not just training; it might not be configured to do that, right? Right. This might also just be a usability thing, that the designers of the system didn't gather all the requirements from the customers, the users of the system, in order to properly design the interface and the set of features that are important to them, right? It could just be that, but there are layers to this, right? There's the usability layer, which people who design products are aware of, independent of whether it's AI or not. With any product, you want to make sure there's alignment between what it can actually do and what the user actually needs.

SPEAKER_01:

Well, that's it, right? You kind of have to interview the user. You have to be like, hey, as a surgeon, what would you like an AI video-audio system in your operating room to do? And they might say, I want it to do my safety checklist for me, or they might say, hey, it would be really awesome if it could help predict complications, flag them, and tell me what I could do to prevent them. Yes. And those are very different problems to solve, right? Exactly. And I think this speaks to the importance of getting that clinical input at the base level, yes, in terms of deciding what this project should actually do. Because if you want a surgeon to implement it, and it takes work to implement it, which we've seen, then it needs to have the functionality that the surgeon would want it to have.

SPEAKER_00:

To me, that first takeaway is really that we want to incorporate subject matter experts in the design process of these systems early on, and that's not happening as much, right? People are working in their own little tech silos, building these tools, thinking they'll be useful, and then trying to work with doctors and so on. But it's almost like you want the doctors and the users at the ground level, helping design these systems.

SPEAKER_01:

Right. And I think a lot of times in the digital world, we talk about, okay, what's the minimum viable product, the MVP? But it's like, what's the minimum clinically viable product, right? If you want it to be used clinically, you need the clinician. You like that acronym? It's not quite as catchy as MVP, but I think it does speak to the importance of looping clinicians in earlier, especially if they're going to be the end users. And that's true of anything that you create, right? You want to get that user input early on. Um, a second thing that came up, and I did not anticipate this one, was that it was actually very hard to get the data. So, you know, there was a lot of data being pulled in: audio, video, they used some, you know, monitoring equipment in the operating room.

SPEAKER_00:

And even on the surgical instruments, right? There were cameras in there or something. I forget exactly what that was, but I thought there were not just general room cameras, but very specific ones, like laparoscopic cameras. Yeah.

SPEAKER_01:

And so there's a ton of data there, right? And so, you know, the expectation was like, hey, it's AI. I should be able to get this information back quickly.

SPEAKER_00:

Yeah.

SPEAKER_01:

Right. And the reality was that it would take six to eight weeks to get the information back, which seems like a really long time to me.

SPEAKER_00:

Six to eight weeks seems really long. It seems like that involves human intervention, like humans were involved in pre-processing the data, in deciding how to aggregate it, how to compose it, all of those steps, which may be necessary, right? And that's the sort of hidden piece here. It's not like the AI is magically absorbing all the data and deciding how to correctly use each piece. There's a huge human element here.

SPEAKER_01:

Right. And so I sort of started to read through: why did it take so long? A lot of times you want to reflect, especially if a case didn't go well. If you want to bring it to a conference and talk about, hey, what could we do better, you want to do that within the week, so that everyone remembers what happened and can make comments about it. Six to eight weeks is like, you've forgotten about it already. Um, so what had to happen here was the surgeon had to actually say, this is the case that I want to review. They had to send a message to the vendor to get the video. Then the vendor had to pull the video and de-identify it, so there was no, you know, personal health information in there. And then it had to be reviewed for quality control. And only then could the surgeon go back and look at it for, you know, learning points. Got it. So there were all these barriers and time constraints along the way that made it very labor intensive and very slow. Yeah. And usually when there's an M&M conference, a morbidity and mortality conference, it happens pretty quickly after the case. Yes. So that it's fresh in everyone's mind. It's not two months later.

SPEAKER_00:

Right, right, right. Oh, that's very interesting.

SPEAKER_01:

Yeah. I think even the paper comments on this: you know, it caused some frustration, because they wanted to use this intervention to review their cases and improve their performance, and it really was not a seamless experience. Um, and then they were like, well, what's actually being done with the data if it's just sitting around? Yeah. So that was number two. The third area where there was a big disconnect was around post-operative complications. And this goes back to the point we were talking about earlier, where maybe the system was not actually built to do this. But if you talk to a surgeon and you're like, hey, what do you want a video and audio system in the operating room to do? One of the main things they said they wanted it to do, and it wasn't doing in this case, was to improve post-operative outcomes. So after surgery, you want to ask: did anything happen inside the surgery that could explain any complication that might happen afterwards? For example, can we predict which patient may go on to develop a pneumonia after their surgery? And if so, is there anything we can do to help prevent it?

SPEAKER_00:

Yeah.

SPEAKER_01:

Now, post-operative complications are not super common. So it might have just been like a small sample size here.

SPEAKER_00:

Yeah.

SPEAKER_01:

Um, and also there's a lot that goes into creating predictive models, right? And that's not what this was for. Um, and so maybe there needed to be some education around, like, hey, this is what this is supposed to do. But also, you know, if this is something that's important to the surgeons, maybe it's something that should have been considered when this was piloted.

SPEAKER_00:

Yes, yes, that's right. Or somehow set the expectation up front, so the surgeon isn't expecting that feature to be available to them.

SPEAKER_01:

I would want that. Like, that's a feature that I would point out right away. Like, who's at risk and what can we do to reduce the risk? Because what you don't want is you don't want someone sticking around in the hospital getting sicker and sicker after their surgery. You want them to do well and you want them to go home. So if we can predict who is at risk, then we can try to intervene on that. And that's like a key area where maybe AI could help.

SPEAKER_00:

Yeah. Yeah.

SPEAKER_01:

But in this case it didn't.

SPEAKER_00:

Right. Very interesting.

SPEAKER_01:

And then the fourth area was that people wanted this AI intervention to give them more data that they could then use for academic purposes, like research and quality improvement projects, and it just didn't do that either. The whole idea is, hey, can we use this information to, you know, publish things, to further our careers? Can we gather insights in a less time-consuming way, right? Can we disseminate that information? They wanted more deliverables from it, basically, and they didn't get that out of it either.

SPEAKER_00:

Yeah, I mean, that's not surprising to me either, because again, to me, that's an operational question, right? It's got this element of: I have all of these things that I would like, and this is what this provides. I don't need everything it provides, but at the same time, I need all these other things that it doesn't do. And some of those things are, again, very system-based. That is, the builders of these systems talking to the users of the systems and deciding what features are important and what features are not. But I want to take a moment to reflect on the AI portion of this. Yeah, not the data portion necessarily, or the data collection, or any of those things, but just the expectations we have of our AI systems as they stand today. When we use ChatGPT, when we use all of these systems, what are we expecting? Miracles. We're expecting miracles. We're often going in there hoping to offload some of our own thinking and have it think for us, so that we can be even more creative, even more efficient, right? That's kind of the thought process. And I want to contrast this with the expectations we have of other things in our house, calculators, for example, or vacuum cleaners. We don't have all kinds of crazy expectations of these devices. It's very focused. We go to the calculator to calculate, specifically add, subtract, multiply, divide, and a few things like that. We don't go to the washing machine expecting something strange. We expect a very focused application that does the one thing it's supposed to do, right? With AI, it's weird, because now we've released this general purpose tool for which we don't have any set expectations. The only expectation is that when we write something in, it responds to us, it talks back to us, right? And so everything else is a mystery.
And we're constantly advertised to about all its wonderful capabilities. It can write emails for you, it can do this, it can do that. But all of that is very vague. None of it constrains the set of features it actually can and can't do. And the simplest example I'll give you of this is that it can't actually do numeric calculations. You can give it simple one-digit, two-digit additions and it'll do that. But once the numbers get bigger and bigger, it is not going to be able to do it. So even simple calculations, which you would expect a, quote, intelligent system to handle, it won't be able to do. And that's fundamentally a restriction of the way it works, right? We've talked about this before: it's predicting the next word in a sequence of words. So to it, a math problem is just another sequence of words, not a real math problem. If I give it 25 plus 25, the idea of adding two quantities of 25 to make 50 is not how it thinks about that question. It thinks about that question as a sequence of words from which it has to produce the next word. Now, it so happens that it gets a lot of these math problems correct.

SPEAKER_01:

I was gonna say, I'm gonna type in 25 plus 25. It'll probably get it. I bet it's gonna get it right.

SPEAKER_00:

It will probably get it correct, because again, that's a small number, and it's probably seen that in the data quite a bit. So I would not be surprised if it gets it right. It did. Okay, good. But there have been lots of studies on its ability to do basic math, and once you add more digits, it gets more and more difficult for it. And again, that's an expectation, right? You go into ChatGPT, and the average user is going to expect it to be able to do math, because that's the thing machines are better than humans at, right? In general. And the idea that a machine can't do that is already problematic. And we start to rely on that. Now, a system like ChatGPT, for example, is much more sophisticated than an average large language model. So when it's given a math problem, what it might do is reach out to a calculator, have it calculate the right number, and then report back the answer. That's better than just guessing the next letters, right?

SPEAKER_01:

Right, that makes a lot more sense.

SPEAKER_00:

So it's using a tool, like a calculator, effectively. And you see the word agents mentioned a lot; AI agents are that, right? They're the core large language models, plus all these other tools they have access to. Now, a calculator is a nicely defined tool, but what about a tool for, say, evaluating a surgery? That's not so straightforward. The type of reasoning a human would do when looking at a video, for instance, if a human were watching the surgeon perform, they might look at things and have a predictive reasoning process for how things could go wrong. That whole model is not necessarily available to the large language model. That might be a separate tool call it would need to make, but it doesn't have that tool. And if it doesn't have the tool, it's completely guessing the next set of tokens. So now you have a system where you don't know what capabilities it has. It's speaking to you in this fluent way, and therefore you assume that, like a human, it kind of thinks like us, so it must be able to do this. Plus, it's a big machine, so it probably can do a lot of math, plus it can do a lot of these cool things. And so your expectations start to rise quite a bit. And we see this repeatedly: like I said, linguistic fluency gives you this false illusion of competence. We see this repeatedly in the robotics space as well. There's a lot of research on this about agency, about robots having eyes, right? You think that's nothing. You put a little eye on the robot. But eye gaze gives humans the sense that the robot has a certain degree of agency, that it's looking at you. And when it's looking at you, it's looking at you like a human would look at you. And it's not; it's a camera, and the camera has some limited set of capabilities. But you assume it has all these other capabilities that it doesn't actually have. And people have researched this extensively, and I think the same sort of idea applies here too.
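[Editor's note: the "language model plus tools" pattern described above can be sketched in a few lines of Python. This is a hypothetical toy, not any real product's implementation: the routing rule, function names, and fallback string are all invented for illustration. The point is the design choice: questions a narrow, reliable tool can answer get delegated to it, and only the rest fall back to open-ended text generation.]

```python
import re

def calculator_tool(expression: str) -> str:
    """A narrowly scoped, reliable tool: exact integer arithmetic."""
    a, op, b = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*", expression).groups()
    x, y = int(a), int(b)
    result = {"+": x + y, "-": x - y, "*": x * y}[op]
    return str(result)

def agent_answer(question: str) -> str:
    """Toy agent loop: route arithmetic to the calculator tool instead of
    letting the language model guess the next tokens."""
    match = re.search(r"\d+\s*[+\-*]\s*\d+", question)
    if match:
        # Delegate to the tool: exact answer regardless of digit count.
        return calculator_tool(match.group())
    # No suitable tool available: fall back to (unreliable) text prediction.
    return "[model produces a plausible-sounding guess]"

print(agent_answer("What is 25 + 25?"))          # -> 50
print(agent_answer("What is 987654 * 123456?"))  # exact, unlike raw token prediction
```

The asymmetry in the episode falls out of this structure: arithmetic has a crisp tool, but "evaluate this surgery for post-operative risk" has no such tool in the box, so the system falls through to guessing.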

SPEAKER_01:

For sure. Like, you know, when we look at what this system did, it did do things to improve patient outcomes, and it did do things to improve team performance when effectively delivered, right? But it didn't do all these other things people had these high expectations it could do, because it just wasn't programmed to do them. It didn't have the information to do that; it had the information to do a safety checklist. Yes. And it had the information to do a couple of other things, but it didn't have the information to predict post-operative complications, and it didn't have the ability to turn things around quickly. And especially when you get into this more specialized stuff, you have to give it the knowledge of what you want it to do, which takes time and energy and implementation effort. Yes, exactly.

SPEAKER_00:

Exactly. Yeah. So I think this question of expectation is going to be ongoing, because all of us using AI systems have to be conscious, constantly aware, that it's an AI system, it works differently from humans, and just because it does something really well doesn't mean it can do all the other things you assume it can do really well, right?

SPEAKER_01:

I have to say, sometimes when I use these systems and they mess up, I'm like, that came out of left field. I had no idea that error was even possible. It wasn't even in the realm of my thinking. Like, no human would have made that error. It's just so weird and random. But then you go back to, oh, how is the system actually working? And the fact that it's, you know, sycophantic.

SPEAKER_00:

Yeah, sycophantic, which, for those who don't know that word, is just a fancy way of saying it's trying to please you.

SPEAKER_01:

It's trying to be helpful over anything else.

SPEAKER_00:

I'm just proud that I pronounced it correctly. There you go. Um, but it's also an average, right? It represents the average. It's an average of all the knowledge, right? It's not the focused, specialized knowledge that a surgeon might have. It's this average, collective human knowledge.

SPEAKER_01:

Yeah. So I do think this is interesting, and it's not a topic that's going to go away, because we don't fully understand how these things work. And because there is a black box, there is this element where we think it's magical, almost miraculous, right? And the reality is that it doesn't do what we expect it to do in so many scenarios. And if we want it to do those things, it requires a lot of human energy and effort to get it there. Yes. And that is very interesting, because we expect this very quick turnaround, and that it's going to improve efficiency, and it may not do that out of the gate. Yes. So I think we can end there, and we'll see you next time on Code and Cure. Thank you for joining us. Bye bye.