Code & Cure

#6 - AI Chatbots Gone Wrong

Vasanth Sarathy & Laura Hagopian Season 1 Episode 6

What if a chatbot designed to support recovery instead encouraged the very behaviors it was meant to prevent? In this episode, we unravel the cautionary saga of Tessa, a digital companion built by the National Eating Disorder Association to scale mental health support during the COVID-19 surge—only to take a troubling turn when powered by generative AI.

At first, Tessa was a straightforward rules-based helper, offering pre-vetted encouragement and resources. But after an AI upgrade, users began receiving rigid diet tips: restrict calories, aim for weekly weight loss goals, and obsessively track measurements—precisely the advice no one battling an eating disorder should hear. What should have been a lifeline revealed the danger of unguarded algorithmic “help.”

We trace this journey from the earliest chatbots—think ELIZA’s therapeutic mimicry in the 1960s—to today’s sophisticated large language models. Along the way, we highlight why shifting from scripted responses to free-form generation opens doors for innovation in healthcare and, simultaneously, for unintended harm. Crafting effective guardrails isn’t just a technical challenge; it’s a moral imperative when lives hang in the balance.

As providers eye AI to extend care, Tessa’s story offers vital lessons on rigorous testing, transparency around updates, and the irreplaceable role of human oversight. Despite the pitfalls, we close on a hopeful note: with the right safeguards, AI can amplify human expertise—transforming support for vulnerable patients without losing the empathy and nuance only people can provide.

Reference:

National Eating Disorders Association phases out human helpline, pivots to chatbot
Kate Wells
NPR, May 2023

An eating disorders chatbot offered dieting advice, raising fears about AI in health
Kate Wells
NPR, June 2023

The Unexpected Harms of Artificial Intelligence in Healthcare
Kerstin Denecke Guillermo Lopez-Compos, Octavio Rivera-Romero, and Elia Gabarron
Studies in Health Technology and Informatics, May 2025

Credits: 

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

Speaker 1:

A chatbot designed to help people with eating disorders started giving them diet tips. Let's unpack the troubling story of Tessa and what it means for AI chatbots and help.

Speaker 2:

Hello and welcome to Code and Cure. My name is Vasant Sarathy and I'm here with Laura.

Speaker 1:

Hagopian.

Speaker 2:

Hi Laura, hi, How's it going?

Speaker 1:

I'm doing well.

Speaker 2:

Good, good. What are we talking about today?

Speaker 1:

We are talking about a story from a couple years ago of a chatbot, tessa, that was released by the National Eating Disorder Association. And basically, I mean not surprisingly. I think we all know that mental health issues really skyrocketed during and after COVID, and so the National Eating Disorder Association had a line, had people you could call in and talk to, and it was just like it was too much. And the question is how do you scale something like that right when there's humans on the other end? And the truth is, not only are there humans on the other end, but you need providers to connect people with, and there's not enough providers either. It's just really hard to keep up right.

Speaker 2:

Right right.

Speaker 1:

So in my mind, in many, many people's minds, this is like a great opportunity for ai to step in, and that's what they decided to do. They partnered with another organization and they created this chatbot named tessa um, and at first it did okay. It was sort of a rules-based um, and so there were responses that were like basically baked into the system, like if, if, if you say X, the chatbot Tessa is going to say Y in response, and so that's like it had guardrails up around it and they were all kind of predetermined, like you know exactly what the chatbot is going to say because we've programmed it that way. But then they updated it and the new system used generative AI and what happened is that when it used this new data, people started chatting in and kind of testing it, seeing how it worked, and they found that it gave them comments and advice to restrict their eating, which is like the opposite of what you want in someone who has you know anorexia or bulimia or some other disordered eating.

Speaker 1:

Yeah, what?

Speaker 2:

they, what should they have been, what should it have been telling them?

Speaker 1:

Yeah, I mean that's the thing is like it should not have been telling them things like oh you should lose one to two pounds a week, oh you should restrict your diet by 500 calories per day, because there are ways to be healthy without doing those things. That might sound like okay advice right to someone who has obesity or who's overweight, that you would want to track your body weight on a scale or that you might want to use other measures like a caliper or something like that, but for someone with disordered eating, that's that's not helpful that's the opposite of helpful that's not what you want.

Speaker 1:

You know a chatbot to be giving out advice around well, that's awful, that's awful yeah. So now I'm gonna like turn it to you, because that's the story, but I want to know how AI chatbots work, like how did this happen, how did this go wrong, you know, and how do these AI chatbots work?

Speaker 2:

Yeah, no, I'm happy to chat about this. Chat about this A human.

Speaker 1:

A human chatting about this, yeah that's right.

Speaker 2:

Not a chatbot and not a robot, although sometimes I do sound like one.

Speaker 1:

Can we multiply you by 28,000 and just get you out in the world? Would that work?

Speaker 2:

Well, that is the point of this right. These chatbots help you with scaling and dealing with large volume, high volume of information, and they get to be on for 24 hours a day, seven days a week, and they never get tired, they never get frustrated.

Speaker 1:

What's that like?

Speaker 2:

get tired, they never get frustrated.

Speaker 1:

And what's that like? So I mean that's the draw right.

Speaker 2:

They could be in different languages too. Right, right, right, right, and, and so that's really powerful. Uh, from a use case perspective and people, I can, understandably, people get excited about them and um, and so I think, to really understand chatbots, uh, it might be worthwhile doing a little bit of a AI history lesson, if you will indulge me. Do I have a choice?

Speaker 1:

No.

Speaker 2:

So chatbots is something that people have been striving for forever. In AI, that's like the gold standard. I mean, people talk about the famous Turing test, which was intended by Alan Turing not necessarily as a test. Intended by Alan Turing not necessarily as a test, but as a sort of a way to think about intelligence was, if an AI system or a machine was able to converse with you in a way that you couldn't distinguish if it was a machine or a human, then we've achieved a certain level of intelligence. Now, that was the original sort of the formulation of the Turing test, but of course it was very quickly dispelled.

Speaker 2:

I mean, there were chatbots along the way that people designed that could easily deceive and win. You know, there was even a prize, I think the Lubner Prize or whatever. That was intended to find chatbots that would trick humans, and people developed all kinds of chatbots for that. There was one chatbot that I think is really funny was the chatbot that pretended to be an Eastern European teenager and anytime you asked it a question it would just like respond randomly and people thought once they got that profile, they thought that it was a real human.

Speaker 2:

Now, acting in a predictable way, acting in a way that an unpredictable teenager would act. But that's the point that. So the chatbot world started back then. In a sense, that that was always a goal for ai was to be able to develop these systems that can talk to you and connect with you. Um, that was 1950s, 1960s. Um, there was a famous um chatbot called eliza that was built by joseph weisenbaum at mit and he was a scientist there and he created this little chatbot that was completely rules-based. What that means is it was actually modeled off of a Rogerian therapist, which is it's a style of therapy in which you encourage the person. You kind of say back what they said to you and ask them to elaborate more or ask a question. For example, if I asked you like how was your day, you might say as a therapist. You asked you like how was your day, you might say as a therapist. You might say well, how was your day? You would say it right back and maybe in a different tone or whatever.

Speaker 1:

I was going to say like your tone. Can you repeat that please?

Speaker 2:

I'm sure I'm not doing this form of therapy any justice. Any therapists out there, please? You can yell at me, but the point was it was a very simplified rule that Joseph Weizenbaum created, and basically he created a little chatbot that would ask questions back, and you can easily find these online if you want to play with one. But what was interesting was he gave it to his then secretary to try, and she was completely blown away by how amazing it was and how it truly understood her.

Speaker 1:

Wait, just by asking her the same questions. Right, and this was remarkable back then. Right, there was nothing like this back then.

Speaker 2:

So this was truly remarkable and enough so that the legend has it that Weissenbaum quit AI because he was so afraid of what he was going to create from this. But Eliza really started setting off a lot of different types of AI systems. That was a rules-based system, where you have a set of rules and it produces the output. You take the input, you add a couple of sentences and you add. You know, you send the output back.

Speaker 1:

Wait, I just want to like back up here. So the person who invented this got so scared by his own chatbot that he was like never again. I don't want this.

Speaker 2:

Yeah, that's right. That's right. He was worried about what it could do because of the impact it had on his secretary.

Speaker 1:

So tell me a little bit more about that. What impact did it have on his secretary, I mean?

Speaker 2:

I think it's the emotional connection that she was able to make with that machine so quickly that scared him, even though he knew exactly how it worked. She knew it was a machine. She was staring at a computer screen and typing into it. In fact it wasn't even a voice or anything and that was enough for her to be attached. And I think that was scary enough that you can form that kind of a bond with the machine that quickly.

Speaker 1:

Of course there'd be numerous.

Speaker 2:

Hollywood movies, expanding on that theme, of course, um, her being, I think, a really popular one, um, but so so fast forward to the 1990s and you had more and more of these sorts of chat spots, and there was one that's called alice that came out in the 1990s and it was also kind of rules based, but it was a little different. It was what's called retrieval based in in this. In this one one, there were some rules involved, but there was a set of canned responses that was available to it and it was to return one of those canned responses. So it was definitely within control in the sense that it only produced those canned responses, but there was a large enough set that it didn't matter that you were able to then create. You were able to, like, customize your chatbot depending on your use case for producing the appropriate kinds of canned responses.

Speaker 1:

So it like felt to the end user like it wasn't canned, even though it was probably.

Speaker 2:

Yes, and there was enough variety, enough variations that there was. You know, even though it was canned, it was not, it was-.

Speaker 1:

It didn't feel canned on the other end. Yeah, yeah, okay, that's fair.

Speaker 2:

So it didn't feel canned on the other end. Yeah, Okay, that's fair automatically from data. So in these machine learning systems you would have systems trained to retrieve the best message automatically, as opposed to having a rule that says, if you get this input, return this message. It would say, if you get this input, then find a message most similar or something like that and return that. So there was a little bit more variety. It was using machine learning techniques, but there was still what's called retrieval-based methods. It was still retrieving canned responses for the most part.

Speaker 1:

And so there's this guardrail in those canned responses. Like you can only choose from this small. You know these canned responses that are available. You can't like choose from anything yeah.

Speaker 2:

And just to be clear, it wasn't like people didn't want, like, completely fluent responses. It wasn't like they chose canned responses because it had these guardrails. It was that nobody knew how to create natural speech. So back then we couldn't, we didn't have enough understanding. I mean, there were linguists working on this, there was natural language people working on this, trying to really understand how human speech works and all of its nuances and all of its peculiarities, and it was really hard to make it sound natural unless you already had a natural response written up.

Speaker 1:

Wait. People used to know that it was actually a bot talking to them, because, like now, you can't even tell if you're talking to chat GPT or something Because of the fluency? Yeah, yeah, I have you know, I have no idea.

Speaker 2:

Well, that was one of the most remarkable things about the LLM revolution is that now you have this completely fluent system that understands all the human, that speaks like a human, that uses all the ums and the uhs and the appropriate language and the intonations and all that stuff, just to get you, you know, to believe that it's real. But that's what they were trying for too, but they just didn't have the technology back then to do that, so they retrieved canned responses. Generative chatbots produced really weird things they didn't produce sometimes the sentences were not grammatical and things like that so they just didn't have that technology back then. Siri came out in 2010,. Also mostly retrieval-based Alexa 2015. Now you're working in the realm of more neural networks, but still not our today's chat GPT-LM world. Remember that revolution happened in 2021, 22, 23. But Alexa came out in 2015, 22, 23. But Alexa came on 2015.

Speaker 2:

Very, very popular and once Alexa came out and people were using neural networks a lot, it was being used in a lot of different. There was a lot of use cases that people started applying it to. Not all of them were great use cases, but some of them. There was a famous chatbot that was thrown up on Twitter by Microsoft called Tay tay, and it ended up becoming really racist and nasty because what it was designed to do was to be more conversational and natural, and the way it did that was to learn from its conversations online, so it would say something you just would say something back to it and it would add that to its data set and train on that and retrain and improve itself.

Speaker 2:

Of course, improving itself wasn't improving itself wasn't really what it was doing. What it was doing was learning all the people who are trolling it with all of this you know, racist, anti-Semitic, all that stuff and making it worse. And then Microsoft had to pull it down.

Speaker 1:

Which I mean it goes to show you that the training data that you put into it impacts what you get out of that. So if you put in biased things or if you put in you know slurs or whatever it is that you tell it is okay, then that's what it's going to repeat because it doesn't know any better.

Speaker 2:

Right and in fact, that sort of problem hasn't gone away. Just a few weeks ago, I think early july of this year, uh x, uh, formerly twitter, um put out a uh, a chatbot called grok. I forget what it's, what it's called, but it turned anti-Semitic and racist pretty quickly also and they had to deal with that. And so, you know, you put these things out in the wild and you allow them to train from what they're observing, then, without any guardrails, there's no guarantee that they're going to actually produce anything, that is, I mean, there's no guarantee there's nothing to stop them from being, you know, horrible.

Speaker 1:

All bets are off right, and that's probably how this transition of Tessa from rules-based to generative maybe created a problem.

Speaker 2:

Yeah, well, you know, I do want to point out that we're talking more in terms of problems, right? So far we've been talking in terms of problems, and I want to just take a moment and say that I do appreciate the efforts in using these sorts of things to address the scaling problem, address the volume problem.

Speaker 1:

Oh, absolutely. These are real problems the time, the 24-7, the different languages.

Speaker 2:

Like there are so many applications that sound amazing, yes, and I want to take a step back and and not knock on those folks trying these things, because it is actually important still and, like I said, people have tried it in many different applications. There was a Barbie doll in 2015 that had an AI system in it and it didn't quite work as intended, but the idea was that you would talk to the child in a way that is kind or whatever, and the systems did work in the sense that they were kind, but they were not empathetic enough. There's a very funny story I think there's a New York Times article about this about how a girl was complaining to her Barbie doll about her sister and the Barbie doll just kept saying you should be nice to your sister, and the girl was like no, my sister just broke my toy. Of course not, I'm mad, and it just didn't get it. Sorry, the robbery doll didn't get it.

Speaker 2:

So there's stuff like that that happened, but for the most part, people were trying to put it everywhere, and an example of a really positive use case was in 2019, when the new york police department actually employed a chatbot to um, stop, mitigate, track, whatever uh sex trafficking, and what they did was they pretend they created chatbots that pretended to be targets for these predators and at some point in the conversation, when the buyer proposed some kind of deal, the NYPD sign would go up saying you are being watched, or this is the New York City Police Department, or whatever, and that was enough to deter a large number of people. So they were able to stop these things happening before they even happened, just by using these chatbots, and you couldn't distinguish the. The predators couldn't distinguish when it was a chatbot and when it wasn't, because the technology at that point was already good enough that's a great application right so.

Speaker 2:

So my point was there's good use cases as well. And going back to Tessa, what I find interesting about it is the fact that, yes, they started off as a rules-based engine. Right, they started off saying we're going to be rules-based, we're going to be careful about what we say. We understand that there are huge implications about our advice from these systems, so we're going to control that. But they were updated, and when they were updated, um, generative ai features were added, because you know, I mean, it's not surprising, generative ai offers all this fluency, why not?

Speaker 1:

yeah, it feels like it would be an upgrade right.

Speaker 2:

right and in a way it is right. I mean, that's that's kind of what happened with the, with the x and grok they also updated it. And that's when it when all fell apart X and Grok they also updated it, and that's when it all fell apart. And I'm not saying one shouldn't update their software, but at the same time I think it's worthwhile noting that in this particular instance of Tessa, the update made it generative, from rules-based to generative, and now all bets are off. And now you're working with LLMs and the space of possible outputs is astronomical and an input that comes in or a piece of text that comes in, you have no clue whether or not the output from the LLM based on that input is going to be acceptable or not, and there's no way to. They didn't do enough to check for that.

Speaker 1:

So when it started it was like, was like oh, let me pull from these canned responses, but once the generative ai was introduced it could pull from like anywhere on the internet it wasn't pulling right anymore. I mean, that's the difference it was making stuff like how does it know, like some of this stuff we're losing one to two pounds a week. That's something that you might tell someone who has obesity, for example, like it.

Speaker 1:

It gave some information that at surface level, sounds like that could be clinically correct for certain subpopulations right not for someone with an eating disorder, but like it got some information that's like oh you know, should you track your weight, here's how you track it on a scale. Weigh yourself once a day or once a week, or whatever it is Like it's appropriate for certain subpopulations. So where is it getting that from?

Speaker 2:

So here's what I would say about that. One is you can institute guardrails through the prompt right. So there's a lot of things that people can do at the prompt or designing of these systems. Even if it was generative to say, hey, these are the topics you can talk about and these are the things you must never say.

Speaker 1:

Okay, never say lose weight, never say restrict calories.

Speaker 2:

You can say all that up front and that's part of the context window. What that means is it's part of the scratchpad paper that the AI let's pretend as a metaphor, the scratchpad paper that the AI is using to look at while it's talking to you Something. The scratchpad paper that the AI is using to look at while it's talking to you, right, Something it should remember all the time it's available to it. It doesn't mean it's going to look at it all the time, but it's available to it. But then you start having a conversation and maybe a question of weight came up. Maybe it's a chat, right?

Speaker 2:

It's not just a one-time. It's not just a user asking a question and it providing the answer answer and just be a single interaction. It's a multi-turn interaction, in which case there are things that the user is saying is now being added to that scratchpad, and the more things that are being added, it might drive the conversation or steer the conversation in a particular way that is potentially unpredictable and at that point, again, all bets might be off, because it might still think that it's doing the right thing by answering the user's question, because it might have conflicting goals. The goals might be hey, let's make sure this user is happy or is in a better place.

Speaker 1:

Or we're answering all the questions that it's asking, right?

Speaker 2:

But that goal conflicts with the goal of don't do this and don't do that.

Speaker 1:

Never say lose one to two pounds a week or never. Say restrict your calories.

Speaker 2:

Right and these sorts of goals when they conflict. We need accountability. Humans also deal with conflicting goals all the time, but when we make a decision based on we understand which goals we selected and why we selected them, and when somebody else asks us, why did you do that? You made a mistake, well you can say, yeah, I did that because of this, and they're able to teach you no, actually, that goal is actually kind of more important. You're able to prioritize that. That's not what's happening in these interactions and so, even if you have a certain amount of guardrails, you still have to be kind of careful, because the conversation could steer in a particular way. You need constantly, you know you need to constantly track what's being said and really see if those things are within bounds, which is really hard to do.

Speaker 1:

Yeah, it sounds like it. I mean it clearly like this, this chatbot Tesla that we're talking about. They took it down as soon as they realized this was happening. But it's like you go in with the best of intentions and you have an update with the best of intentions and it just it didn't pan out in the end.

Speaker 2:

Yeah, and you know the problem is, if you ask it to really be strict and not say things, then it stops being that useful, because then anytime you ask it any kind of question that is potentially, you know, going to lead to a goal conflict, it might just be like I'm not going to answer that, I can't answer that, I can't answer that, and that's frustrating for people because that's not actually helpful. So there's a fine line between being helpful and between you know, sort of crossing the line, so to speak, and that's something it may not have been designed for and it may not have been prompted to deal with. Right, it may not have, and, frankly, it's really hard to do that right, you have to build out all the test cases for this.

Speaker 1:

Which is impossible to think of every single test case, right, right. So I have a question for you, which is impossible to think of every single test case, right, right? So I have a question for you. Yeah, would you, or have you ever used an AI chatbot?

Speaker 2:

Yeah, I use it every day. An AI chatbot for like chat GPT.

Speaker 1:

Is that considered? That is considered a chatbot in and of itself.

Speaker 2:

It is a chatbot absolutely. I mean, everybody uses chatbots all the time and sometimes when you call up and you get an automated answering service, a lot of those are chatbots. Now, when you make a restaurant reservation, I mean, sometimes you have a human, but sometimes you have a chatbot, and there are so many. I mean anybody who has an Alexa at home is using a chatbot. Anybody who is using Siri is using a chatbot.

Speaker 1:

You're using chatbots all the time, mind blown.

Speaker 2:

Well, yeah, I mean, I think it's, and I'm sure lots of people are using chatbots to ask personal questions. Ask you know when they're depressed or when they're dealing with mental health issues. Ask you know when they're depressed or when they're dealing with mental health issues. I wouldn't be surprised if a lot of people reach out to their chat GPTs to voice out their feelings and get feedback from the system. And there's no regulation on what it should say. It's not licensed in any manner, right? It's not a therapist.

Speaker 1:

It sure is cheaper than one, yeah Well yeah, is cheaper than one. Yeah, well, yeah yeah.

Speaker 2:

Yeah, so you know, I think we, it seems like you know. A topic like this to me is always interesting because on the one hand, you have an excellent potential use case and on the other hand, you have this issue of just being, you know it being dangerous because it can go over the line. So that trade-off, drawing that line is at the core of this and you know, I always wonder what the right way to do that is. And one way to do that is through installing guardrails and making sure that the prompts are telling it what not to say and all of those things. But a certain degree of human oversight might be valuable. Certain amount of testing might be necessary.

Speaker 1:

Especially with a system update.

Speaker 2:

Right, right and, frankly, the other piece is that systems are going to be constantly updated, and I think a big takeaway here is that the stakeholders involved must try and understand what the update actually does. And I'm not saying they didn't in this particular case, but I think it's really hard to do that in a busy business setting. But doing that is necessary because, on the one hand, you went from being a rules-based to a generative system, and that jump is pretty substantial in this kind of application, pretty substantial in this kind of application. And so I think that a big takeaway is for stakeholders to take the effort to kind of dig deeper and understand what's going on.

Speaker 1:

Yeah, because I do think these could have great applications, whether it's AI, coaching, translation, you know when a new pandemic hits having a triage system. There are so many use cases for this in healthcare. It's just hard when you see examples like this where it didn't quite go the way that we would want it to clinically.

Speaker 2:

Let me ask you a question.

Speaker 1:

Have I ever used an AI chatbot?

Speaker 2:

No, no, In fact I have.

Speaker 1:

Now I can tell you I have used Alexa and Siri.

Speaker 2:

Does a robot vacuum count? I guess my question really is to you is are you excited or nervous, or somewhere in between, about the prospect of introducing ai into your workflow, into um, into a clinical setting, and what is your thought about that? How would you approach if a chatbot company came in and said, hey, I wanna add this new chatbot. What would be a set of questions that you might wanna ask.

Speaker 1:

Well, first of all, I would say I am excited because I do think it could improve efficiency and make it easier to scale things, so like that, at baseline. I'm excited for that. But I guess now I know, to ask whether or not it's rules-based or generative, what would happen with a system update? And now I'd also say, like I actually want to test it myself, like what is the use case? What are we using it for? And then I'd like want to, I'd like give it weird things and see if I could throw it off, essentially before I released it to any patients.

Speaker 2:

Yeah, and I would say you would try to. You should probably actively try to steer it. Steer it towards things that you don't want to test in order to test it, to see if it goes there.

Speaker 1:

It feels weird, right, because I would never say, oh, I'm hiring a new nurse and I wanna see if this goes there. It feels weird, right, because I would never say, oh, I'm hiring a new nurse and I want to see if this new nurse tells people with an eating disorder to restrict their intake by 500 calories a day, or that they should get on the scale as soon as they come into the office, or they should lose two pounds a week. Like that would never be in my mind as something I would test my new, you know, hire for. But that's essentially what you're doing here is, like you need to verify that it works before you can, like, put it out in the prime time.

Speaker 2:

Yeah, and verification is hard. But I think doing that proactively and being really harsh about it, being really trying to make it fail and break, is critical. I think we all have to become hackers in some sense, become ways to find ways to break it, because that is the only way to know when it will not work.

Speaker 1:

I'll also say that, like you said, chatgpt and Gemini all of these are chatbots that we use every day. So if people are going to those chatbots with their mental health symptoms, wouldn't it be better if we had a chatbot that was like, actually programmed by professionals who deal with patients with these conditions all the time?

Speaker 2:

100%, and I think that's a great place to end as well.

Speaker 1:

Awesome. Thanks for joining us. We'll see you next time.

People on this episode