Code & Cure

#41 - If You Cannot Trace The Data, Do Not Trust The Model

Vasanth Sarathy & Laura Hagopian

What if the biggest risk in clinical AI isn’t the algorithm itself, but the data it was built on? A model can appear accurate, polished, and ready for real-world use while quietly relying on datasets with unclear origins, missing documentation, or hidden flaws. In healthcare, that is more than a technical issue. It is a patient safety issue.

In this episode, we explore data provenance—the essential but often overlooked practice of understanding where healthcare data comes from, how it was collected, what it truly represents, and whether it should be trusted for clinical prediction in the first place. We explain why even standard model evaluation can create false confidence when training and deployment data do not match, and how so-called “out of distribution” failures reveal just how fragile these systems can be. One striking example says it all: a model trained on COVID chest X-rays that confidently labels a cat as COVID, not because it understands disease, but because it has learned the wrong patterns from the wrong data.

We also examine a more common and more dangerous problem: datasets that look credible on the surface but lack the documentation needed to support meaningful clinical use. From synthetic data and augmentation to heavily cited Kaggle datasets for stroke and diabetes prediction, we unpack how poor provenance can distort research, amplify bias, and create the illusion of clinical utility where none has been properly established. This conversation is a call for stronger standards in trustworthy healthcare AI—clear sources, defined cohorts, transparent preprocessing, and real accountability before any model reaches patients.

References:

Gibson et al. "Evidence of Unreliable Data and Poor Data Provenance in Clinical Prediction Model Research and Clinical Practice." medRxiv preprint (2026).

Basu. "Dozens of AI disease-prediction models were trained on dubious data." Nature News (2026).

Credits:

Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/

Why Data Provenance Matters

SPEAKER_01

What if the model guiding a clinical decision wasn't wrong because of the math, but because no one really understood the data behind it?

SPEAKER_00

Hello and welcome back to Code and Cure, the podcast where we decode health in the age of AI. My name is Vasanth Sarathy. I'm a cognitive scientist and AI researcher, and I'm here with Laura Hagopian.

SPEAKER_01

I'm an emergency medicine physician. And today we're going to talk about this concept of data provenance and the idea that if you train models on crappy data, then you get crappy things out of them.

SPEAKER_00

Yeah, this is the whole garbage in, garbage out idea, right? That's always been said about computers in general.

SPEAKER_01

Yeah, exactly. And when you apply this with a clinical lens, it's like, okay, well, the garbage out means you're making clinical decisions for someone based on garbage, which is, ugh, not good.

SPEAKER_00

Yeah. I mean, ultimately a lot of these machine learning and AI models are meant to make our lives easier by letting machines do the things we routinely do, and speed it up and scale it up and so on. But if they're wrong, then that's obviously a problem.

SPEAKER_01

And it's not necessarily a problem with the model itself, right? The problem is in the training data. You have tons and tons of data available, and you're like, okay, let's crunch these numbers, let's find the patterns, let's predict something, let's make decisions based on them. But you have to make sure all that data you're putting into the system is actually good and correct.

SPEAKER_00

That's right.

SPEAKER_01

And if you don't, then you end up with a big problem.

SPEAKER_00

Yeah. And there are many dimensions to what makes a piece of data good, correct, reliable, all of these words. When you think about clinical data concretely, what you see is a file on a computer with numbers and other things in it. Usually data like this has records, so you have a big table with a whole bunch of rows, where each row provides information about a patient, maybe their various physiological measures, and then some kind of label that says whether this patient has a certain disease or not. That's a very generic type of data set, and you would train a machine learning model to predict that label, that disease or whatever, based on the other factors it was given. But all of those rows in that table came from somewhere.
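
A minimal sketch of that kind of tabular setup, with made-up column names and values, just to make the shape of the data concrete:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical rows: each row is one patient, with physiological
# measures as features and a disease label as the prediction target.
records = pd.DataFrame({
    "age":           [54, 61, 47, 70],
    "bmi":           [27.1, 31.4, 22.8, 29.0],
    "blood_glucose": [118, 162, 95, 140],
    "has_disease":   [0, 1, 0, 1],   # the label the model is trained to predict
})

X = records.drop(columns="has_disease")   # input features
y = records["has_disease"]                # output label

model = LogisticRegression(max_iter=1000).fit(X, y)  # learn the input-to-output pattern
print(model.predict(X.head(1)))                      # predict for one patient's features
```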

SPEAKER_01

Hopefully from someone, right? Hopefully they weren't made up. Like, real people.

Generalization And Real World Fit

SPEAKER_00

They were real humans. Because that's the idea. Ultimately, the point is that the machine learning model needs examples, and by an example I mean it needs input and output pairs to learn how to go from the input to the output. What's the pattern? That's the basis of all of this data. And what makes a good data set has many different meanings. One is that whatever data set you have is in some sense finite. Maybe you have 20 rows, maybe you have two million rows, but the data set has a finite number of things in it. And based on that finite number of things, you're going to have to predict other things in the future that are not in the data set. That's the model's job: generalize from what's in the data set, learn the patterns, and then apply them to a new example. And that new example needs to be the kind of data you've already looked at in your 20-row or two-million-row data set.

SPEAKER_01

And we've talked about this before, right? Like if you have this huge data set, you might train on a chunk of it.

SPEAKER_00

Yeah.

SPEAKER_01

And then you would test on some part of the remainder that you haven't trained on. You set it aside from the beginning. And then you're like, okay, I want to test on the rest of the set and make sure it's functioning as we intended it to function.

SPEAKER_00

Yes. But even in that case, once you've decided the model works really well and tested it on what's called a holdout set, the part of the data you held back, you come out and say, look, I have this model, it works. Then you go out there and put it into the real world. And now you're getting real information coming in, and you have to be sure that information has the same shape and the same behavior as both your original training set and the part you left out. All of that was done before you put this model out in the world.
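
A sketch of that train/holdout workflow using scikit-learn; the data here is synthetic stand-in data, not a clinical set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data; in practice X and y would be the patient table and its labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Set part of the data aside from the beginning as the holdout set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Caveat from the episode: a good holdout score only speaks to data shaped like
# the training data, not to whatever shows up once the model is deployed.
```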

SPEAKER_01

And of course, not surprisingly, if you've trained on data that came from one region of the world, say from Asia, it may not apply that well in, say, the United States. And that's something you always have to keep in the back of your mind: we need to apply it in the same kind of situation, or it may not generalize to every population. So you have to be careful when you start to apply it.

Out Of Distribution Cat Example

SPEAKER_00

Machine learning researchers use jargon like "out of distribution" to describe a data point that's not within the space the machine was meant to learn, something outside of it. The classic example is a study done with COVID lungs: the data was chest X-ray images, and the label was either the person has COVID or they don't. The model trained on that data set, and then someone decided to show it an image in the same format, because an image is just a computer file. Somebody gave it an image of a cat, and it was 100% confident that this was a COVID lung. Now that sounds ridiculous to a human. You look at an X-ray of a lung and a picture of a cat and they have nothing to do with each other, but that's not what the model learned. The model learned: given these pixels and these arrangements, this is the prediction I can make, this is the pattern I see. So it applied that to the pixels of the cat. But obviously the cat is out of distribution; that's not the kind of pattern you care about. If you were interested in detecting cats, you wouldn't train on COVID lungs. So there's a mismatch between what was in the data and what was out there. That's one set of issues with data: the machine learning model learns something very specific and then gets put somewhere it's not meant to be. But that's different from another problem. Even if you had these COVID lung images and you showed the model another image of a lung, you could still have data problems if that original data set was unreliable in some other way.
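
A small illustration of the same failure mode on a toy task; handwritten digits stand in for chest X-rays, and random noise stands in for the cat:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Train a classifier on one narrow image domain (handwritten digits here,
# standing in for COVID chest X-rays).
digits = load_digits()
clf = LogisticRegression(max_iter=5000).fit(digits.data, digits.target)

# Feed it something out of distribution: random noise with the same shape.
# As far as the model is concerned, the "cat" is just another pixel array.
noise = np.random.default_rng(0).uniform(0, 16, size=(1, 64))
probs = clf.predict_proba(noise)[0]
print("predicted class:", probs.argmax(), "confidence:", probs.max())
# The model still returns a label, often with high confidence; it has no notion
# that the input lies outside anything it was trained on.
```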

SPEAKER_01

That's actually what I was going to ask you. You were talking about, okay, maybe there are 20 rows in a table versus two million rows. Well, how do you even get two million rows? You may not even be able to get that much healthcare data, especially because it's really hard: you have to de-identify everything and make sure a patient can't be re-identified, all of that. So I'm curious, do people ever simulate or fabricate or in any way make up data to help create these training sets?

SPEAKER_00

Yes. So even in a world where we have the right domain, where we're not testing cats on COVID lungs, say you have a bunch of images of COVID lungs and maybe you don't have enough of them. These machine learning models need a certain amount of data to learn what they need to learn, and maybe you just don't have enough. This actually happens a lot, not necessarily with images but in the space of language, where maybe you have examples from some small community and there's just not enough data in their specific language. That's what's called a low-resource language, where you just don't have as much coverage. There's a lot of English data out there, but there might not be a lot of that specific language. So what people end up doing is finding ways to augment the data, sometimes with the help of AI systems, creating new data items like the ones that exist in order to bulk up the data set so the machine learning model can learn as much as it can. Now, this is a measure you take when you don't have enough data and there's nothing else you can do; hopefully you try to collect more data, but if you can't, you have to find some other way to train the model.
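
One very simple form of tabular augmentation, sketched only to make the idea concrete (resampling existing rows and adding small jitter); the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "real" data set: 20 rows of three made-up measurements.
real = rng.normal(loc=[120.0, 27.0, 6.0], scale=[15.0, 4.0, 0.8], size=(20, 3))

# Augment by resampling real rows and adding small random jitter to each copy,
# so the new rows inherit whatever patterns the 20 originals happen to have.
idx = rng.integers(0, len(real), size=2000)
augmented = real[idx] + rng.normal(scale=0.05 * real.std(axis=0), size=(2000, 3))

print(real.shape, "->", augmented.shape)
# The augmented set can only ever reflect those 20 originals; if they were
# unrepresentative, you now have 2,000 unrepresentative rows.
```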

SPEAKER_01

Okay, but is that bad? I mean, I can understand why it needs to be done, but are you training it on fake data?

SPEAKER_00

Yeah. Again, it comes down to the distribution of the data. If that small, tiny 20-item data set has a certain behavior, a certain pattern to it, and your augmenting process creates 200,000 more items of the same type, then the question is: how good is that? The augmenting process might be fine if, in fact, those 20 samples were really representative of the real world. And this goes back to other data issues. Forget synthetic or augmented or artificial data: even if you had 200,000 real data items, there could still be problems. If all of the patients in the data set were white males, then what you have is essentially the lung-and-cat situation again, where the model is great at predicting for white males but might not be great at predicting for Black females. The data set now has certain biases built into it. That's another sense in which the data needs to be, quote unquote, good: it has to be representative, real-world data, and it has to have coverage. There are limits to what it actually captures; the data set doesn't contain every single human in the world, so ultimately we're making some guesses and trying our best to balance it. It also needs to be accurate. Even with synthetic generation, if it's producing rows of medical data that look medical but don't actually make sense, then there's a problem there too. Maybe there's some medical measure that shouldn't correlate with another, but because the data was synthetically generated and the system didn't know to look for that, they ended up correlated, which would look weird to a doctor. The system doesn't know that. So that's an issue with artificially generated data as well.
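
A quick way to check coverage in a cohort table, sketched with hypothetical demographic columns:

```python
import pandas as pd

# Hypothetical cohort table; the demographic columns are purely illustrative.
cohort = pd.DataFrame({
    "sex":   ["M", "M", "F", "M", "M", "F"],
    "race":  ["white", "white", "white", "black", "white", "asian"],
    "label": [1, 0, 0, 1, 1, 0],
})

# Coverage check: how many examples does each subgroup actually contribute,
# and what does the outcome look like within each one?
print(cohort.groupby(["sex", "race"]).size())
print(cohort.groupby(["sex", "race"])["label"].mean())
# Subgroups with only a handful of rows (or none at all) are ones the model
# has essentially never seen: the tabular version of showing it a cat.
```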

A Checklist For Provenance Quality

SPEAKER_01

Yeah. So it's interesting, because what they did in this paper is they said, okay, we're going to look at some of the common data sets that get used in a lot of clinical papers. They're used in review articles, they're used in clinical prediction studies; these are very commonly used data sets about common conditions, stroke and diabetes. And let's figure out if they have good data provenance, if it's clear, if we understand where the data is coming from, if it's authentic. They actually went through a checklist for two data sets that were used in a bunch of papers and prediction models and tried to figure out, hey, are these any good? What were some of the things they looked at?

SPEAKER_00

Yeah. They looked at things like the sources of the data, how the data sets were developed, how they were evaluated. Were there randomized trials? What were the cohorts? Was the data registered? How representative is it? And then specific things about the collection process: When was it collected? What were the start and end dates? Who were the participants? What were the demographics? What was the setting of the studies? What kind of quality controls were done up front to make sure the data was good? Did they actually test whether the data set was going to be predictive for that sample of people? What was the size of the study? If there was any missing data, how was it handled? What were the missing spots filled with, or were they left empty? There's a whole host of questions. And data provenance is just a fancy way of saying I need to understand the origin and the context around that data set.
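
A rough sketch of that checklist as a provenance record; the field names paraphrase the discussion here, not the paper's exact items:

```python
from dataclasses import dataclass, fields
from typing import Optional

# A rough sketch of a provenance record; every field left as None is a
# question the data set's documentation never answers.
@dataclass
class DatasetProvenance:
    source: Optional[str] = None                 # who collected it, and where
    collection_start: Optional[str] = None       # when collection began
    collection_end: Optional[str] = None         # when it ended
    setting: Optional[str] = None                # clinical setting of the study
    cohort_definition: Optional[str] = None      # who the participants were
    demographics: Optional[str] = None           # how representative the cohort is
    sample_size: Optional[int] = None
    quality_controls: Optional[str] = None       # checks done before release
    missing_data_handling: Optional[str] = None  # how gaps were filled or dropped

def unanswered(p: DatasetProvenance) -> list[str]:
    """Return the checklist items left undocumented."""
    return [f.name for f in fields(p) if getattr(p, f.name) is None]

print(unanswered(DatasetProvenance()))  # an undocumented data set answers none of them
```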

SPEAKER_01

Like, do we think it's reliable, essentially? And the stuff you were just reading off feels like things any data set should have. Where did the data come from? When was it collected? What are we predicting? All of those seem like normal things you would expect to have with a data set.

Kaggle Datasets Used As Evidence

SPEAKER_00

Yes. And machine learning research usually begins by stating the problem you're trying to solve as a machine learning problem. Machine learning problems have a very specific format: you state what the shape of the input is, you state what the shape of the output is, and then you come up with some ideas for models that can map the input to the output. That's the rough idea of designing the models, but you start with this broad issue and then try to build a data set that makes sense. Now, a lot of research also depends on other people's data sets, and that's normal. You have experts who build data sets and write papers about the data sets they've built and their utility. Those data set papers are actually very useful, because they describe the provenance of the data set and they establish some baselines; that is, they train some models on the data set and tell you how those models behaved. Those are really good data set papers, and then other researchers can use those data sets for their own models and cite those papers. That's ideally how this is done. Now, there are websites, like the one used in this paper, called Kaggle. Kaggle is a very popular source of data, and anybody can upload a data set to it. I can just upload one and say this is a diabetes data set. It was mainly used to help new and upcoming machine learning students practice their skills on various types of data sets, but it now hosts both artificial data sets and real ones, and they kind of overlap. So the two data sets used in this paper, one for stroke and one for diabetes, are on Kaggle and they are heavily used. The last I checked, there were something like 290,000 downloads, and the number keeps going up.

SPEAKER_01

That's right. Yeah.

SPEAKER_00

That's right. And people can comment on them and so on. So in this study, they looked into those data sets and asked, okay, what was there? Did they answer the questions about data provenance, all the things I just mentioned?

SPEAKER_01

No, they answered zero of them. There's this whole checklist that they went through: do we know the source of the data? Do we know the date it was collected? The setting it was collected in? Any pre-processing or quality checking? Do we know the outcome being predicted? Do we know how missing data was handled? I could go on, but there was a nine-item checklist, and zero of those items were there. So we don't know the data provenance. And one of the data sets actually said, hey, this is for educational purposes only; it should not be used for research, and it should not be used for commercial purposes either. So if it's just for people trying to mess around and build a model, that's one thing, but it's not supposed to be used for clinical prediction.

SPEAKER_00

And it is being used for clinical prediction. That's the next step, right? There's a whole bunch of papers that the authors of this paper identified that used those data sets.

SPEAKER_01

Exactly. And so now if we're making decisions based on data that we don't know where it came from and we don't know if it's good.

Red Flags Inside The Data

SPEAKER_00

Yes. It's not a good thing. And a couple of things were really shocking to me. One was that the data set had some inconsistencies, like medical inconsistencies.

SPEAKER_01

Yeah, and we can talk about that, because there were definitely some irregularities when they started to look at the data, just kind of weird patterns. So, first of all, when you have a real-world data set, it's always missing a ton of information, right? It's super messy. And that's frustrating, but it's something I would expect to happen. Say you want to make sure you have weights and heights on everyone so you can calculate their BMI, and you have 50,000 people in your data set; there's no way you're going to have heights and weights and BMIs on every single person. There's just no way. But in this data, pretty much no data points were missing.

SPEAKER_00

That's incredible. Yeah.

SPEAKER_01

I mean, it makes me think that it's not real.

SPEAKER_00

Or they handled the missing data without telling you how.

SPEAKER_01

Yes, exactly. We would need to know the reason. If they were omitting a person because they didn't have data, that actually changes your data set significantly, because there might be some sort of pattern there. Yep.

SPEAKER_00

There's another variable, another reason why those values are missing. Exactly.

SPEAKER_01

Like maybe we don't have a weight on someone because there was no scale in the clinic in a low-income neighborhood. Or we didn't have heights on 10,000 people, so we just got rid of them. That doesn't compute; you're changing what your output is going to be by doing that.

SPEAKER_00

Right.
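
A quick missingness sanity check of the kind being described, assuming the data set is a CSV file (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("diabetes_dataset.csv")  # hypothetical file name

# How much is missing per column? Real-world clinical tables almost always have
# gaps; zero missingness with no explanation of how gaps were filled, or which
# patients were dropped, is a red flag rather than a feature.
print("rows:", len(df))
print(df.isna().mean().sort_values(ascending=False))
```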

SPEAKER_01

But then there was some other weird stuff in this data. For example, blood glucose is measured in the moment: you prick your finger and you can check it. A hemoglobin A1C is a measure of blood glucose over the course of about three months.

SPEAKER_00

Okay.

SPEAKER_01

And so in your mind, would you expect those to be correlated?

SPEAKER_00

Yeah.

SPEAKER_01

Yeah, exactly. And in this paper, there was not a normal association. You would expect a strong association between those things, and actually with body mass index too. What they found was not a strong association in the data set, which doesn't make sense. You would expect that if one measure of glucose is high, the other would be high too, and if one is low, the other would be low. So it makes you scratch your head, like, ugh, that feels weird, and it makes me wonder if there's a problem with the data. And then the other thing they called out was that for hemoglobin A1C and for blood glucose, there's a wide range of possible values. For hemoglobin A1C you could have 5.1, 5.4, 5.8, 6.7, 7.3, 8.9, 9.2; I could keep going, but the point is you could have a ton of different values.
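
The kind of consistency check being described, sketched in pandas; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("diabetes_dataset.csv")  # hypothetical file and column names

# Spot glucose and HbA1c measure related things on different timescales, so a
# clearly positive correlation is expected; a near-zero value is the kind of
# internal inconsistency that should make you question the data's origin.
print(df[["blood_glucose", "hba1c", "bmi"]].corr())
```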

SPEAKER_00

Yeah.

SPEAKER_01

And in these data sets, for both A1C and blood glucose, there were just 18 discrete results. When you're talking about 100,000 people, like that doesn't even make sense.

SPEAKER_00

Right. Why? How were they clustered into those eight discrete groups?

SPEAKER_01

18, 18. It's 18, but it still doesn't actually make sense. If your average blood glucose is 118 and mine is 121, and somebody else's is something different again, there could be a huge range there.

SPEAKER_00

For 100,000 people, sure, you might have some overlaps, but you're not going to have just 18 groups. That doesn't make any sense, right?
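
And the distinct-values check, again with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("diabetes_dataset.csv")  # hypothetical file and column names

# In roughly 100,000 real patients you would expect a near-continuous spread of
# values, not a small handful of repeated numbers.
for col in ["hba1c", "blood_glucose"]:
    print(col, "distinct values:", df[col].nunique())
    print(df[col].value_counts().head())
```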

SPEAKER_01

Yeah. So you start to look under the hood of this and you're like, well, I don't know where the data came from. And now that I'm looking at the data, it's weird. There's no missing data, there aren't associations where I expect them to be, and there's this weird finding of only 18 discrete results for blood glucose and A1C. It doesn't make sense. And if that's what's being used to create models that then do clinical prediction and get used in clinical practice, that's maybe not good.

Publish Fast And Deploy Faster

SPEAKER_00

No. And to be honest, it was shocking to me to read that this data set was used extensively in so many research papers. What was most striking, not amusing exactly, but scary, was that some of those research papers made claims about the origin of that data, and they were all different. One said it came from a prestigious institute, another said it came from something else, and one of them said it came from Bangladesh. There were something like four or five different places that same data set was attributed to. First of all, it's not possible for all of them to be right; this is one source. But it also means there was some degree of exaggeration, something that had the authors making up the source of the data. Which is interesting, because I understand that in an academic setting you have research pressures to get papers out.

SPEAKER_01

And this concept of fast churn, right?

SPEAKER_00

Yes, exactly. So academics and their students are trying to get papers written, and they go to the nearest resource. They go on Kaggle, they look for a diabetes data set, they find one that has 290,000 downloads and hasn't been retracted, so they think, okay, maybe this thing is meaningful. Let's use it and write a paper: we came up with a new way to do better on this. The current best model has 85% accuracy; we play around with our model and look, we're getting 95% accuracy. Oh my god, this is great. It's doing so well on this diabetes data set, it has all these other implications, we can now use it in this setting and that setting, and so on. That's not an uncommon process for people to follow to write a paper on a new model.

SPEAKER_01

And the idea I'm hearing here, anyway, is that people want to publish quickly and easily, and it's not really about building our understanding of health or predicting or improving health outcomes. The goal is to get something published, and if there's a data set available, let's just use it and publish. But then on my side, I'm like, oh gosh, now they're using this to predict stroke, to say that we can diagnose stroke, to say that we can reduce the incidence of diabetes. And I'm not sure any of that can actually be true if the data behind it isn't valid.

SPEAKER_00

That's right. And they found that not only were there a lot of papers written using that data set, but those papers also claimed real-world utility for their models. So they took it one step further. And that means somebody reading the paper who isn't aware of this is going to think, wait a minute, we have this model here, we can use it in our setting, in our hospital, in our clinic. That's direct deployment. So you have a really bad model that's being directly deployed.

What Journals And Clinicians Must Demand

SPEAKER_01

And so now I'm like, whoa, let's take a step back. Before we had the ability to do this with LLMs and machine learning and these huge data sets, this wasn't something you had to think about as much. I mean, data provenance has always been an issue, but now everyone can just grab these data sets from anywhere and do whatever research they want on them. I think we need to take a step back and have everyone think about the data provenance problem before they use these giant data sets. That includes journals and publishers, it includes the data repositories like Kaggle, it includes the researchers and the clinicians, everyone coming together and saying, hey, before I use this data, I need to know: why was it collected, who collected it, where and when was that done, and who funded the data collection? All of that should be part of the data set.

SPEAKER_00

Yeah.

SPEAKER_01

Right. And if it's not, then maybe a journal would have to go back and say, hey, I need this information; you need to give me this information before I can publish a study on this.

SPEAKER_00

Right, right. And the challenge, of course, is that you have to make sure that information itself is accurate. So I think we should be encouraging the data set makers themselves to write up what they've done. Now, there is one issue, which is that some of the data might be confidential. What do you do then?

SPEAKER_01

Yeah, how do you de-identify this huge amount of data? But that's something that has been done before. It's just another step.

SPEAKER_00

Yes.

SPEAKER_01

Right. And you were talking about how when people are trying to learn, they might download a data set from Kaggle, and it doesn't really matter if it has poor provenance, because that's not the goal. But if the goal is to use it for research and for clinical decision making, at that point I think you have to mandate some sort of provenance reporting and make sure it's de-identified in some way. That piece is really important for something you're actually going to make decisions off of.

SPEAKER_00

Yes, agreed. And I went back on Kaggle to look at one of these data sets, and somebody had posted a comment there citing the paper we're talking about, saying this data set should not be used at all. Which is good; it's sort of come full circle, people have looked at this paper and identified the issue. But even after this paper was written, people are still downloading the data set. So I think the solution is multifaceted. You need the researchers to explain the provenance of their data. You need the journal reviewers to look for that kind of information. You need the people who deploy these systems to understand where the models and the data came from for each system, so they can be sure that things that were only meant for educational purposes, or only meant for whatever, aren't just immediately deployed. Now, this is easier said than done, because everyone has the research churn pressures, the incentives to publish, but there are also incentives to deploy AI and all these fancy new machine learning tools quickly, so you can show that your organization is up to date and keeping up with the trends. So there are all of these other business and practical pressures that come into play. I'm glad we're talking about this, because it is an issue and it's not a fully resolved issue. And ultimately it comes down to the individual researcher who's trying to solve a particular domain problem, who understands their world and that domain really well.

SPEAKER_01

Yeah, absolutely. And I think we can't assume that everything that gets published is free of errors. We're seeing all these papers pass through a peer review process, but we need to take a step back and think about what data the model was trained on, where it's coming from, and make sure it's high quality at all of these different levels, and make sure there are good data sets available to our researchers and clinicians. You can't just say, oh, well, this was used in 80 other papers, so I'm going to use it in mine. You actually have to go back and look at it. And if that information isn't there, isn't available, or doesn't answer these questions, like, we don't know if it's synthetic or simulated, we don't know if any data was removed, we don't know who collected it, then maybe it's time to step back and say, okay, we need to find a different data set.

SPEAKER_00

Yeah.

SPEAKER_01

All right. Well, I think we can wrap up here. Thanks for chatting with me about data provenance today.

SPEAKER_00

Thank you for joining us.