Curating Medical Image Datasets with Jie Wu from Segmed

The accelerated development of medical AI could be life-changing for patients. Unfortunately, accessing large amounts of diverse, standardized data has been a major stumbling block to progress. That’s where Segmed comes in, a platform that allows researchers to access diverse, high-quality, and de-identified medical imaging data. Crucially, Segmed’s platform also provides data for medical AI training and validation.

I am joined today by Segmed’s co-founder, Jie Wu, to discuss how they are solving key data issues to rapidly accelerate medical AI development. You’ll hear Jie break down some of the biggest challenges in curating medical image datasets, including the extra computational power needed to handle high-res medical images like CT scans, and how Segmed is addressing these obstacles. Jie also emphasizes the need for diversity when curating medical image datasets and the importance of mitigating bias during the data curation phase. To learn more about Segmed and how they are contributing to the development of medical AI, be sure to tune in today!

Key Points:

A warm welcome to Jie Wu, co-founder of Segmed.
Insight into how Segmed is solving data issues to accelerate medical AI development.
Why solutions to these data issues are crucial for medical research.
Segmed’s focus on medical imaging data.
Their approach to different imaging modalities.
An overview of the key challenges in curating medical image datasets.
How Segmed determines the amount of data they will need.
Best practices for curating a training set of medical images.
Why collecting a diverse range of images is essential.
An overview of how the quality of labels is assessed by experts.
How imaging modality influences Segmed’s approach to creating datasets.
The variations in datasets across different imaging pathologies.
Special considerations that inform the validation set versus the training set.
How bias manifests in models trained on medical images.
Steps that can be taken to mitigate bias during the data curation phase.
How the need for diverse datasets has increased along with greater awareness of bias.
Jie’s thoughts on the future of foundation models in the medical AI space.

Quotes:

“A high-resolution of CT can take up to several gigabytes of storage itself.” — Jie Wu

“I think the most important piece is actually to collect as diversely as possible. So I ask that given the budget limit or maybe time limit, the size of the data set will be limited but it should be at least representative of the target population and targeted practice.” — Jie Wu

“The best quality labels are curated by experts and it is curated by multiple experts.” — Jie Wu

“A 3D image stores much more information than the 2D images, so you need less data for that.” — Jie Wu

“The external validation datasets require much more carefully curated datasets and much higher quality labels, and also it needs to be representative of the population, of the institutions, and also geographical locations.” — Jie Wu

“We hope that we can enter into the development of AI and make these algorithms go to market faster and benefit more people.” — Jie Wu

Links:

Jie Wu on LinkedIn
Segmed

Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.

Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.

[INTERVIEW]

[0:00:34.6] HC: Today I’m joined by guest, Jie Wu, cofounder and chief data officer at Segmed, to talk about medical image datasets. Jie, welcome to the show.

[0:00:43.2] JW: Hey Heather.

[0:00:43.5] HC: Jie, could you share a bit about your background and how that led you to create Segmed?

[0:00:48.0] JW: Of course. My personal background is actually in the data science and machine learning space. I did my Ph.D. in sustainable development and smart cities at Stanford University. While I was doing my Ph.D., I met my co-founders in an entrepreneurship program at Stanford called Ignite. At that time, my co-founder Martin, who is our CEO right now, was consulting for some AI companies, helping with data annotation for medical AI development, and at that time we thought all of them actually needed some data annotation for their algorithms.

So how about we start a company and try to solve this problem for them? But later we realized that data itself is actually a bigger and more important problem for a lot of AI companies. So that’s how we pivoted a little bit towards solving the data issue itself.

[0:01:44.0] HC: So what all does Segmed do and how do you solve this data issue and probably more importantly, why is this important for medical research?

[0:01:51.3] JW: So currently, we are developing technologies to unblock access to medical imaging data. We’re only focusing on medical imaging data right now. As you probably know, current R&D, especially AI development, largely depends on data and computational power. There are many companies, like AWS, that are already trying to solve the computational problems by moving to the cloud and by having more powerful GPUs, but the data itself is an unsolved problem in medical imaging.

If you ask ChatGPT what kind of data ChatGPT is trained on, it will tell you it was trained mostly on publicly available data. But in the medical data space, there is not enough public data available, especially for medical imaging, which is larger than most other medical data. The images are also stored in PACS systems that are not designed for data-centric R&D. So our mission is to develop technologies to unblock access to this medical imaging data and make it available for people to do R&D.

[0:02:55.5] HC: What types of imaging modalities do you work with? There are a number of different medical imaging modalities out there. Do you work with all of them or certain types?

[0:03:03.2] JW: Yeah, we started with radiology images. We work with all radiology imaging modalities, such as MRI, CT, ultrasound, and echocardiogram, but we are also expanding to other data types, such as pathology and dermatology. It really depends on the needs of our clients.

[0:03:21.1] HC: So why is curating a medical image data set so challenging and how does it compare with the challenges of other types of medical data?

[0:03:28.3] JW: So I’ll say, medical imaging data itself is actually multimodal already. Usually, radiology images are stored in a format called DICOM, which comes with both pixel data and metadata stored in the headers of those DICOM files. In addition, people usually also need other relevant types of information, such as radiology reports, as part of the ground truth or labels. So this multimodal property by itself has already created a lot of problems for a lot of people, because we need to handle both pixel data and text data, and it requires technology to do de-identification on both data types. It also requires different methods to store the different information.

Secondly, I’ll say medical imaging data is also stored in systems that currently are not very user-friendly for batch operations and for data R&D. It’s very hard to do search and curation in the current PACS systems. Also, the images themselves require much more computational resources. A high-resolution CT can take up to several gigabytes of storage by itself, and the transfer or processing of the data requires a very scalable pipeline.

[0:04:48.6] HC: So when you start curating a new dataset, how do you figure out how much data you need to collect? Generally for machine learning, more data is better of course but do you have a sense ahead of time of how much data might be enough?

[0:05:01.5] JW: Yeah, yeah. So I guess usually, the pattern I see is that researchers will start by collecting a few hundred to a few thousand to train the algorithm and test it out, to say, “Okay, is the algorithm performing well, or does it need more data?” And depending on the algorithm and the data itself, the requirement may change.

For example, an algorithm with a targeted application, or maybe a targeted population, would need less data. If you are developing a more general model, targeted at multiple applications with a single model, you will need a larger amount of data. And as you may know, people have started to develop foundation models, which are more general and are designed to handle all different kinds of medical imaging types.

So they will actually need much more data than the traditional ones, and that changes the landscape of how much data people will need. Yeah, definitely the [inaudible 0:05:55.1] it depends on the algorithm and the data.

[0:05:57.5] HC: So once you understand the application and the goals for a particular client then you can provide some guidance on how much data and what is reasonable to collect for a particular disease?

[0:06:07.8] JW: Exactly, exactly. Yeah, and usually the process will be divided into different stages. In the first stage, people just try to gather some data and play with it, and once they have some preliminary results, they will know how much more data they will need. Then the second stage is more about some serious development.

They will figure out the number and then eventually move to the validation stage or testing stage. There, you would collect additional datasets.

[0:06:36.8] HC: What are some best practices for curating a training set of medical images?

[0:06:41.5] JW: I think the most important piece is actually to collect as diversely as possible. So I ask that given the budget limit or maybe time limit, the size of the data set will be limited but it should be at least representative of the target population and targeted practice. I think this is the most important piece. There are also other elements, such as the need to collect a balanced dataset.

The dataset cannot include only maybe two data points of a certain [inaudible 0:07:11.4] while there are a hundred data points of other [inaudible 0:07:13.5]. It will cause some bias problems for the algorithm. I think it is also super important to collect high-quality labels. Of course, it takes time and it takes money to do it, but the higher the quality of the labels, the more accurate it really is.
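
The balance concern Jie describes can be checked with a simple per-class count. What follows is a minimal sketch, not Segmed’s method: the label names, the `imbalance_report` helper, and the `max_ratio` threshold of 10 are all hypothetical choices made for illustration.

```python
from collections import Counter


def imbalance_report(labels, max_ratio=10.0):
    """Count examples per class and flag classes that are rare next to the largest one.

    max_ratio is an illustrative threshold, not a published standard.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    # Flag a class when the largest class outnumbers it by more than max_ratio.
    flagged = sorted(c for c, n in counts.items() if largest / n > max_ratio)
    return counts, flagged


# Hypothetical label list: one class has only two examples.
labels = ["normal"] * 100 + ["nodule"] * 2 + ["effusion"] * 40
counts, flagged = imbalance_report(labels)
print(dict(counts))
print("under-represented:", flagged)
```

A flagged class is a prompt to collect more examples of it (or to handle the imbalance during training), not a verdict on the dataset.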

[0:07:30.2] HC: So what goes into assessing the quality of labels? What things do you think about there?

[0:07:35.0] JW: Yeah, so there are some human-curated labels, and there are also some machine-generated labels. For example, in the radiology space, people can also generate labels based on the radiology reports, trying to extract the labels using some NLP tools or maybe some human-defined rules. Those labels tend to be lower quality, because errors can happen not only during the extraction step of the NLP algorithm, but also at the stage when the radiology report is written.

I guess the best quality labels are curated by experts and it is curated by multiple experts, who reach consensus on the labels. But again, this is really very expensive and also time-consuming, and it usually only happens at the validation stage.
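
One common way to turn several expert reads into a single label is a majority vote with an agreement threshold. The sketch below assumes that approach; the interview does not say how consensus is actually reached, and the study IDs, label values, and `consensus_label` helper are hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_votes=2):
    """Return the majority label, or None when the readers do not agree enough.

    None means the study should go to adjudication by another expert.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common(2)
    top_label, top_n = ranked[0]
    # A tie for first place is treated as no consensus.
    tied = len(ranked) > 1 and ranked[1][1] == top_n
    if tied or top_n < min_votes:
        return None
    return top_label


# Hypothetical reads from three experts per study.
reads = {
    "study_001": ["positive", "positive", "negative"],
    "study_002": ["positive", "negative", "uncertain"],
}
consensus = {study: consensus_label(v) for study, v in reads.items()}
print(consensus)
```

Returning `None` instead of guessing keeps disagreements visible, which matters most for the validation-stage labels discussed here.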

[0:08:27.1] HC: Does the imaging modality influence any part of how you create a data set? For example, pathology versus radiology and the types of images?

[0:08:35.4] JW: Of course, of course. I think there’s actually a very large difference between pathology and radiology, because radiology is more standardized: usually all imaging exams are stored in the DICOM format. Pathology, from our understanding, is currently not so standardized. Different institutions may have different formats for storing those images. Of course, then the curating process will be different, and even within radiology itself, it can differ quite a bit.

Usually, 3D images such as CT will be much larger than X-rays, and they need much more complicated annotation if you annotate every single slice, sometimes at the pixel level. Also, those 3D images usually require different algorithms compared to 2D images such as X-rays. Because the computational resources needed to train the algorithm are different, people have to account for those differences. Thus, the data infrastructure would be different as well.

[0:09:37.2] HC: Will the size of the dataset, the number of patients that you need to solve a problem, vary from one imaging pathology to another?

[0:09:43.5] JW: Yes, yes. I think that usually, we can see that people will use larger datasets for X-rays compared to CT. I think one explanation is that X-rays are easier to get. The other explanation is that a 3D image stores much more information than the 2D images, so you need less data for that.

[0:10:05.6] HC: Validation is important for all uses of machine learning, but it’s especially critical for medical applications. Are there any special considerations in forming the validation set that might be different than for the training set?

[0:10:16.6] JW: Yes, yes. In the world of AI development, the validation dataset is usually defined as the dataset used to fine-tune the algorithms; this is the second stage, before testing the algorithm with the test dataset. I’m not sure: are you referring to the validation dataset for fine-tuning the model, or the validation dataset for the actual FDA certification?

[0:10:39.4] HC: Validation like you know, an external cohort of data or for FDA approval if that’s what’s needed for a particular application.

[0:10:46.0] JW: Yes, so usually I think validation datasets require very diverse data, and they need to be checked against the targeted population. For example, US FDA validation datasets usually require many more different institutions, in at least two to three different states. So the data needs to be checked for whether it is representative of the population. Also, validation datasets usually have higher quality labels: people usually have the validation dataset read by radiologists again, against the original radiology report. So the short answer is that usually, the external validation datasets require much more carefully curated datasets and much higher quality labels, and also they need to be representative of the population, of the institutions, and also geographical locations.
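
A coverage check like the one described can be sketched by counting distinct institutions and states in a candidate validation set. This is an illustrative sketch only: the record fields, the `coverage_gaps` helper, and the default thresholds are hypothetical placeholders, not regulatory requirements.

```python
def coverage_gaps(records, min_institutions=3, min_states=2):
    """List the ways a candidate validation set falls short of coverage targets.

    The default targets are placeholders; real targets depend on the study design.
    """
    institutions = {r["institution"] for r in records}
    states = {r["state"] for r in records}
    gaps = []
    if len(institutions) < min_institutions:
        gaps.append(f"only {len(institutions)} institution(s), want {min_institutions}")
    if len(states) < min_states:
        gaps.append(f"only {len(states)} state(s), want {min_states}")
    return gaps


# Hypothetical validation set drawn from two hospitals in one state.
validation_set = [
    {"study": "s1", "institution": "hospital_a", "state": "CA"},
    {"study": "s2", "institution": "hospital_a", "state": "CA"},
    {"study": "s3", "institution": "hospital_b", "state": "CA"},
]
gaps = coverage_gaps(validation_set)
print(gaps)
```

The same pattern extends to demographic fields, so long as each record carries them.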

[0:11:39.0] HC: With the recent focus on ethical AI, bias has been in the news a lot. How does bias manifest with models trained on medical images, and what are some things you can do in the data curation phase to mitigate it?

[0:11:53.1] JW: Yeah, so from what we can tell in the industry, sometimes an algorithm trained on data from a specific institution may not work at another institution, because how they acquire the data may be different. The protocol may be different, and an algorithm trained on one machine may not work on another machine from another vendor. And then there are also some differences among the populations themselves.

So I guess the way to do it is to collect as diverse a dataset as possible. Researchers need to check the heterogeneity of the data across different populations and institutions, and try to tackle it if they see significant heterogeneity.

[0:12:34.6] HC: And the clients that you work with, have you noticed an increase in the awareness of bias and the need for diverse datasets? Has it increased over time, or are there still some organizations that aren’t yet aware of the challenges with bias?

[0:12:48.3] JW: Actually yes, we do think the industry tends to pay more attention to this aspect of AI development, and we can see people requiring much more detailed demographic information, like race and ethnicity. I think previously, people paid less attention to race and ethnicity, but now it’s almost a must-have for the development of AI. We want to make sure the AI algorithms can treat all populations, all races and ethnicities, equally.

[0:13:17.6] HC: As someone who works in this industry, it is definitely good to see the increased focus and the increased awareness related to these ethical issues.

[0:13:25.3] JW: Yes.

[0:13:26.0] HC: Is there any advice you could offer to other leaders of AI startups?

[0:13:30.0] JW: Well, one trend I see is that there are many companies trying to develop larger AI foundation models, also in the vertical space of medical imaging. So I would suggest that AI startups pay attention to the development of those AI foundation models. They will likely change the way people develop new algorithms, and also change the magnitude of data needed in development, so startups need to be smarter about their data strategy.

[0:14:00.3] HC: So being more careful to get diverse data and to minimize bias, is that what you’re getting at?

[0:14:05.7] JW: Not really. I’m just saying that people are trying to develop larger AI foundation models, which aim to cover not only one specific disease but to understand the basics of medical images, and those foundation models can be fine-tuned for tasks on specific diseases. This changes the way people developed their algorithms before. I think historically, people have focused on very specific diseases for their AI development, for the algorithm development.

They develop different algorithms for different [inaudible 0:14:37.9], but now, with the recent larger foundation models, people think, “Hey, maybe I can just develop a very large model with a lot of parameters, so that this model can understand everything about images in the medical space, and then for specific tasks, I can just fine-tune this model to a specific task.” So this is different from the traditional way people have developed medical AI models.

[0:15:03.3] HC: Do you see foundation models as the future in this space?

[0:15:06.2] JW: I will say this is still at an early stage. But given that in the tech space people already use foundation models quite a bit, I do believe that it will significantly change the landscape here.

[0:15:17.5] HC: Finally, where do you see the impact of Segmed in three to five years?

[0:15:21.1] JW: Segmed is aiming to become the number one player in the medical imaging space. We would like to see many more FDA-approved algorithms using Segmed’s diverse data. We hope that with our technology and our data, we’ll accelerate the whole cycle of medical AI within this [inaudible 0:15:38.2] but also improved. Yeah, we hope that we can enter into the development of AI and make these algorithms go to market faster and to benefit more people.

[0:15:51.1] HC: This has been great. Jie, your team at Segmed is doing some really important work for medical imaging research. I expect that the insights you shared will be valuable to other AI companies. Where can people find out more about you online?

[0:16:01.7] JW: People can follow me on LinkedIn, I’m pretty active on LinkedIn, yeah.

[0:16:05.2] HC: And Segmed’s website is segmed.ai, is that right?

[0:16:08.6] JW: Yes. Yes, you can find my information there.

[0:16:11.0] HC: I’ll include links to both in the show notes. Thanks for joining me today.

[0:16:15.2] JW: Yeah, thank you very much, Heather, for having me.

[0:16:16.9] HC: All right everyone, thanks for listening. I’m Heather Couture and I hope you join me again next time for Impact AI.

[0:16:22.8] JW: Thank you.

[END OF INTERVIEW]

[0:16:27.2] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend, and if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]