- Tobias's professional background and why he created Kheiron Medical Technologies.
- Learn about the amazing work Kheiron Medical Technologies does and why it is important.
- Overview of why detecting breast cancer early is so vital and the challenges of screening.
- How AI can help resolve the current challenges in cancer screening.
- Tobias explains the machine learning process and how the model is trained.
- The complications encountered in working with radiology images.
- Find out why image quality is key to the machine learning process.
- How he is able to account for the variation of technology and methods used.
- Outline of the regulatory process and how it impacts machine learning model development.
- Hear advice Tobias has for other leaders of AI-powered startups.
- Details about how Tobias approaches improving the models over time.
- Tobias tells us what Kheiron Medical Technologies has planned for the future.
[00:00:00] HC: Welcome to Impact AI, the podcast for startups who want to create a better future through the use of machine learning. I’m your host, Heather Couture. Today, I’m joined by guest, Tobias Rijken, CTO and co-founder of Kheiron Medical to talk about early cancer detection. Tobias, welcome to the show.
[00:00:18] TR: Hi, Heather. Hi. Thank you for having me. Looking forward to the show today.
[00:00:22] HC: Tobias, could you share a bit about your background and how that led you to create Kheiron?
[00:00:26] TR: Yeah, happy to do that. So, I was born and raised in Amsterdam, born into sort of a medical family, just like my co-founder, surrounded by general practitioners growing up and I was sort of the odd one out doing mathematics and computer science, initially in Amsterdam, and then I moved to London, which was a great place to be in machine learning at that time. I was at UCL, at the Computational Statistics and Machine Learning group, which is mostly known as the place where DeepMind was founded.
So, I’ve been exposed to some of the work. I worked with Professor Thore Graepel, who was one of the lead scientists on AlphaGo. And really, what I liked so much about machine learning is the ability it possesses to solve real-world problems. And in my opinion, real-world machine learning is very different from academic machine learning.
So, when I met Peter, who had just finished his Ph.D. at Oxford in high-performance computing, we got together and we realized that one, we have very complementary skill sets. But also, we understand the importance of deeply understanding the domain when you’re trying to solve a real-world problem, and Peter’s mother is a radiologist. He grew up in Budapest, and essentially, the radiology department was his daycare. So, we realized, well, the two of us appreciate the importance of the domain, we understand the technology and we started Kheiron. And really, what we recognized was that there is a growing shortage of radiologists, and this is a structural problem. The problem that we’re seeing is that the number of scans that we’re taking is growing at a much faster rate than the number of radiologists that we’re training. And this has all kinds of impacts down the line, really impacting cancer pathways in particular, and that’s where we decided to focus our efforts.
[00:02:13] HC: So, what does Kheiron do? And why is this important for cancer detection and for patient survival?
[00:02:18] TR: So, the way that we look at this is, we looked at the entire cancer patient pathway, and what we noticed when we look at each stage in the pathway, going all the way from cancer detection to diagnosis, to treatment planning, to follow-ups afterward, is that there are what we call information problems along the entire pathway. So, either the right information isn’t available, or it is inaccurate, or there’s missing information. We see AI as a tool to help address those information problems.
So, one of the things that we started was cancer detection, which is one of the most important building block tasks in the cancer pathway. Because before you can do anything else, you first need to find the cancer. Is there cancer here, yes or no? So that was our starting point. And one of the biggest areas where cancer detection plays a big role is in screening. Breast cancer screening in particular. And breast cancer screening is one of the most established screening programs we have today. We know that finding cancer early has a tremendous impact on both patient survival, as well as the cost of further treatment down the line.
So, for example, finding breast cancer at stage one versus stage four means eight times cheaper treatment costs and a much higher survival rate. So, the importance of finding cancer early is paramount. But the problem is, finding cancer early requires large-scale screening, which is very resource intensive, and that’s where AI comes in. So, what we started doing is really building AI that can do the same task as a radiologist, which is in screening, making a determination, should this woman be called back for further examination, yes or no? That’s the primary task that a radiologist needs to do at breast cancer screening. And this is what we set out to develop.
[00:04:17] HC: So, how do you tackle that task with machine learning? How do you train that type of model?
[00:04:23] TR: Yeah, so whenever we try to train a machine learning model, of course, the very first step we start doing is very, very tightly defining what the task is that needs to be solved. Because in a way, the task that you define, directly relates to and influences the loss function that you’re using to optimize your model. So again, we chose to solve this binary decision task. Should we call back that woman, yes or no? And so, what we ended up doing is, once you define the task, is okay, how do we build the right dataset to do that? Also, how does this fit into the wider workflow? And that’s actually where our technology fits very neatly into the existing workflow.
So, in the Western world, how breast cancer screening is organized, particularly in Western Europe is as a double-blinded screening program. So, there are two radiologists who both independently read each case, and they determine, “Should we call back this woman, yes or no?” When they agree with each other, great, you do whatever they agree on. When they disagree, there is a third radiologist, who arbitrates the case and makes the final decision based on the input of the first two.
Now, because we have developed a model that learns the same task, we can slot it into this workflow very neatly. Our product called Mia (Mammography Intelligent Assessment), will read beside the radiologist in a completely independent way. When the radiologist and Mia agree with each other, you do whatever they agree on. When they disagree, you still go to the arbitrator, so the workflow doesn’t change. However, we can now save roughly 50% of the workload, because Mia becomes one of the readers. And that’s quite impactful in how we manage to get this into the real world.
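The double-reading workflow described above can be sketched in a few lines of Python. This is an illustration only, not Kheiron's implementation: the function names `screening_decision` and `count_human_reads` are hypothetical, and the example assumes the AI reader disagrees with the first radiologist exactly as often as a second human reader would, which may not hold in practice.

```python
from typing import Callable, Iterable, Tuple


def screening_decision(reader_a: bool, reader_b: bool,
                       arbitrate: Callable[[bool, bool], bool]) -> bool:
    """Return True if the woman should be called back.

    When both independent readers agree, their shared decision stands.
    When they disagree, a third reader (the arbitrator) decides.
    """
    if reader_a == reader_b:
        return reader_a
    return arbitrate(reader_a, reader_b)


def count_human_reads(decisions: Iterable[Tuple[bool, bool]],
                      ai_is_second_reader: bool) -> int:
    """Count human reads, including arbitration reads.

    Each case costs two human reads in the standard workflow, or one
    when the AI takes the second-reader slot. Disagreements always cost
    one extra human read by the arbitrator.
    """
    reads = 0
    for a, b in decisions:
        reads += 1 if ai_is_second_reader else 2
        if a != b:
            reads += 1  # arbitration
    return reads


# Ten illustrative cases: eight agreed negatives, one agreed positive,
# one disagreement.
cases = [(False, False)] * 8 + [(True, True), (True, False)]
print(count_human_reads(cases, ai_is_second_reader=False))  # 21
print(count_human_reads(cases, ai_is_second_reader=True))   # 11
```

In this toy example the human reads drop from 21 to 11, which is in line with the "roughly 50%" figure mentioned in the conversation; the exact saving depends on how often arbitration is needed.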
[00:06:06] HC: What kinds of challenges do you encounter in working with radiology images?
[00:06:11] TR: Where do we start? The challenges are many. I mean, maybe the way I think about this, there’s of course, a technology challenge, a data challenge, and then I think one of the most important ones is sort of, “How does this work in the real world?” which is teaching us all kinds of new things and showing us all kinds of new challenges. But first things first, when we set out to build the technology, we were building our dataset. Of course, privacy was something that came up in the early days. But the good thing about this kind of technology is we don’t need identifiable data. What we need is a dataset of images, pixel data with some metadata, notably metadata around, for example, what manufacturer does the image come from? What kind of post-processing software would you use? What was the compression level, et cetera? And, of course, the outcomes.
Now, that proved a bit more of a challenge, because what we needed the model to learn was, is there cancer in this case, yes or no? Now, then the question is, what do you consider ground truth? Are you using the radiology opinion? Are you using real outcomes data? What we decided is, well, the radiologist isn’t perfect, for one. So, while it can still give useful training input, we decided that it’s not the strongest level of information we can get. And we decided to go a layer deeper and get as close as possible to the ground truth.
Now, you can get that for a positive image, because when there’s been a biopsy, great, you have a biopsy result. But there’s an important nuance there, which is that when a radiologist missed the cancer, and the woman wasn’t called back, then there is no biopsy result until much later, when actually the cancer was found at a later stage. So, there’s a lot of missing data that influences how your labels are constructed, and potentially creates bias in the dataset. And I think that’s one of the key things that we were aware of at a very early stage, and we took a lot of care to address that. So, we built our dataset very carefully, understanding the potential sources of bias in the data and, as much as possible, trying to eliminate that bias, or to understand it where we weren’t able to eliminate it, and creating diversity where possible.
[00:08:33] HC: So, I understand that image quality is one of the key things that you’re looking for. Why is image quality important in creating a robust solution as you’ve built? And is machine learning used for that aspect?
[00:08:45] TR: Yeah, yeah, absolutely. So, image quality plays a role in two parts of our work at Kheiron. One is the image quality of our own datasets because those datasets are being used to develop our technology and our models. So, we want to understand — we want to make sure that the quality of the data is high. And the second part is actually, image quality is important in clinical practice. When the breast in the mammogram is not well positioned, then this is known as a technical recall, and the woman has to be called back to take another scan.
So, we’ve actually built a product to help improve image quality, and we call this Mia IQ. And this is a product that’s completely utilizing machine learning to assess how well the breast is positioned and give feedback to the radiographer on how they can improve their positioning technique. And also, when an image is not well positioned at all, give a suggestion to retake the image. There is an important trade-off here between sensitivity and specificity. Because of course, you want to make sure that all your images are as well positioned as you can, but there’s a cost to a technical recall. Because it means that a woman needs to be scanned again, which means extra exposure to X-ray. So, you can’t just indefinitely say, “Okay, let’s do another scan. Let’s take another scan.” And sometimes it’s just not possible to get a better-positioned image.
For example, if the woman is in a wheelchair and is not able to stand up properly to get into the mammography machine. So, really understanding that trade-off between sensitivity and specificity and also for our users to understand, okay, when do we need our own judgment call to say this is the best image we can make?
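The sensitivity and specificity trade-off mentioned here can be made concrete with a small Python sketch. It assumes a model that outputs a score for "this image should be retaken" and a reference label for each image; the scores, labels, and thresholds below are invented purely for illustration, and `sensitivity_specificity` is a hypothetical helper rather than part of any product.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float],
                            labels: Sequence[int],
                            threshold: float) -> Tuple[float, float]:
    """Compute sensitivity and specificity at a given score threshold.

    labels: 1 if the image truly needed a retake, 0 otherwise.
    An image is flagged for retake when its score >= threshold.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up scores and labels for six images.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# A low threshold catches every poorly positioned image but triggers
# more unnecessary retakes (extra X-ray exposure).
print(sensitivity_specificity(scores, labels, threshold=0.2))
# A high threshold avoids unnecessary retakes but misses some images.
print(sensitivity_specificity(scores, labels, threshold=0.7))
```

Lowering the threshold raises sensitivity at the cost of specificity, and the reverse holds when raising it; where to set it is the judgment call discussed above.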
[00:10:35] HC: Is machine learning used in Mia IQ to identify whether the positioning is done right and how to improve it, perhaps?
[00:10:43] TR: Yeah, yeah, absolutely. So actually, machine learning is used at various stages in this product pipeline. So, at the very first stage, machine learning is used in a segmentation model and object detection to find important, I call them anchor points, in the anatomy of the breast. So, for example, the muscle, the nipple, and the breast wall are all localized. And the next layer on top of that, then essentially builds and derives features that follow certain criteria on breast positioning. On top of that, we then have a predictive model that sort of gives a probability of how likely this image is to be a tech recall. So, there are multiple layers of predictive models here.
Now, in our version of the product where the radiographer is being trained, the radiographer can then also give us input on this, such as, “This is actually a criterion that I don’t agree with,” or, “This is a criterion where I couldn’t do any better because of X, Y, Z reasons.” And that data is then fed back into our machine learning system.
[00:11:50] HC: How do you ensure that your models work for images from different scanners and different medical centers and all the other variations like that?
[00:11:58] TR: Yeah, that’s a great question. I’m happy you asked that. So, when we started building our products, and when we started building Mia, one of the things we said is we want Mia, and we need Mia to work for every woman everywhere. Whether they come from different centers, whether one center has a scanner from vendor A or vendor B, that shouldn’t matter. It should really be that every time a mammography exam is sent to Mia, the results are reliable.
So really, that journey starts with data and making sure you have a very diverse dataset. So, we didn’t only invest in getting a very large dataset to develop our models. But more importantly, we wanted a very diverse dataset. So, before actually starting the process of exploring data deals, we did a lot of research on our own to find, okay, which centers have what kind of hardware devices? What are the demographics? What kind of breast density distribution do we expect in the UK, for example? Do we have enough very dense breast cases or not? And if not, where in the world should we go to get that data?
So, we tried to get a very, very diverse dataset, where we look not only at demographic or genetic factors but also at the different machine hardware vendors that they have. But we went layers deeper, actually. So, having the hardware manufacturers as one variable is one thing, but different hardware vendors have different types of machines, and even different versions of the post-processing software. And we made sure that we have a representative dataset of all different types.
So, this sounds like very time-consuming work. And it is, and we did that because we realized that this is what is needed to build a model that works for every woman everywhere. So, that was step one, like really, really investing a lot in our data. Then, I mean, we’ve done all types of work on using data augmentation techniques. We’ve done some work on generative models, and a couple of years ago, when GANs had just been introduced, we were training GANs to generate new mammograms. That was really interesting. We’ve used that as sort of a data augmentation technique. I mean, we’ve learned a lot from that, especially since generative models are incredibly powerful. But they’re only as powerful as the data that you train them on, and they can encode and learn biases that are present in those datasets. So, when you use those techniques, I think it’s very important that researchers and machine learning engineers are aware of those potential biases, and we’re actively working to deal with that.
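One simple way to audit the kind of dataset diversity described here is to count cases per combination of metadata fields and flag the combinations that are thin or missing. The sketch below is an assumption about how such a check might look, not a description of Kheiron's tooling; the field names (`vendor`, `software`), the minimum count, and the helper `underrepresented_strata` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def underrepresented_strata(records: Iterable[Mapping[str, str]],
                            keys: Sequence[str],
                            min_count: int,
                            expected: Iterable[Tuple[str, ...]] = ()
                            ) -> Dict[Tuple[str, ...], int]:
    """Return strata with fewer than min_count cases.

    A stratum is one combination of the metadata fields in `keys`.
    `expected` lists combinations that should be present, so that
    strata with zero cases are reported as well.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    for stratum in expected:
        counts.setdefault(stratum, 0)
    return {s: n for s, n in counts.items() if n < min_count}


# Made-up metadata for four images.
records = [
    {"vendor": "A", "software": "1.0"},
    {"vendor": "A", "software": "1.0"},
    {"vendor": "A", "software": "1.0"},
    {"vendor": "B", "software": "2.1"},
]
gaps = underrepresented_strata(
    records,
    keys=("vendor", "software"),
    min_count=2,
    expected=[("C", "1.0")],
)
print(gaps)
```

The same check extends to demographic or breast density fields by adding them to `keys`; a report like this only shows where the gaps are, and deciding where to source the missing data remains the manual research step described above.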
[00:14:42] HC: So, were those generative models helpful for your product in the end? Or was it more of a learning experience along the way?
[00:14:48] TR: I would say a bit of both. I think what we have realized instead is that the generative model helped us add a little bit, but it’s not a substitute for doing your first steps really well. Again, getting a very, very representative dataset and doing the work to build that dataset and getting it very clean, getting representative samples from a wide range of demographics, then doing proper augmentation techniques. And then once you do all of that, a generative model can help, but it’s not a substitute for the first steps.
[00:15:23] HC: So, I imagine that that representative dataset is also very important in going through the regulatory process. Are there key aspects of that or other influences of the regulatory process that affect the way you develop machine learning models?
[00:15:38] TR: Yeah, yeah, absolutely. And so, we have written a bunch about this, and one of the things that we believed was really important when we were speaking with regulators is that the dataset you use for your regulatory submission needs to be as representative as possible. And that creates a challenge in screening because screening is done on a large population.
In Europe, we have population-based screening programs, which means that between a certain age range, every woman gets an invite, typically every two years to get screened. So, we said, “We need to have a dataset just like that.” And what we did is, we took an entire screening population, and we didn’t create any sub-samples. So, we essentially said, “We are not going to sub-sample this dataset according to a set of positives and a set of negatives, we’re going to just uniformly sample across this entire set.” So, the clinical trial that we did consisted of more than 275,000 cases and that creates all kinds of challenges. But we argue that that is the way you need to do this. Because the challenge when you sample uniformly from your whole dataset is that there will be cases you’ve sampled, where you may not have ground truth.
So, for example, you will have a part of your dataset where you have biopsy information, and maybe that tells you that okay, there’s a sample of positives here. Great. Then you have a set of cases where you have enough follow-up to determine that this was indeed a negative case. So, what we do is we look at, after a certain amount of years of follow-up, if the follow-up case is also negative, then we say the first case was indeed negative. But there’s a huge amount of cases that sort of sits in the middle that are part of your screening distribution, but where you can’t quite say whether it’s positive or negative. What other groups in this industry have done is just say, “Okay, well, that set, we’re just going to ignore, and we’re going to upsample the negatives by just continuously sampling from the proven negative set.”
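The three groups of cases described here (biopsy-confirmed positives, follow-up-confirmed negatives, and cases in the middle) can be expressed as a small labeling rule. The Python sketch below is an assumption for illustration: the conversation only says "a certain amount of years" of follow-up, so the three-year default is invented, and the names `Label` and `assign_label` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Label(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNKNOWN = "unknown"


def assign_label(biopsy_malignant: Optional[bool],
                 followup_years: float,
                 followup_negative: Optional[bool],
                 min_followup_years: float = 3.0) -> Label:
    """Assign a reference label to one screening case.

    biopsy_malignant: True/False if a biopsy result exists, else None.
    followup_negative: True if the later screening round was negative,
        False if not, None if there was no follow-up.
    min_followup_years: assumed minimum follow-up needed to confirm a
        negative (illustrative value, not taken from the interview).
    """
    if biopsy_malignant is True:
        return Label.POSITIVE
    if followup_negative is True and followup_years >= min_followup_years:
        return Label.NEGATIVE
    # Not enough evidence either way: keep the case, but mark it.
    return Label.UNKNOWN


print(assign_label(True, 0.0, None))    # Label.POSITIVE
print(assign_label(None, 3.5, True))    # Label.NEGATIVE
print(assign_label(None, 1.0, True))    # Label.UNKNOWN
```

Keeping an explicit `UNKNOWN` label, rather than dropping those cases, is what lets the evaluation set stay a uniform sample of the screening population.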
But that’s problematic, because then you’re essentially creating a skewed distribution, and you’re saying, “Well, for the cases where we’re really certain, we’re just going to get more of these cases.” And those are essentially easier cases. So, we decided not to do that, and to essentially think about, we take this unfiltered sample from the screening distribution, and we’re just going to deal with that. And that meant that we had to think very deeply about what kind of metrics are we going to use in our evaluation.
So, sensitivity and specificity, you cannot measure if you don’t have a reference. But there are other metrics that are more specific to our industry that can be used. For example, a metric called the recall rate, which, confusingly, should not be confused with recall. The recall rate is essentially the number of positives that you find, divided by your total population. So, what is the percentage of cases in the population that I call positive and hence decide to recall? Another example is the cancer detection rate.
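The two screening metrics named in this answer, the recall rate and the cancer detection rate (cancers found per 1,000 women screened), are simple ratios over the whole screened population. A minimal Python sketch follows; the function names are hypothetical and the counts are made up for illustration.

```python
def recall_rate(n_recalled: int, n_screened: int) -> float:
    """Fraction of the screened population called back for assessment.

    Not to be confused with recall (sensitivity) in machine learning.
    """
    return n_recalled / n_screened


def cancer_detection_rate_per_1000(n_cancers_detected: int,
                                   n_screened: int) -> float:
    """Cancers detected per 1,000 women screened."""
    return 1000 * n_cancers_detected / n_screened


# Illustrative numbers only: 10,000 women screened, 400 recalled,
# 80 cancers confirmed among those recalled.
print(recall_rate(400, 10_000))                    # 0.04
print(cancer_detection_rate_per_1000(80, 10_000))  # 8.0
```

Neither ratio needs a reference label for the cases that were not recalled, which is why they can be computed on an unfiltered screening sample.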
So, how many cancers do I find in 1,000 members of the population and those are unbiased, assuming you have a screening population. And we have essentially argued that this is the most important way if you want to show how your product will perform in the real world. Because in the real world, your model will run on a real screening distribution. So, it needs to be tested on a real screening distribution as well.
[00:19:14] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[00:19:19] TR: Yeah. I think one area that makes me very excited these days is when I look at some of the most successful AI companies, there has been a high degree of sort of vertical integration. This is something that we’ve been looking at as well, working with our partners to get that feedback loop of data coming back to us so that we can iterate very quickly and see where and how to improve the algorithm.
Another, I think, example that I think is a good one is Tesla. I have a huge amount of respect for the Tesla AI team. And I think what’s a nice model is that within the same company, they own sort of the cars or well, they don’t own the cars, but they have access to the data generated from the car, all the preprocessing methods, as well as the AI team. And on a very frequent basis, they can retrain models and release models into the wild and get feedback on those models very quickly. That allows them to very quickly improve performance. I think, other companies that have sort of closed that loop and create a sort of vertical integration have been quite successful in AI. I mean, some of the more consumer-focused apps, I think, are obvious examples.
I mean, think of apps like Instagram, where these companies have access to a huge amount of data as well as labels because they’re just being created by their users. But these are companies that have also been able to very quickly deploy into the real world and iterate on their data. So, I think when a company manages to close that loop, that’s when you can start really building your AI system, and start to automate the development of new models. So, that model development is not just done by a set of researchers in your group. But actually, model development is becoming a process that is engineered by your engineering teams, and model updates and models being pushed to deployment are all done automatically. I think when you have that, then the rate at which your models are improved will really accelerate.
[00:21:23] HC: I can certainly see the benefits of that feedback loop. How do you think that could fit in with medical applications? So, when you’re under the constraints of regulatory approval?
[00:21:32] TR: Yeah, that’s a great question. I mean, indeed, companies in this space have to adhere to the medical regulations. These regulations are there for a reason, of course. This is a very high-impact domain and the tasks that our models perform have a real impact on humans. So, that’s something that we have to work with. One thing that we have always done is when we look at our product portfolio is say, “Okay, are there any kind of models or applications that are lower risk, where, for example, we can close that loop a lot faster?” We talked about Mia IQ, the application for quality control, and image quality control. That’s actually a lower-risk device. And there, we’ve been able to actually deploy much faster, and the learnings of that have actually influenced our model development on our more highly regulated models.
The key learning for me there was, okay, we’re not just building single models here. But we’re building a portfolio of products. And which products can help us to learn more about the others is what I’ve found quite valuable. Another thing that we’ve done is making sure we have a very rigorous internal testing setup. This goes back to what I was saying about the need to have representative datasets for your clinical trials, the same holds true for your internal test sets. One thing we do is run these internal tests as if that was a screening operation, and deploy new models against that.
It’s not as good as getting live feedback. But at least it allows you to push models to some form of staging environment faster, and even to a real production dataset. So, there’s a concept that we’ve called Shadow Models, where a new model sort of sits behind the real model. The new model is not regulated yet. It also doesn’t actually write its results back to the user. But it’s just observing; it’s just looking at the real production dataset, and that already teaches us a lot about the model, and we can make improvements in the model. So, that’s one strategy that has worked well for us, and I can definitely highly recommend that.
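The shadow model idea described here can be sketched as a thin wrapper: the production model's output is the only thing returned to the user, while the candidate model runs on the same input and its output is only recorded for later comparison. This Python sketch is a generic illustration under that assumption, not Kheiron's system; the class name `ShadowDeployment` and its methods are hypothetical.

```python
import logging
from typing import Any, Callable, List, Tuple


class ShadowDeployment:
    """Run a candidate model silently alongside the production model."""

    def __init__(self,
                 production: Callable[[Any], Any],
                 shadow: Callable[[Any], Any]) -> None:
        self.production = production
        self.shadow = shadow
        # (production_output, shadow_output) pairs kept for offline review.
        self.comparisons: List[Tuple[Any, Any]] = []

    def predict(self, case: Any) -> Any:
        """Return the production result; record the shadow result only."""
        live = self.production(case)
        try:
            candidate = self.shadow(case)
            self.comparisons.append((live, candidate))
        except Exception:
            # A failing shadow model must never affect the live result.
            logging.exception("shadow model failed; production unaffected")
        return live

    def disagreement_rate(self) -> float:
        """Fraction of recorded cases where the two models differ."""
        if not self.comparisons:
            return 0.0
        differing = sum(1 for live, cand in self.comparisons if live != cand)
        return differing / len(self.comparisons)


# Toy models: each "case" is just a suspicion score.
deployment = ShadowDeployment(production=lambda s: s > 0.5,
                              shadow=lambda s: s > 0.3)
results = [deployment.predict(s) for s in (0.2, 0.4, 0.6)]
print(results)                         # [False, False, True]
print(deployment.disagreement_rate())
```

The cases where the two models disagree are a natural place to start reviewing a candidate before it goes through any formal evaluation.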
[00:23:46] HC: Where do you see the impact of Kheiron in three to five years?
[00:23:49] TR: Yeah, I’m quite excited about what we’ll be doing in the next couple of years. This year, we saw some of the first feedback coming back from our deployments, where we’ve begun to find additional cancers that were missed by two radiologists. So, we have been able to find extra cancers. Next year, we really want to scale that up.
Again, we work in the screening domain. The great thing about that is that we work on large volumes of data. We’re working currently on a project in the UK where once we finish those deployments, we’ll be screening 500,000 women a year. At the extra rate of cancers that we have found, we can really have a meaningful impact and find many additional cancers at an early stage. So, that’s something that we will see in the coming years, to really scale our technology to a larger set of women.
Longer term, we see opportunities for bringing screening to places where screening is not possible today. As I mentioned, the gold standard in breast cancer screening is to have double-blinded screening, but that’s of course a very resource-intensive system. You need two radiologists for each case. And funnily enough, in the majority of those cases, two radiologists agree with each other, because screening is highly imbalanced in terms of the proportion of cancers.
Luckily, in the screening distribution, less than 1% of cases are expected to have cancer. But that means that the two radiologists are doing a lot of redundant work. That means that even for many countries, screening is just not an option because they don’t have enough radiologists. And as a result, many cancers are found late. But even in countries where we do have screening programs today, people are worried about the sustainability of screening programs and how long we can actually maintain these screening programs and we want to help keep screening sustainable, and affordable, and bring it to places where currently people cannot be screened.
[00:25:50] HC: This has been great. Tobias, your team at Kheiron is doing some really interesting work for radiology and cancer detection. I expect that the insights you’ve shared will be valuable to other AI companies. Where can people find out more about you online?
[00:26:03] TR: Yeah, we’re active on Twitter. We do a lot of – our work is being presented at academic conferences. So, in the radiology community, the big radiology conferences are RSNA in Chicago, and the European equivalent of that is ECR. So, stay tuned there. We’re posting a lot of our academic work on those venues. But we’re doing more and more work also on bias detection and monitoring AI in the real world. And I think that’s an area I’m very excited about.
Because for me, when I started this company, this was not about building a great model that has a great performance on a test dataset. This is about getting AI into the real world. And when you get AI into the real world, you start to learn that AI is more than just a model, an AI system needs to consist of many other aspects such as monitoring how the model performs in the real world and what to do when datasets shift. I think COVID was a super interesting example for us because, during the pandemic, many screening programs actually stopped screening. And then when they resumed, of course, when people stopped screening, that doesn’t mean that cancer stops growing. So, when these screening programs resumed, now all of a sudden, the prevalence of cancer in the population changed. And also, the distribution of the sizes of cancers has changed because, of course, there was no screening for a while. So, cancer had more time to grow. That had an impact on the model. I mean, recall rates went up. Understandably, they had to go up because there were more cancers now. But what does that mean for being able to keep the consistency of these models? That has created a whole new line of work for us at Kheiron, that we’ll be sharing about a lot more.
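One basic way to notice the kind of shift described here is to compare the recall rate in a recent window against a baseline period. The Python sketch below uses a standard two-proportion z-score for that comparison; it is an illustration only, the counts are invented, and the helper name `recall_rate_shift` and any alert threshold applied to the z-score are assumptions rather than anything stated in the interview.

```python
import math
from typing import Tuple


def recall_rate_shift(baseline_recalled: int, baseline_total: int,
                      recent_recalled: int, recent_total: int
                      ) -> Tuple[float, float, float]:
    """Compare recall rates between a baseline and a recent window.

    Returns (baseline_rate, recent_rate, z), where z is the
    two-proportion z-score using the pooled recall rate.
    """
    p0 = baseline_recalled / baseline_total
    p1 = recent_recalled / recent_total
    pooled = (baseline_recalled + recent_recalled) / (baseline_total + recent_total)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / baseline_total + 1 / recent_total))
    z = (p1 - p0) / se if se > 0 else 0.0
    return p0, p1, z


# Made-up counts: 400 of 10,000 recalled before a pause in screening,
# 120 of 2,000 recalled after screening resumed.
p0, p1, z = recall_rate_shift(400, 10_000, 120, 2_000)
print(p0, p1, round(z, 2))
```

A large z-score says only that the recall rate has moved; as in the COVID example, the rise may be appropriate because prevalence really changed, so the signal is a prompt for investigation rather than proof that the model has degraded.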
[00:27:47] HC: Thanks for the great insights today. Thanks for joining me, Tobias.
[00:27:51] TR: Thank you for having me.
[00:27:51] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you join me again, next time, for Impact AI.