Could there be a future where not using AI is considered unethical? With the growing efficiency created by AI support, radiologists are able to focus on the most important aspects of their work. During this conversation, I am joined by Stefan Bunk and Christian Leibig from Vara. Tuning in, you’ll hear about the essential practice of maintaining a high standard of data quality and how AI technology is revolutionizing breast cancer detection and treatment. We discuss the relevance of German innovation and research for a global community, and the step-by-step process that Vara adopts to test and introduce AI products. You’ll also hear about Stefan and Christian’s vision for the future of Vara. Don’t miss this episode, packed with powerful insights!
Key Points:
- Introducing Stefan Bunk and Christian Leibig from Vara.
- Vara’s mission for breast cancer outcomes in line with WHO’s Global Breast Cancer Initiative.
- The role of machine learning in Vara’s technology.
- What the AI technology predicts and the software that goes into this.
- Why it is essential to maintain a high standard of data quality.
- The relationship between images from earlier exams and current procedures.
- How models are trained to manage different variations.
- The relevance of German data for global application.
- Why it is important to have strong processes around AI deployment.
- What it means to run in Shadow Mode first and why Vara chooses to do this with AI products.
- How they established the best way to integrate AI into the workflow.
- The crucial role of trust in machine learning models.
- Monitoring AI models constantly and creating the means to react quickly.
- Where Stefan and Christian see the impact of Vara in five years.
- The enduring goal of Vara: to support radiologists as they focus on the most important factors.
- Considering the possibility that not using AI will become unethical in the future.
Quotes:
“Our ambition is to find every deadly breast cancer early. Breast cancer is actually the most common cancer worldwide; one out of eight women will have it at some point in their lifetime.” — Stefan Bunk
“At Vara, we want to empower health systems to systematically find more cancers much earlier and systematically downstage cancers.” — Stefan Bunk
“A machine learning model can actually outperform a radiologist with a single image, but nevertheless can still benefit from taking comparisons across images into account.” — Christian Leibig
“When you roll out a technology such as AI, which is a technology that is hard to understand, and you cannot always predict how it behaves in certain edge cases, we believe there must be strong processes around it wherever you deploy your AI.” — Stefan Bunk
Links:
Stefan Bunk on LinkedIn
Christian Leibig on LinkedIn
Vara
LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.
[INTRODUCTION]
[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.
[EPISODE]
[0:00:34] HC: Today, I’m joined by two guests: Stefan Bunk, the founder and CTO of Vara, and Christian Leibig, the Director of Machine Learning. We’re going to talk about breast cancer screening. Stefan and Christian, welcome to the show.
[0:00:47] SB: Happy to be here.
[0:00:49] CL: Hello. Thank you.
[0:00:51] HC: Stefan, could you share a bit about your background and how that led you to create Vara?
[0:00:54] SB: Sure, yes. I’m originally from Germany, and studied machine learning and software engineering up until 2017. I did my bachelor’s and master’s at the HPI in Potsdam, which is right next to Berlin. I was always interested in health during my studies and already collaborated with the Charité Hospital here in Berlin, so this intersection of health and machine learning was always very interesting to me. After I finished my studies, I wondered: should I go down the PhD route, the startup route, or the corporate route? The latter was out quite quickly, just because I was always very impact-driven. I wanted to have impact with my work. So basically, it was between PhD and startup.
Then again, because I wanted to have impact very quickly, I went down the startup route and joined Merantix, a Berlin-based venture builder for AI startups, in 2017. I was their first employee and basically started building out their machine learning stack. That’s also where I met my cofounder, Jonas. Then, after a year of figuring out various different directions and hypotheses, we founded Vara in 2018.
[0:02:02] HC: Christian, what about you? What path led you to Vara?
[0:02:05] CL: Yes. I’m from Germany as well, the south of Germany. Originally, I’m a physicist by training; I did theoretical and experimental physics. I was always trying to find something that I was truly passionate about, and finally found it toward the end of that path, working in computational neuroscience. In that field, I was developing methods, essentially using machine learning back then, before the deep learning hype, to understand neural activity and to disentangle simultaneous neural recordings from semiconductor-based hardware.
Basically, that brought me into programming, machine learning, et cetera. Having done hard experiments before, I found it very appealing to have this much faster iteration cycle, to connect the conceptual and analytical way of drafting things with a way to implement, quickly test, and iterate. I was always interested in building things, because that really helps to understand. And somehow, there was always an interesting project when I was in academia. During my PhD work, for example, I produced software that is nowadays distributed and used in different labs around the world.
I moved on to do a postdoc at the University Hospital in Tuebingen. Tuebingen is a big place for machine learning, and a lot of neuroscience happens there as well. I again happened to work with healthcare data, and maintained this route of applying machine learning to different use cases, with a focus on understanding how you can make neural networks better at knowing when they don’t know. More specifically, I worked in applied Bayesian deep learning to quantify the uncertainty of neural network predictions, in order to know when to make predictions and when to abstain from them. That was also very interesting, along with some industry projects, and so on and so forth. But essentially, it was an academic route, and I had never planned that. I definitely wanted to get out there; I wanted to work more in a team and have engineering be more important, as opposed to just serving a paper. I then also joined Merantix as an early employee. What was interesting there was that it was also a very academic style of working. I firmly believe in approaching these sorts of problems with a scientific mindset, but at the same time, I was working in a team and building something. Then, when Vara was founded, after they had explored a couple of different projects, it was a good fit, because it’s a vision problem, a perception problem that is quite challenging, but in a quite standardized environment. So basically, I joined and stayed.
[0:04:42] HC: What does Vara do and why is this important for breast cancer outcomes?
[0:04:47] SB: Yes. Our ambition is to find every deadly breast cancer early. Breast cancer is actually the most common cancer worldwide; one out of eight women will have it at some point in their lifetime. The incidence rate is only expected to increase. Breast cancer is a disease of older populations, and with the world in general getting older, there’s an expected 30% increase in breast cancer incidence by 2030. That’s a bad thing. Most cancers are also found at a late stage, with poor survival chances and high treatment costs.
In that sense, breast cancer is actually a cancer that’s quite treatable. If you find it early, if you find it in stage one or two, it’s very easy to treat and also rather cheap. Basically, in line with WHO’s Global Breast Cancer Initiative, we at Vara want to empower health systems to systematically find more cancers much earlier and systematically downstage cancers. For that, we offer our AI-based population screening solution.
Being originally from Germany, we developed the solution in Germany, together with the German nationwide breast cancer screening program, which is a very big program. Three million women are screened every year, and roughly 40% of those are already being screened on Vara’s platform today: over a million women per year. You also asked about breast cancer outcomes. Specifically in Germany, we have actually just finished our prospective study. It’s the biggest prospective study in AI mammography screening, with almost half a million women included. We are currently working to publish that data as well, to measure breast cancer outcomes specifically: finding more cancers and avoiding false positives wherever possible.
[0:06:39] HC: What role does machine learning play in the screening technology?
[0:06:43] SB: For us, it plays a very, very central role, I would say, but among others. It was a core hypothesis from our side, from very early on, that we have to build the AI (Christian will tell a bit more about that in a second), but also everything around it in the clinical workflow. The viewer, the worklist, the data monitoring, even down to the invitation management and the biopsy retrieval loop, [inaudible 0:07:05] all software pieces that we build around the ML part, because we believe that, in the end, just ML is not a product. It’s not a product for radiologists to just get a prediction score; we need to be very deeply integrated into the clinical workflow. But obviously, the ML part is at the heart of what we do, and at the heart of how we can improve metrics like cancer detection rate, and also reduce false positives.
[0:07:32] HC: Christian, maybe you can tell me a little bit more about the machine learning, and maybe some examples of the types of models that you’re training. What are they predicting, and what data goes into them?
[0:07:41] CL: Yes, happy to do so. The main system is basically a deep learning-based vision system, which gets mammograms as input. A woman goes for a screen and gets an x-ray taken. At each point in time, that is four images, two views of either breast, in order to image tissue that would otherwise be hidden in a single 2D perspective. Those four images and, if available, images from preceding screening rounds constitute the input, which is consumed by the models.
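For readers who want a concrete picture of the input Christian describes, here is a minimal, hypothetical sketch in Python. The container, field names, and image sizes are illustrative assumptions, not Vara’s actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class ScreeningRound:
    """One screening round: the four standard views (CC and MLO of
    each breast), plus any prior rounds available for comparison."""
    views: Dict[str, np.ndarray]  # keys: "L-CC", "L-MLO", "R-CC", "R-MLO"
    priors: List["ScreeningRound"] = field(default_factory=list)

blank = lambda: np.zeros((256, 256), dtype=np.float32)  # placeholder pixels
exam = ScreeningRound(views={v: blank() for v in ("L-CC", "L-MLO", "R-CC", "R-MLO")})
print(len(exam.views), len(exam.priors))  # 4 0
```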
Ultimately, the task that we are solving is, in principle, much like the radiologists’, though done slightly differently. We want to detect cancer in the images and basically make a recommendation as to whether there is something treatment-worthy or not. It’s exactly what the screening system tries to do, but using slightly different labels, because it’s a very, very challenging task. Radiologists need special training to perform exactly that task, and also have to read a certain volume in order to maintain their certification. That’s also the reason why we put a lot of effort into how we label our data, which we’ll also talk a bit about. But essentially, it’s cancer detection, which means, conceptually, it’s simple. It’s a binary classification problem. Is there something? Yes or no?
Where it gets challenging is, number one, of course, that stakes are high, and we have to reach a very high performance. I mean, things are changing all the time. A few years ago, there was already some hype that radiologists just might not be needed anymore. We were always much more humble and focused on making some predictions really, really well and abstaining from others, which was also inspired by my prior research work. The other aspect, which often comes into play, especially for healthcare systems, is: “Well, is your model actually explainable?” We tend to say that, while explaining predictions might be difficult, because explanations always come with simplifications, something like pointing to where in these large images a cancer actually is, so localizing it, is crucially important.
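To make the abstention idea concrete, here is a minimal sketch of selective prediction: the model only answers when its score is confidently low or high, and defers everything else to radiologists. The thresholds are made-up placeholders; in practice they would be calibrated carefully, as discussed later in the episode.

```python
def selective_predict(prob_cancer, lower=0.02, upper=0.97):
    """Return 'normal', 'suspicious', or 'defer' for one exam.

    The thresholds are illustrative placeholders; in a real system
    they would be calibrated so that confident calls meet the
    required performance and uncertain exams go to radiologists."""
    if prob_cancer <= lower:
        return "normal"       # confident negative: the model speaks
    if prob_cancer >= upper:
        return "suspicious"   # confident positive: flag for review
    return "defer"            # uncertain: abstain, hand to radiologists

print([selective_predict(p) for p in (0.01, 0.50, 0.99)])
# ['normal', 'defer', 'suspicious']
```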
[0:10:01] HC: In training a system like this, you need annotations from radiologists. But even the best-trained radiologists may not agree on something they see. Maybe one thinks it’s cancer, the other thinks it’s not, or they’re not sure about some specific cases. How do you handle these types of disagreements in the annotations in your training data, in order to train a model that’s robust?
[0:10:23] SB: Exactly. As you say, data quality is absolutely crucial. It’s the old garbage in, garbage out topic. Making sure we get data quality right was a very big focus, even before we built our first models in 2018. The cancers are hand-annotated by radiologists. They draw pixel-based annotations, polygons in the image, that specifically highlight the lesions that are malignant cancers. Very importantly, we only ever annotate images with biopsy-proven results. That’s very important, because in breast screening, by design, you have many, many false positives along the screening chain.
To give you some numbers from Germany. In Germany, out of 1,000 women being screened, 880 will be filtered out right away as being normal. But 120 will be reassessed by a bigger group of radiologists, because there is something at least a little bit suspicious in the images that justifies a deeper investigation. Of those 120, 40 will then be reinvited to the screening facility for a so-called recall, where further investigations are done: ultrasound, or MRI, or another mammogram with a certain magnification, or things like that.
Then still, even out of those 40, only 10 are biopsied, and only six actually have a biopsy-proven malignant cancer. As you see, out of 40 women being reinvited, only six actually have cancer. There’s a high number of false positives. So if you just follow what the radiologists think is suspicious, you have a high chance of creating quite a bit of label noise that you don’t want to expose your models to.
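A few lines of arithmetic, using the numbers Stefan quotes, show why radiologist suspicion alone would make a noisy training label:

```python
# Germany's screening funnel per 1,000 women, as quoted above.
screened, consensus, recalled, biopsied, cancers = 1000, 120, 40, 10, 6

ppv_recall = cancers / recalled               # 6/40 = 15% of recalls are cancer
noise_if_suspicion = 1 - cancers / consensus  # 114/120 = 95% of "suspicious"
                                              # exams are not cancer

print(f"PPV of a recall: {ppv_recall:.0%}")
print(f"Suspicious-but-benign share: {noise_if_suspicion:.0%}")
```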
What we make sure is that only biopsy-proven malignant lesions are added, and we only ever show those to the model as cancerous data. Then, even after you make sure that you only annotate that, exactly as you say, there are still certain differences across radiologists. It might be just how big of a polygon they draw around things, or whether they see a certain part as being part of the cancer or not. Even after that process, you still need to do quality assurance on what radiologists are annotating.
For us, that basically means we run automated benchmarking, where we take cases where we already know the annotations, and we check whether a radiologist can actually reproduce those annotations. If not, we don’t use them for training. We also automatically compare against the reports. Simple stuff, right? If the report says there was a lesion in the lower right quadrant of the image, but the radiologist actually entered it in the upper right, then something is off. We also do very rigorous sample testing on the annotations manually, with specifically trained radiologists, to make sure that data quality is really high.
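As an illustration of the report-consistency check Stefan mentions, here is a hypothetical sketch that compares an annotated polygon’s centroid with the quadrant named in the report. Real mammography quadrants are defined relative to the nipple and laterality, so this image-space version is a simplification, and all names are made up.

```python
def polygon_quadrant(polygon, width, height):
    """Assign a polygon's centroid to an image quadrant."""
    xs, ys = zip(*polygon)
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical} {horizontal}"

def annotation_matches_report(polygon, report_quadrant, width, height):
    """Flag annotations whose location contradicts the report."""
    return polygon_quadrant(polygon, width, height) == report_quadrant

# A lesion annotated near the top-right of a 3000x4000 image, while the
# report says "lower right": the mismatch flags the case for review.
poly = [(2500, 300), (2700, 300), (2600, 500)]
print(annotation_matches_report(poly, "lower right", 3000, 4000))  # False
```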
[0:13:28] HC: Yes. Like you said, it really comes down to data quality. I think the same is true across most applications of machine learning, not just medical ones. One of the things you mentioned earlier is that you include images of prior exams in your models. How do you use this temporal information to improve the accuracy of your models?
[0:13:46] CL: Maybe to begin with how radiologists solve the problem: they, many times, actually compare images amongst one another. Not only over time, but also within the same exam. They compare the two different views of one breast, and also compare the same view across breasts, across lateralities, and so on, and so forth. For a large chunk, we don’t necessarily have to build machine learning models inspired by how humans do it. It can help, but it doesn’t have to. But to give you some intuition, it’s certainly easier to spot a difference if you can compare two images, as opposed to having to know in absolute terms whether a single image is normal. For that, you have to have an idea about the distribution of normality and abnormality. This is also where you can see that a machine learning model can actually outperform a radiologist with a single image, but nevertheless can still benefit from taking comparisons across images into account.
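One common way to let a model exploit priors, sketched here under the assumption of a shared encoder, is to embed the current and prior view with the same network and feed the classifier both the current features and their difference. This is an illustrative toy architecture, not Vara’s model.

```python
import torch
import torch.nn as nn

class PriorComparisonModel(nn.Module):
    """Toy sketch: score a current mammogram view together with the
    same view from a prior round. Sizes and layers are made up."""

    def __init__(self, embed_dim=128):
        super().__init__()
        # Shared CNN encoder applied to both time points.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # The head sees current features plus the change vs. the prior.
        self.head = nn.Linear(2 * embed_dim, 1)

    def forward(self, current, prior):
        z_now, z_prev = self.encoder(current), self.encoder(prior)
        features = torch.cat([z_now, z_now - z_prev], dim=1)
        return torch.sigmoid(self.head(features))  # P(malignant)

model = PriorComparisonModel()
current = torch.randn(2, 1, 256, 256)  # batch of 2 single-channel views
prior = torch.randn(2, 1, 256, 256)
print(model(current, prior).shape)     # torch.Size([2, 1])
```

Feeding the difference `z_now - z_prev` gives the classifier an explicit change signal, mirroring the radiologist’s habit of looking for what is new since the last round.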
[0:14:47] HC: Another complication I suspect you might see is variation across different patient populations and scanners, and perhaps changes over time. How do you train your models so that they can handle those different variations?
[0:15:00] CL: That’s a good question. Again, it foremost starts with understanding the domain and the data. To begin with, it was quite clear early on that this is quite a new technology, and there’s a condition for when deploying it is recommended and allowed: the model has to maintain or improve the respective metrics of the established system. For example, if you put such a tool into the hands of a single radiologist, then you have to outperform or maintain their metrics. This is actually something that’s not so trivial to measure. Many publications out there claimed superior performance a while ago, but it’s very tricky. It depends on how you compose your dataset. You have to have a dataset that is representative of what you see in production, obviously.
We put a lot of effort into understanding the different components. For example, as Stefan touched upon earlier, there are different levels of suspiciousness that are triggered in a screening system. First, images are inspected by two radiologists, then there might be a consensus conference of multiple physicians, and then more imaging modalities might be invoked. This gives us a hint about how difficult the exams actually are to interpret.
From a country’s screening program, we can take those statistics in order to build datasets that are representative along this dimension, like difficulty, but also along other dimensions like scanners, age, density, et cetera. Once you have that, you first of all measure average performance, of course. But in our case, where we don’t reveal all predictions, just confident ones, and defer the other predictions to humans, we also have an end-to-end assessment of how we measure performance, which is published. We stratify according to subgroups, and only deploy if the models are good enough for each and every subgroup, in the sense that they are at least as good as the radiologists. That is basically the requirement we have to fulfill, in terms of how you build your models to get there.
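The subgroup requirement can be expressed in a few lines. In this hypothetical sketch with made-up data and a made-up reference value, average sensitivity clears the bar, but stratifying by breast density reveals a failing subgroup, which is exactly what such a gate is meant to catch.

```python
import pandas as pd

# Hypothetical per-exam results. Ground truth is biopsy-proven; the
# reference sensitivity (0.70) stands in for the radiologist baseline
# the model must match in every subgroup.
results = pd.DataFrame({
    "density": ["a", "a", "d", "d", "d", "a", "d", "a"],
    "label":   [1, 0, 1, 1, 0, 1, 0, 0],
    "pred":    [1, 0, 0, 1, 0, 1, 0, 0],
})
reference = 0.70

overall = (results[results.label == 1].pred == 1).mean()
print(f"overall: sensitivity={overall:.2f} pass={overall >= reference}")

for density, group in results.groupby("density"):
    positives = group[group.label == 1]
    sens = (positives.pred == 1).mean()
    print(f"density={density}: sensitivity={sens:.2f} pass={sens >= reference}")
# Overall passes (0.75), density=a passes (1.00), density=d fails (0.50):
# not deployable despite a healthy average.
```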
How you compose the training data distribution is one aspect. But there are also other aspects, for example, making sure to prevent any sort of shortcut learning. To give you an example: in the images, there might be certain burned-in text. Views are called CC and MLO, laterality might be specified, and the font that is used may identify the manufacturer. Because machines are used for a long time at certain screening sites, that creates a very strong correlation with local prevalence statistics, and so on. If you don’t make sure to suppress that signal via preprocessing, arriving at an image representation that is invariant over manufacturers, then you might easily pick up those signals, which is not good. If you don’t measure it correctly, you may see a high performance even though it isn’t real. So we take great care to suppress these signals.
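As a toy illustration of suppressing such signals, the sketch below blanks out near-saturated pixels, where view labels are typically burned in. Real pipelines would locate text more carefully (for example via OCR or connected components); the threshold here is a made-up placeholder.

```python
import numpy as np

def suppress_burned_in_text(image, background=0.0, threshold=0.98):
    """Illustrative preprocessing: blank out near-saturated pixels,
    where labels like CC/MLO and laterality are typically burned in.
    The threshold is a made-up placeholder, not a production value."""
    image = image.copy()
    image[image >= threshold] = background
    return image

rng = np.random.default_rng(0)
mammogram = rng.random((64, 64)) * 0.8   # tissue intensities stay below 0.8
mammogram[2:6, 40:60] = 1.0              # fake burned-in "MLO" label
cleaned = suppress_burned_in_text(mammogram)
print(cleaned[2:6, 40:60].max())         # 0.0: label region blanked
```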
Then, as is standard in medical fields, we also perform external evaluations, where you really make sure that no shortcuts could be preserved in your validation or test sets. So really different screening units, different scanners, different humans producing the data.
[0:18:29] HC: Sounds like there are two major components: a diverse dataset and proper validation. And to do both, you need to really understand the data.
[0:18:37] SB: Exactly. I think another dimension to your question is the typical generalization topic. Our AI was mostly trained on German data, so the natural question is: does it actually transfer to more international populations, and how are we making sure of that? That’s a very essential question to us, and it goes back to the subgroups Christian just mentioned. For example, we know that women in Asia tend to have denser breasts, and screening programs there tend to start at an earlier age; the same holds for some other countries as well.
The subgroups are always very important to us. Because we know that we improve results both in dense and non-dense breasts, and in young and old women, we know that even if the input distribution changes in some other country, we can still perform well in those settings. We’ve actually also seen that our model has transferred out of the box to, for example, data from the UK or data from Sweden, in some parts actually performing even better than in Germany, which also surprised us.
This model aspect of generalization was always very important to us. But it doesn’t stop at the modeling side. When you roll out a technology such as AI, which is a technology that is hard to understand, and you cannot always predict how it behaves in certain edge cases, we believe there must be strong processes around it wherever you deploy your AI.
Things we’re doing there include, for example, when we deploy internationally, really trying to understand the local clinical circumstances. In what center is the AI being integrated? How will it be used? We want to make sure that it’s used correctly. We also always run it in what we call Shadow Mode at first. The AI initially runs in the background and doesn’t have any impact on decision making yet. That way, we can really make sure it’s well calibrated and behaving in exactly the same way as we observe it in Germany, for all the cases.
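Conceptually, Shadow Mode is a thin wrapper around the model: predictions are computed and logged for offline comparison against radiologist reads, but nothing is returned to the live workflow. A hypothetical sketch, with made-up function and field names:

```python
import datetime
import json

def handle_exam(exam_id, images, model, shadow=True):
    """Illustrative shadow-mode wrapper: the model always runs and its
    output is logged for later comparison against radiologist reads,
    but it only influences the workflow once shadow mode is off."""
    prediction = model(images)
    record = {
        "exam_id": exam_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prediction": prediction,
        "shadow": shadow,
    }
    print(json.dumps(record))  # stand-in for writing to a log store
    if shadow:
        return None            # no impact on decision making
    return prediction          # live: surfaced in the worklist

result = handle_exam("exam-001", images=None, model=lambda _: 0.03)
print(result)  # None: in shadow mode, radiologists decide as before
```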
Beyond that, we’re also running quite a few reader studies locally in different markets, to really understand how we should calibrate thresholds as well. Because in the end, the goal is to improve with AI upon the baseline that radiologists set in the various different markets, but the baseline might be at very different points. In Germany, the screening program existed for 15 to 20 years before AI, and German radiologists are required to read at least 5,000 cases per year. This might not be the same in other settings. So really understanding the baseline, and understanding how we can tune the AI to have the biggest impact, is also very important to us.
[0:21:29] HC: How did you go about figuring out the best way to integrate your technology into the clinical workflow? At what point in the process of a radiologist reviewing images does it fit? How did you figure out the right place to integrate it?
[0:21:42] SB: Yes. I think that comes back down to making sure the AI is only used when it really is confident. There is a certain baseline: the radiologists perform at that level, and then the AI can either make a decision for a case or not. The goal is, of course, to use AI precisely for those cases where it is better, and then leave the rest for the two radiologists. I think that’s crucial. Then, it’s also about monitoring that prospectively, having feedback loops, and having dedicated local operations teams, which are very close to the radiologists and give us precisely that feedback. Is the AI actually helping the radiologists as much as we so nicely simulated retrospectively, or are there issues in practice stopping them from utilizing the AI to the biggest advantage? That’s always very crucial to us as well.
[0:22:32] HC: Is trust in your machine learning models important for radiologists? And if so, how do you go about building this trust?
[0:22:38] SB: Exactly. I think it’s fundamentally important, because as long as you’re in the setting where AI is a decision support tool, the impact the AI can have is always limited by the amount of trust the radiologist has. The AI can make as many recommendations as it wants; if radiologists fundamentally don’t trust the AI and its recommendations, then you basically get back to baseline performance, right?
We have quite a few things that we call trust-building exercises with the radiologists. It starts even before going live, with letting radiologists have a say in how we configure and calibrate. There might be radiologists who prefer to reduce their false positive rates, and there might be radiologists who prefer to reduce their false negative rates. Those are two different settings that you can accommodate by calibrating the AI. Even post go-live, after radiologists have used the AI for a quarter, half a year, a year, we keep constantly looking at cases, especially cases where the AI and radiologists disagree, and learn from both sides. We always know the biopsy result; we know who was right. There’s learning in there for us as an AI company, to further improve our models. But there’s also learning and trust building in there for the radiologists, for the cases where the AI was right. Creating the platform for those discussions is crucial for trust building.
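The configure-and-calibrate step Stefan describes amounts to choosing an operating threshold on held-out scores. In this illustrative sketch with synthetic data, a higher threshold trades sensitivity for fewer false positives, and vice versa:

```python
import numpy as np

# Synthetic held-out scores; the distributions are made up for illustration.
rng = np.random.default_rng(0)
scores_neg = rng.beta(2, 8, 5000)  # model scores on normals
scores_pos = rng.beta(8, 2, 300)   # model scores on cancers

def operating_point(threshold):
    """False positive rate and sensitivity at a given threshold."""
    fpr = (scores_neg >= threshold).mean()
    sens = (scores_pos >= threshold).mean()
    return fpr, sens

# Site A wants fewer false positives; site B wants fewer missed cancers.
for name, thr in [("site A (fewer FPs)", 0.8), ("site B (fewer FNs)", 0.3)]:
    fpr, sens = operating_point(thr)
    print(f"{name}: threshold={thr} fpr={fpr:.3f} sensitivity={sens:.3f}")
```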
[0:24:05] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[0:24:09] SB: Sure. For us, first and foremost, it was really going deep into the domain. Domain really is crucial. This problem shouldn’t be approached as a purely technical “let’s build a model” problem. In order to have the biggest impact, in our case, you need to understand breast cancer screening really, really well, with all the different tradeoffs that are made at various parts of the screening process.
So the first advice I would give is definitely to go very deep on the domain and not treat it as a purely technical problem. The second, related to that, is to really focus on the actual integration, on the actual use cases. A prediction isn’t a use case. A use case is, in our case, for example, filtering the huge number of normals down to a much, much smaller subset, such that radiologists can focus their time on the harder cases. That means working on the integration, investing in it, and working on the clinical workflow as well.
For us, it’s proven to be the right path. Then lastly, even after deploying, monitoring is crucial. With these AI systems, there can be distribution shifts, and there can be unexpected things you’ve never seen in training. Constantly monitoring your AI, having the means to do that, and being able to react quickly is, I think, also crucial.
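One common way to monitor for distribution shift, sketched here on synthetic scores, is the population stability index between a reference score distribution (for example from Shadow Mode) and live production scores. The PSI > 0.2 rule of thumb and the binning are illustrative choices, not Vara’s actual monitoring stack.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population stability index between a reference score
    distribution and live production scores. A common rule of thumb
    flags PSI > 0.2 as meaningful drift worth investigating."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    o = np.histogram(observed, edges)[0] / len(observed)
    e, o = np.clip(e, 1e-6, None), np.clip(o, 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(1)
reference = rng.beta(2, 8, 10_000)          # scores at validation time
live = rng.beta(3, 6, 10_000)               # shifted production scores
print(f"PSI = {psi(reference, live):.3f}")  # large value -> investigate
```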
[0:25:28] CL: Maybe one thing to add, which was confirmed for me in entering that world, and which is maybe especially interesting for younger AI-powered startups: very initially, I personally found it quite challenging coming from a scientific background. Basically, you do R&D work, but the business world sometimes runs much faster. Both actually live in a world of uncertainty, though: the business model has to be figured out, as well as the models with which the core problem has to be solved.
On all these levels, it tremendously helps to embrace a scientific working attitude. Create alignment on what you know and what you don’t know, and place your bets in agreement with your team members. Then work towards that, gradually reducing uncertainty by doing experiments to figure out, okay, how much more data might we actually need, et cetera. Gradually, that became part of the DNA of Vara. Maybe it hasn’t always been the case, but it has quite naturally come to be the case, I think.
[0:26:32] HC: Those are both great pieces of advice. Finally, where do you see the impact of Vara in three to five years?
[0:26:39] SB: Yeah, it’s a really good question. Obviously, it’s also a question we’re thinking about a lot. Something we see right now, which we definitely weren’t sure about when we started the company, is that automation, at least for a subset of cases, is definitely on the horizon. For example, in Germany, each mammogram exam is inspected by two radiologists independently, and if one of them finds it suspicious, there is the follow-up process that I described earlier. I think it’s definitely on the horizon that, for example, one of these two reads, at least for a subset of cases, can partly be taken over by an AI, to the benefit of everyone involved, because radiologists can then focus on the harder, more tricky cases.
There’s more and more evidence coming out now, also internationally, and not only from us, that AI in this very specific sub-niche of breast cancer screening can definitely automate a certain part of the radiology workload in the future. It wasn’t always that clear to us, but I think it will come. There might even be a time in the future where it might not be ethical to not use AI, just because the evidence is pointing more and more in the direction of using it.
Especially seeing that, even in very mature and very developed screening markets, like the Western European markets we’re looking into, this works, a core vision for us always was to also scale this expertise to markets where it is missing. Where there is a certain lack of experienced radiologists that stops a screening program from being established, in countries that don’t have a screening program yet, but are basically at the point where it would make sense. So I think scaling this expertise is definitely something we foresee and really want to work on in the next three to five years.

[0:28:35] CL: Obviously, I hope that this will be an example of AI for health being a success. There’s so much happening in the field, and the field has changed tremendously in the last few years. I agree that automation will be a big topic. Ultimately, however, it’s still somewhat up to the user. We’ve seen this recently with the hype about autonomous driving in California, and then some pictures circulating. At some point, you always have to hand over to a human, and what you present has to be of high quality. Definitely interesting times ahead.
[0:29:06] HC: This has been great. Stefan and Christian, I appreciate your insights today. I think this will be valuable to many listeners. Where can people find out more about you and Vara online?
[0:29:16] SB: Yes. Our page is vara.ai. V-A-R-A .ai. I invite people to check it out and reach out to us.
[0:29:23] HC: Perfect. Thanks for joining me today.
[0:29:25] SB: Thank you, Heather.
[0:29:27] CL: Thank you very much.
[0:29:28] HC: All right, everyone. Thanks for listening. I’m Heather Couture. I hope you’ll join me again next time for Impact AI.
[OUTRO]
[0:29:38] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.
[END]