What if a routine endoscopy could do more than just detect disease, actually predicting treatment outcomes and advancing precision medicine? In this episode of Impact AI, Matt Schwartz, CEO and Co-Founder of endoscopy video management and AI analysis platform Virgo, discusses how AI and machine learning are transforming endoscopy.
Tuning in, you’ll learn how Virgo’s foundation model, EndoDINO, trained on the largest endoscopic video dataset in the world, is unlocking new possibilities in gastroenterology. Matt also shares how automated video capture, AI-powered diagnostics, and predictive analytics are reshaping patient care, with a particular focus on improving treatment for inflammatory bowel disease (IBD). Join us to discover how domain-specific foundation models are redefining healthcare and what this means for the future of precision medicine!
Key Points:
- An introduction to Matt Schwartz and Virgo’s mission.
- The importance of video documentation in endoscopy and its impact on healthcare.
- Machine learning’s role in automating endoscopic video capture and clinical trial recruitment.
- Building the EndoDINO foundation model to unlock endoscopy data for precision medicine.
- Data collection: the process of gathering 130,000+ procedure videos for model training.
- Foundation model development using self-supervised learning and DINOv2.
- Model development challenges, from hyper-parameter tuning to domain-specific adjustments.
- Applying EndoDINO to predict inflammatory bowel disease (IBD) treatment responses.
- Commercializing EndoDINO through licensing to health systems and pharma companies.
- The future of foundation models in endoscopy: expanding applications beyond GI diseases.
- Advice for AI startup founders to prioritize data capture as a foundation for AI success.
- Insight into Virgo’s vision to transform IBD treatment and preventative care.
Quotes:
“There's a massive amount of endoscopic video data being generated across a wide range of endoscopic procedures, and nobody was capturing that data – [Virgo] realized early on that endoscopy data could hold the key to unlocking all sorts of opportunities in precision medicine.” — Matt Schwartz
“With the foundation model paradigm, you can compress a lot of heavy compute needs into a single model and then build different applications on top of the foundation. This is going to have a positive impact on the clinical deployment of foundation models.” — Matt Schwartz
“Our foundation model can turn something like a routine colonoscopy into a precision medicine screening tool for IBD patients.” — Matt Schwartz
“There are a lot of untapped data resources in healthcare. If a founder can build a first product that is the data capture engine, it will set them up for a ton of future success when it comes to AI development.” — Matt Schwartz
Links:
Virgo
Matt Schwartz on LinkedIn
Matt Schwartz on X
EndoML
Introducing EndoDINO: A Breakthrough in Endoscopic AI
LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.
[INTRODUCTION]
[0:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven machine learning-powered company.
This episode is part of a mini-series about foundation models. Really, I should say domain-specific foundation models. Following the trends of language processing, domain-specific foundation models are enabling new possibilities for a variety of applications with different types of data, not just text or images. In this series, I hope to shed light on this paradigm shift, including why it’s important, what the challenges are, how it impacts your business, and where this trend is heading. Enjoy.
[INTERVIEW]
[0:00:49] HC: Today, I’m joined by guest, Matt Schwartz, CEO and Co-Founder of Virgo, to talk about a foundation model for endoscopy. Matt, welcome to the show.
[0:00:57] MS: Thanks so much. It’s great to be here. I’ve enjoyed following the foundation model series.
[0:01:02] HC: Matt, could you share a bit about your background and how that led you to create Virgo?
[0:01:05] MS: Yeah, sure thing. I’m a biomedical engineer by training and spent my career in medical device product management. Immediately prior to founding Virgo, I was a product manager at Intuitive Surgical, working on their next-generation Da Vinci robotic surgery platform. This was back in 2015, and I started to become obsessed with machine learning, and particularly computer vision. Pretty early on, as I was training some toy models, the light bulb went off when I realized there’s a massive amount of endoscopic video data being generated across a wide range of endoscopic procedures, and nobody was capturing that data. That was really the spark for Virgo. We realized early on that endoscopy data could hold the key to unlocking all sorts of opportunities in precision medicine. We just needed to get started by capturing the data.
[0:01:54] HC: What does Virgo do, and why is it important for healthcare?
[0:01:57] MS: Virgo now provides the leading video capture and management platform for endoscopy. We’ve helped leading medical centers, like the Cleveland Clinic, Mount Sinai, UMass, and the University of Chicago capture nearly 2 million procedure videos and counting, and to our knowledge, that’s the largest data set of its kind in the world. I’m sure we’ll get into the impact of foundation models and what we’re doing there in precision medicine. But we actually think that just this improvement in documentation is really important for healthcare generally.
When I talk to people, they’re often surprised to learn that if they go in for a routine colonoscopy, or upper endoscopy, unless their health system has Virgo installed, the doctor doesn’t actually save the video of their procedure. They may only save a handful of still images that don’t necessarily tell a complete story of the procedure.
In the past, it was technically and economically impractical to save everybody’s complete endoscopy video, but Virgo has changed that, and we’re really proud to be changing the paradigm in endoscopy toward recording complete video documentation of every procedure by default. To give you some examples of why this is important just from a documentation perspective: we now routinely learn of cases where physicians refer back to a prior procedure video to evaluate a patient’s progress. Sometimes doctors are referring a patient to surgery, and the endoscopy video actually helps the surgeon with preoperative planning. We’ve even learned of cases where a saved endoscopy video in Virgo helped a surgeon determine that a cancer patient who was previously deemed non-operable could in fact receive lifesaving surgery. So, before all of the exciting things we’re doing on the AI front, we think just improving overall documentation in healthcare is really important.
[0:03:41] HC: Well, all the data that you’re collecting, all the images and other associated metadata, is of course the fundamental resource needed for machine learning. What role does machine learning play in your technology?
[0:03:51] MS: Yeah. So, from the earliest days of Virgo, we recognized that we needed to automate the video capture process if we wanted to achieve universal capture of all procedures. Doctors are obviously very busy, the entire clinical team is very busy, and no one has the bandwidth to be focused on video capture. It turns out, determining when a procedure starts and stops isn’t always obvious or trivial, and it’s actually a great use case for machine learning. We developed a patented process for automating the start and stop of video capture using machine learning, and we call this Auto Procedures. This works on our video capture devices and automates video recording for physicians.
We then later developed a machine learning system, a model called Auto IBD, that analyzes entire endoscopy videos from across a health system and flags patients who are potential candidates for inclusion in IBD clinical trials based on their endoscopic presentation. This allows a health system to easily identify patients who otherwise might have slipped through the cracks and hopefully, present them with an opportunity to participate in clinical research. Then, now that we’ve developed our first foundation model for endoscopy, we’re starting to get into some really interesting next-generation precision medicine use cases, but that’s a little overview of where machine learning plays a role in Virgo today.
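As an illustration of the start/stop idea (not Virgo’s patented method, whose details aren’t public), here is a minimal sketch: assume a frame classifier emits a per-frame probability that the scope is inside the body, and trigger recording only on a sustained run of in-body frames so momentary flicker doesn’t start or stop the capture. The function name, threshold, and run length are all hypothetical.

```python
def detect_procedure_bounds(in_body_probs, threshold=0.5, min_run=3):
    """Find (start, end) frame indices of the procedure from per-frame
    'inside the body' probabilities. A frame run must stay above the
    threshold for min_run consecutive frames before it counts, which
    debounces brief misclassifications. Returns (None, None) if no
    sustained in-body run is found."""
    flags = [p >= threshold for p in in_body_probs]
    start = end = None
    run = 0
    for i, is_in_body in enumerate(flags):
        run = run + 1 if is_in_body else 0
        if run >= min_run:
            if start is None:
                start = i - min_run + 1  # first frame of first sustained run
            end = i                      # last frame of last sustained run
    return start, end
```

In practice the per-frame probabilities would come from a small image classifier running on the capture device; the debouncing logic is the part sketched here.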
[0:05:13] HC: Let’s talk some more about the foundation model. Why did you build a foundation model?
[0:05:17] MS: Well, fundamentally, we think endoscopy is a vastly underutilized data resource in healthcare. Endoscopic procedures, they can range from three minutes to three hours. But in all these cases, there’s a tremendous amount of high-dimensional and information-rich data that gets generated. Historically, this data would only be consumed by the physician actually performing the endoscopy, so that they could make real-time decisions. But we think there’s a lot of information contained within an endoscopy about a patient’s phenotype. This covers a wide range of anatomy, all the way from the mouth to the colon. The GI tract is heavily vascularized and innervated. For certain diseases, like inflammatory bowel disease, it’s already well-known that there are visual morphologic changes that occur when a patient has a specific disease.
We’re doing a lot of thinking about this. Ultimately, we believe that a foundation model for endoscopy is going to have two key benefits. The first is that it’s going to allow for the extraction of this type of rich information that we think can lead to all sorts of precision medicine discoveries. Some of these discoveries might support, or replicate diagnostic work that is currently performed by physicians. We think there’s actually a real chance that it’s going to unlock capabilities that even exceed what’s possible with the eye of a human expert physician.
The second benefit of a foundation model is that we think it can help to standardize the development of downstream AI applications. Gastroenterology has actually been a really rich field for AI in medicine. I think the last time I saw GI actually leads all clinical specialties for the number of AI-related randomized controlled trials. Right now, all of these solutions in gastroenterology are being developed in silos. With the foundation model paradigm, you can compress a lot of the heavy compute needs into a single model and then build all sorts of different applications on top of the foundation. We think this is going to have a really positive impact on the actual clinical deployment of foundation models, where you may want to run polyp detection and Barrett’s esophagus detection, and who knows what other additional models all at the same time?
[0:07:30] HC: How did you gather the data that you need to train this foundation model and how much data did you need to collect?
[0:07:36] MS: We call our foundation model EndoDINO. It’s really the result of the work we started doing with video capture way back in 2017. Endoscopy data, it’s somewhat unique in the healthcare landscape in that the video data just doesn’t exist anywhere else in the medical record. So, traditionally, folks weren’t actually capturing the video data. Our video capture platform, Virgo Cloud has been the driver for foundation model data. We offer Virgo Cloud as a product that physicians can use to gather and actually utilize their own data in clinical practice, or research, or training. As a byproduct, it generates this really massive data set. We are coming up on 2 million procedure videos recorded and growing close to a million procedures per year.
For version one of our foundation model, we wanted to prove the concept and demonstrate the initial capabilities. We used a subset of just over 130,000 procedure videos, which roughly equates to three and a half billion total frames of endoscopy. Then from that, we used some special techniques to curate the data into data sets ranging from 100,000 frames all the way up to 10 million frames. Those are the frames we actually used in the foundation model development.
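Virgo hasn’t published its curation techniques, but a simple baseline for distilling billions of frames into a fixed-size training set is capped, evenly spaced temporal sampling per video, so long procedures don’t dominate and adjacent near-duplicate frames are skipped. A hedged sketch, with the function name and sampling policy as assumptions:

```python
import random

def curate_frames(video_frame_counts, target_size, seed=0):
    """Subsample frame indices across videos into a dataset of at most
    target_size frames. Each video contributes at most an equal share,
    sampled at evenly spaced (jittered) temporal positions to reduce
    near-duplicate consecutive frames.

    video_frame_counts: {video_id: number_of_frames}
    Returns a list of (video_id, frame_index) pairs."""
    rng = random.Random(seed)
    per_video_cap = max(1, target_size // len(video_frame_counts))
    dataset = []
    for video_id, n_frames in video_frame_counts.items():
        k = min(per_video_cap, n_frames)
        stride = n_frames / k
        for i in range(k):
            # jitter within each stride window to avoid a fixed phase
            idx = min(n_frames - 1, int(i * stride + rng.random() * stride))
            dataset.append((video_id, idx))
    return dataset
```

Real curation pipelines often add embedding-based deduplication and quality filtering on top of temporal sampling like this.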
[0:08:51] HC: This data is the first key component to building a foundation model. What else goes into building one?
[0:08:56] MS: Yeah. Architecture selection is obviously a huge component here. We are a relatively small team here at Virgo. This is our first effort at building a foundation model. That had us stepping out into the frontier of self-supervised vision modeling. We are very grateful for the active open-source community that exists in self-supervised computer vision. We’ve even had the chance to directly collaborate with some of the pioneering research labs in the field. At Virgo, we’re not really focused on pushing the frontier from an architecture perspective. We’re much more interested in applying this unique data set that we have to known architectures.
With that, we learned a ton along the way about how to leverage some of these key architectures. We’ve been really focused on working with the DINOv2 training paradigm, and it has worked really well. I think there are certain things where we’ve just found that you have to make adjustments based on the specific needs of our domain. There’s been a lot of learning along the way. Yeah, I think the key ingredients for us are the data, the architecture in DINOv2, and then figuring out how to mesh those together to get some good outcomes.
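EndoDINO itself isn’t public, but the standard way to use a DINOv2-style self-supervised backbone is to freeze it and train a lightweight head on its embeddings. As a self-contained sketch, here is a ridge-regularized linear probe in NumPy standing in for that head; in the real pipeline the inputs would be frozen ViT features from the backbone rather than synthetic vectors, and all names here are illustrative.

```python
import numpy as np

def fit_linear_probe(features, labels, l2=1e-3):
    """Fit a ridge-regularized one-vs-all linear classifier on frozen
    embeddings -- the common 'linear probe' protocol for evaluating a
    self-supervised backbone without fine-tuning it."""
    n, d = features.shape
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)  # one-hot targets
    X = np.hstack([features, np.ones((n, 1))])               # append bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d + 1), X.T @ Y)
    return classes, W

def probe_predict(classes, W, features):
    """Predict class labels for new frozen embeddings."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return classes[np.argmax(X @ W, axis=1)]
```

The appeal of this setup is exactly what Matt describes: the heavy compute lives in the frozen foundation model, and each downstream task only needs a cheap head like this one.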
[0:10:08] HC: What are some of the challenges you encountered in building this foundation model?
[0:10:12] MS: I think a lot of it has been around hyperparameter tuning and some key decision-making around that. When we’re working with images, we have to decide what size of images to use. There’s some good literature out there for other domains that are analogous to varying degrees, and that helped guide our decision-making process. But the vast majority of the literature on hyperparameter tuning and model training is for natural images. There are, I think, carryovers to what we do, but in certain cases, there are key differences that are really important to understand.
I think a lot of it is just trial and error and experimentation to figure out what sorts of hyperparameters and specific decisions around model setup are going to be most meaningful for us.
[0:10:59] HC: This may sound like a basic question, but how do you know whether your foundation model is even good?
[0:11:05] MS: It’s actually a really good question and one that we spend a lot of time talking about internally. For us, the key is to understand which downstream tasks matter the most. That helped guide our design of training the foundation model, but also selection of checkpoints once we were training the model. We specifically looked at some of the standard benchmark tasks in gastroenterology, which include things like, classifying different anatomical landmarks, segmenting polyps, and scoring disease severity in ulcerative colitis.
One of the things we saw very clearly is that the training loss of the foundation model did not perfectly correspond with performance on these downstream task evaluations. It was important for us to build a pipeline where we could train the foundation model, and shortly thereafter, evaluate against these different benchmarks so that we could pick the optimal checkpoint. I think this process will continue to evolve as we uncover some new sorts of downstream tasks that we want to evaluate on. It’s also really important to have a well-defined evaluation task and evaluation data set to test on.
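The checkpoint-selection loop Matt describes can be sketched in a few lines: score each saved pretraining checkpoint on a suite of downstream benchmarks and keep the best by aggregate score, rather than trusting the pretraining loss. The benchmark names and the equal-weight mean are illustrative assumptions.

```python
def select_checkpoint(checkpoints, benchmarks):
    """Pick the pretraining checkpoint that performs best on downstream
    tasks (e.g. anatomical landmark classification, polyp segmentation,
    ulcerative colitis severity scoring), since pretraining loss alone
    doesn't track downstream performance.

    checkpoints: {name: model}
    benchmarks:  {task_name: fn(model) -> score, higher is better}
    Returns (best_checkpoint_name, best_mean_score)."""
    best_name, best_score = None, float("-inf")
    for name, model in checkpoints.items():
        mean_score = sum(fn(model) for fn in benchmarks.values()) / len(benchmarks)
        if mean_score > best_score:
            best_name, best_score = name, mean_score
    return best_name, best_score
```

A weighted mean, or a veto on any single regressing task, are natural variants depending on which downstream tasks matter most.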
[0:12:20] HC: How are you currently using your foundation model and how do you plan to use it in the future?
[0:12:25] MS: The thing we’re most interested in right now is actually whether we can use our foundation model to analyze the baseline colonoscopy of a patient with either Crohn’s disease, or ulcerative colitis and predict their likelihood of responding versus not responding to either placebo, or a specific drug. We’ve got access to some clinical trial data from a phase three clinical trial for ulcerative colitis from an approved IL inhibitor. Our early results are really promising here, suggesting that we can in fact predict likelihood of placebo response, or response to this specific drug. We think this is incredibly exciting.
There’s a nice analogy to oncology, where I think it’s now over 50%, maybe over 60% of new drugs are approved with some companion diagnostic. Say, a genetic screening that the patient needs to pass to be eligible for the drug. But this is very much not the case for inflammatory bowel disease. We unfortunately don’t really have good precision medicine tools in IBD, and the results reflect this pretty clearly.
Even for the best IBD drugs, only 40% or maybe 50% of patients actually achieve remission in year one. The placebo rates are also as high as 20% or 30%. There isn’t a ton of treatment efficacy and this has all sorts of downstream impacts from high trial failure rates. Most importantly, just the drugs not working very well for patients. The tricky thing is we don’t really know why certain patients respond and others don’t, and we think our foundation model can turn something like a routine colonoscopy into a precision medicine screening tool for IBD patients. We think there’s actually some exciting future applications even beyond IBD as well.
[0:14:22] HC: How do you plan to commercialize your foundation model?
[0:14:25] MS: Right now, we’re making the foundation model EndoDINO available to select health systems that already use Virgo for video capture. We’re making it available through a platform called EndoML. EndoML is a full-stack AI development tool that’s meant to help clinicians who maybe don’t have as much experience with training machine learning models to run quick experiments and actually train and deploy their own AI models on top of EndoDINO. Our foundation model does the heavy lifting. The physicians bring their expertise to the table and their desire to build downstream models and we can put those two things together.
Physicians might be interested in building their own model for something like, real-time polyp detection, or Barrett’s esophagus detection, or even maybe a more niche application, like procedure quality assessment, and EndoDINO and EndoML are tools to help them achieve that. We are also licensing EndoDINO to select pharmaceutical companies who can process their own clinical trial data to develop precision medicine models for their specific drugs, very similar to the work that we’re doing with some past clinical trial data.
[0:15:34] HC: Are there any lessons you’ve learned in developing foundation models that could be applied more broadly to other types of data?
[0:15:40] MS: I think the thing that we’ve definitely learned to be most focused on is just the power of self-supervised learning and working with unlabeled data. We’re definitely biased in this sense, because the preponderance of data that we have is not annotated. But I think a lot of times, folks will get hung up on the pristine quality of their data and the need for highly refined annotations, and I think we’re seeing in more and more domains that you can really benefit from foundation models that are pre-trained on just massive quantities of unannotated data.
We’ve just been leaning into that more and more, focusing on areas where self-supervised learning can play a big role. Then, I think the other thing is just to look for analogous work in the literature. Like I said, we’ve benefited from a lot of the open-source work that’s out there and then we learned a lot from groups that were applying DINOv2 in radiology and pathology. And so, we’re always on the hunt for analogous work that can help to shortcut some of our design and decision making.
[0:16:48] HC: What does the future of foundation models for endoscopy look like?
[0:16:51] MS: EndoDINO was definitely a version one for us. We actually only used about 5% of our overall data set, so there’s absolutely an immediate opportunity to scale things up. We also have some, I think, pretty interesting ideas about how to unlock new capabilities for version two, and part of that is around refining the architecture and the data curation. I suspect that the future for EndoDINO actually looks something like turning a routine endoscopy into a rich source of biomarkers for a wide range of diseases. We’ve been really inspired by the work that’s been done in retinal imaging foundation models. We think that the GI tract could actually help detect biomarkers and even predict outcomes related to a wide range of diseases that go beyond just GI.
We think there’s a pretty clear opportunity to investigate signals in obesity, type 2 diabetes, chronic kidney disease, liver disease, and even neurodegenerative disorders. As we continue to add more pre-training data and just increasingly train foundation models that are more and more powerful, those are the sorts of opportunities that we think will open up.
[0:18:04] HC: Thinking, perhaps, beyond foundation models and more towards your role as a founder, is there any advice you could offer to other leaders of AI-powered startups?
[0:18:12] MS: Yeah, I think the data is key. It’s obvious, but I see a lot of startup founders who have some great AI concept they want to develop, but skip the step of figuring out where the data is going to come from. I think in healthcare in particular, there are a ton of opportunities to become your own data capture engine. A lot of people focus on data that exists today, whether it’s in the medical record, or pathology data, or radiology data, but I’m a pretty big believer that there are a lot of untapped data resources in healthcare. This was the path we took with Virgo. Way back in 2017, we saw an opportunity to build a data capture engine in endoscopy. I think if a founder can build a first product that actually is the data capture engine, it will set them up for a ton of future success when it comes to AI development.
[0:19:11] HC: Finally, where do you see the impact of Virgo in three to five years?
[0:19:14] MS: I’d say, we’re very excited about this opportunity in inflammatory bowel disease. It’s just such a massive chronic condition that is crying out for precision medicine and better treatment options for patients. It will take time for the work we’re doing now to actually make its way through clinical validation and then actual clinical adoption. I think there’s a real chance that we can have a huge impact on the field of IBD research and treatment in the next three to five years.
At the same time, we, like I said, would love to see routine procedures, like upper endoscopies and colonoscopies being used for much more than just say, colorectal cancer screening where it’s used today. These procedures are incredibly high volume. If we can use them to help inform long-term preventative care, we think that’d be a huge advance for the field.
[0:20:04] HC: This has been great, Matt. I appreciate your insights today. I think this will be valuable to many listeners. Where can people find out more about you online?
[0:20:12] MS: Yeah, sure thing. People can feel free to reach out on X @MattZSchwartz. Feel free to find me on LinkedIn, or just shoot me an email, just [email protected].
[0:20:24] HC: Perfect. We’ll link to all of those in the show notes. Thanks for joining me today.
[0:20:28] MS: Thanks so much, Heather. This was fun.
[0:20:30] HC: All right, everyone. Thanks for listening. I’m Heather Couture and I hope you join me again next time for Impact AI.
[END OF INTERVIEW]
[0:20:39] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.
[END]