Zelda Mariet, Co-Founder and Principal Research Scientist at Bioptimus, joins me to continue our series of conversations on the vast possibilities and diverse applications of foundation models. Today’s discussion focuses on how foundation models are transforming biology. Zelda shares insights into Bioptimus’ work and why it’s so critical in this field. She breaks down the three core components involved in building these models and explains what sets their histopathology model apart from the many others being published today. They also explore the methodology for properly benchmarking the quality and performance of foundation models, Bioptimus’ strategy for commercializing its technology, and much more. To learn more about Bioptimus, their plans beyond pathology, and the impact they hope to make in the next three to five years, tune in now.


Key Points:
  • Who is Zelda Mariet and what led her to create Bioptimus.
  • What Bioptimus does and why it’s so important.
  • Why their first model announced was for pathology.
  • Zelda breaks down three core components that go into building a foundation model.
  • How their histopathology foundation model differs from the many other models published to date.
  • Their methodology behind properly benchmarking how well their foundation model performs.
  • Different challenges they’ve encountered on their foundation model journey.
  • How they plan to commercialize their technology at Bioptimus.
  • Thoughts on whether open source is part of their long-term strategy for the model, and why.
  • Developing a product roadmap for a foundation model.
  • She shares some information regarding their next step, beyond pathology, at Bioptimus.
  • The importance of understanding what kind of structure you want to capture in your data.
  • Where she sees the impact of Bioptimus in the next three to five years.

Quotes:

“Working on biological data became a little bit of a fascination of mine because I was so instinctively annoyed at how hard it was to do.” — Zelda Mariet

“Bioptimus is building foundation models for biology. Foundation models are essentially machine learning models that take an extremely long time to train [and] are trained over an incredible amount of data.” — Zelda Mariet

“There are two things that are well-known about foundation models: they’re hungry in terms of data and they’re hungry in terms of compute.” — Zelda Mariet

“On the philosophical side, science is something that progresses as a community, and as much as we have, what I would say is a frankly amazing team at Bioptimus, we don’t have a monopoly on people who understand the problems we’re trying to solve. And having our model be accessible is one way to gain access into the broader community to get insight and to help people who want to use our models, get insight into maybe where we’re not doing as well that we need to improve.” — Zelda Mariet


Links:

Zelda Mariet on LinkedIn
Zelda Mariet
Bioptimus


Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.


Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. This episode is part of a miniseries about foundation models. Really, I should say, domain-specific foundation models.

Following the trends of language processing, domain-specific foundation models are enabling new possibilities for a variety of applications with different types of data, not just text or images. In this series, I hope to shed light on this paradigm shift, including why it’s important, what the challenges are, how it impacts your business, and where this trend is heading. Enjoy.

[INTERVIEW]

[0:00:48.9] HC: Today, I’m joined by guest Zelda Mariet, co-founder and principal research scientist at Bioptimus, to talk about foundation models for biology. Zelda, welcome to the show.

[0:00:58.5] ZM: Thank you so much for having me, Heather.

[0:01:00.3] HC: Zelda, could you share a bit about your background and how that led you to create Bioptimus?

[0:01:04.3] ZM: Yes, happy to do so. So, I have a background in statistics and machine learning. I did my Ph.D. at MIT where I worked on mathematical representations of diversity, specifically thinking about things such as, “How does the data you train on impact model performance, how do you need to think about that when you build a model?” And after my Ph.D. I joined Google, where I worked mostly on the theory of machine learning.

But I actually was also brought in to do some 20% work on the protein engineering team, and the reason for that is it turns out that actually understanding how diversity impacts models you train is extremely important for things like protein engineering. As I was doing that 20%, I think I had a little bit of a “come to Jesus moment” in a sense, where I realized how difficult it can actually be to get a machine learning model to work well on data that is so much more messy than you know, the typical data that I was used to working with as a theorist.

And working on biological data became a little bit of a fascination of mine because I was so, you know, instinctively annoyed at how hard it was to do, and I became really fascinated with that topic. Over the course of the years, you know, LLMs became a mainstream topic, transformers kind of took over the entire machine learning world, and as that was happening, I essentially became convinced that, perhaps even more than the structure that exists in language, the structure that we see within biology is the question that can hopefully really be addressed with attention mechanisms, essentially.

And of course, you know, it’s somewhat trite to say but it’s true, it’s such an important question to solve. If you can understand biology, if you can speed up treatment discovery, and so on and so forth, the impact for humanity is also huge. So, when the opportunity came to be part of Bioptimus, it was a little bit of a no-brainer.

[0:02:54.1] HC: So, what does Bioptimus do and why is it so important?

[0:02:57.1] ZM: Great question. So, Bioptimus is building foundation models for biology. Foundation models are essentially machine learning models that take an extremely long time to train and are trained over an incredible amount of data, in our case, medical data, and the idea is that these models, once they’ve gone through this extremely intensive training process, which is also costly in terms of time and money, they can actually be used for a variety of downstream applications, right?

So, you could use them, for example, for drug discovery, but you could also use them for things like cancer subtyping and so on and so forth. So, what Bioptimus is doing is creating these foundation models, and what we do that may be a little bit unusual, compared to the rest of, you know, this environment, is that we build foundation models that connect all of the scales and modalities of biology.

People have seen foundation models that have been built specifically for proteins, specifically for computational pathology, specifically for single-cell data. What we’re doing is we’re trying to connect all of these different modalities together so that we can essentially have a foundation model that will be holistic, that can transform biological discovery, and accelerate breakthroughs in medicine.

And in terms of why doing so is important, I’m not going to go into, you know, why we believe accelerating medical research is important. I think this is self-explanatory, but obviously, I’m happy to elaborate. I’d say what’s important in terms of what we’re doing specifically is really the multimodal, multi-scale aspect, where we’re trying to connect biology at all of its different levels and all of its different scales.

Because if you think about biology and you’re only looking at a single modality on its own, fundamentally, what you’re doing is blinding yourself. If you’re only looking, for example, at proteins, right? A mutation is going to have, of course, a fundamental consequence on how the protein looks and how it interacts with its environment, but that mutation can also have downstream consequences in terms of how easy the protein is going to be to create, how easy it is to synthesize, what side effects it’s going to have, and so on and so forth.

This is something I saw working on protein engineering, where it’s very – well, I’m not going to say very easy, but comparatively, it’s quite easy to say, “You know, I want to improve this antibody. I want it to have this specific binding structure,” for example. But what you have to keep in mind also is that as you make these modifications to improve binding, for example, it might change whether or not it’s actually possible to create this molecule in a laboratory and actually, you know, deploy it as a treatment for the general public.

If you’re able, on the other hand, to actually take all of biology into account as you’re doing these manipulations, all of a sudden, you don’t have to pretend that these problems don’t exist. You can actually really target them at the same time, and take all of the aspects of biology, from, you know, patient allergies to synthesizability and so on and so forth, as a single process essentially, and that’s something that we can do now with multimodality that is not really doable without that perspective.

[0:06:14.9] HC: The first model that you’ve announced is for pathology, why was this the starting point?

[0:06:18.5] ZM: Great question. For a couple of reasons. I’d say, first of all, pathology is one of the crucial pillars that we need if we want to attack multimodality. It’s also something that a lot of our initial cofounding team had not really worked on before, so it was also, you know, a great way for us to build up our skillset. It was also a test of our infrastructure. To be clear, we wanted to see that what we’d been building for the first couple of months was robust enough to support histopathology in particular, because pathology has a structure in its data that is, I would say, very particular, especially to me coming from more of a machine learning background. The data that you work with is so heavy, the images are so large, it was a really great way to stress test everything we had built. And additionally, it was a great proof of concept in terms of, you know, what our team can accomplish, where we’re going, where we’re headed.

And it’s also a field where we did have some sense of how we can evaluate our model, you know, what a good performance is going to be, versus not. So, for all these reasons, histopathology was a straightforward choice, I would say.

[0:07:29.2] HC: What does it take to build a foundation model? What are the core components and the important pieces that go into this?

[0:07:35.3] ZM: I’d say that there are three major components, and then obviously, these components can be broken down even further. But, obviously, data. If you want to build a foundation model, you need to have the data to feed it. So, in our case, it’s going to be medical data. You were talking earlier about our foundation model for pathology, H-optimus-0. For that, it was histopathology slides, H&E slides, essentially.

The second aspect is compute. There are two things that are well-known about foundation models: they’re hungry in terms of data and they’re hungry in terms of compute. The infrastructure that’s required, the computational power, and, very plainly, the cost that gets sunk into GPUs is incredible, and so you need to be able to access huge amounts of compute for extended periods of time with resiliency to failures, restarts, and so on and so forth.

And then, very prosaically, I’d say that the third aspect is talent. The talent that is able to build foundation models from the ground up, I would say, is still quite rare across the world. Having people on your team who actually have the specific skillset that’s required to understand the data and its specificities, to understand what can go wrong in the infrastructure, how to prevent it or how to be resilient to it, that talent is still very difficult to find and is one of the major bottlenecks.

[0:08:58.9] HC: In the histopathology area where you started, there are a number of other foundation models out there, more than a dozen published at this point. How is your foundation model different from the others?

[0:09:08.4] ZM: Great question, and indeed, there are quite a few other foundation models, and that’s also one of the reasons why we were able to benchmark our model. That was one of the reasons that we also looked at histopathology. But I’d say, in terms of differences, maybe the biggest one is that our model is the largest open-source model in computational pathology.

It is fully open source, under the Apache-2.0 license. I can also talk about, you know, how well it performs on a variety of different tasks, but if I were to pick one thing, I would say it’s its broad availability to the scientific community.

[0:09:42.0] HC: And you mentioned performance there. How do you know that your model is good? Should you just benchmark it on as many tasks as you can, or is there some methodology to figure out the proper way to benchmark it?

[0:09:52.4] ZM: That is a great question, and I would say that, first of all, the question of benchmarking foundation models in general, but in computational pathology as well, is actually very difficult. I would say it’s still in a lot of ways an open problem. There’s no universally accepted set of tasks, but there are a lot of tasks, right? And so, one thing that we did, obviously, is benchmark on public benchmarks.

We also had, of course, our own internal benchmarks that we looked at, but one of the advantages of being fully open source is that even if we do our own benchmarking, other people can do it as well. People can replicate our results, people can generalize them, and so one of the things that came out of having our model fully open source is that we saw some early benchmarks done by other labs, other companies.

For example, on a dataset called HEST, the Mahmood Lab at Harvard showed that we had competitive, actually state-of-the-art, performance in predicting gene expression from slides. There have been some other studies, including by the Thomas Fish Lab, that benchmark our model against competitors, and in a sense, it’s really nice to have this ability where we know that we’re not somehow reading the tea leaves into how our model performs.

By having this model open source, we can also rely on the general scientific community to see how well it does and how well they find it to perform on tasks of interest on their end as well.
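One common way labs benchmark a frozen foundation model of this kind, though not necessarily the exact pipeline Bioptimus or the labs mentioned above used, is a simple probe: keep the pretrained encoder frozen, extract embeddings for labeled examples, and train only a lightweight classifier on top. A minimal sketch with synthetic stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for embeddings from a frozen foundation model: in practice these
# would come from running tissue tiles through the pretrained encoder.
def embed(n_samples, dim=768, label=0):
    # Give each class a slightly shifted mean so the probe has signal to find.
    return rng.normal(loc=label * 0.5, scale=1.0, size=(n_samples, dim))

X_train = np.vstack([embed(200, label=0), embed(200, label=1)])
y_train = np.array([0] * 200 + [1] * 200)
X_test = np.vstack([embed(50, label=0), embed(50, label=1)])
y_test = np.array([0] * 50 + [1] * 50)

# Simple probe: the encoder stays frozen; classify by nearest class centroid.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
preds = dists.argmin(axis=1)
accuracy = (preds == y_test).mean()
print(f"probe accuracy: {accuracy:.3f}")
```

Because only the cheap probe is trained, the resulting score reflects the quality of the frozen embeddings themselves, which is why it is a popular yardstick across the many published pathology models.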

[0:11:16.3] HC: What are some of the challenges you’ve encountered in building foundation models?

[0:11:20.2] ZM: What are some challenges that I haven’t encountered? There’s some, I would say, obvious stuff. There are the standard infrastructure questions. As I said, you have to be resilient to things like GPUs and machines going offline, restarts, data corruption. There’s the very, very real problem of data storage. As I mentioned, slides are extremely large, and you have to pay a cost for storing them and for accessing them.

Then, when you’re talking about histopathology, QC is also really important. You want the data quality to be high enough that your model is going to actually learn something meaningful, so we spend a lot of time on QC as well. And then, we just talked about this, but I would say that understanding, or at least making an educated guess on, the right way to benchmark and evaluate model improvement is also extremely difficult.

We don’t really have an accepted yardstick, essentially, for, you know, “if you improve on this task, you’ve done it.” And there’s understanding which tasks are saturated, which tasks are prone to what we call overfitting. All of the minute work that comes with benchmarking a model, understanding whether or not it’s improving, I would say, is quite a challenge in general and also in our case for histopathology.

[0:12:36.6] HC: You mentioned open source already but what about the commercial side of this? Bioptimus is a for-profit company, so how do you plan to commercialize this technology?

[0:12:44.5] ZM: Great question. So, indeed, we do provide some of our foundation models for free for non-commercial use, and part of the reason for that is so that we can build a large community around foundation models in biology. However, we are planning on licensing our models to pharmaceutical and techbio companies, including agriculture, food, and energy, and really any business operating in the field of biological discovery.

In terms of what that looks like, in addition to licensing the model, we also plan on training or fine-tuning foundation models on specific customers’ data, as well as exploring specific downstream applications with privileged partner companies as well.

[0:13:28.6] HC: And on the open-source side, obviously these models are already open source. Will that continue as part of the long-term strategy?

[0:13:35.4] ZM: It is indeed, yes, and you know there is I’d say a philosophical reason behind that but also a very practical reason. On the philosophical side, science is something that progresses as a community and as much as we have what I would say is a frankly amazing team at Bioptimus, we don’t have a monopoly on people who understand the problems we’re trying to solve.

And having our model be accessible is one way to gain access to the broader community, to get insight, and to help people who want to use our models give us insight into maybe where we’re not doing as well and need to improve. But more prosaically, having a foundation model that’s open source also makes it more easily accessible, right? Someone who has used it because it’s open source is then more likely to want to use maybe a bigger version.

You know, we’ve released a smaller model, but our bigger models, which were expensive to train and to provide, will be under commercial license, and those users, if they’ve seen our smaller models perform well, are much more likely to want to actually go forth and set up a contract with us. So, there’s also that very real effect of, you know, setting a foot into the industry by providing a model that’s open source.

[0:14:51.5] HC: How do you go about developing your product roadmap for a foundation model? Is it any different than how you’d roadmap another type of AI product?

[0:14:59.2] ZM: So, let’s start by saying that I can take a stab at answering this question, but I am by no means an expert. I’d say developing the roadmap depends a lot, first of all, on understanding where the needs of the potential customers are. Right now, our roadmap looks something like releasing the first prototype of a foundation model that’s not histology-only.

But actually going to something that’s going to be multi-scale, so we can onboard our first clients and then, in collaboration with, you know, these first clients, these privileged partners, essentially develop more use cases and have a long-term vision whereby, having built these connections, we’re going to hopefully have a foundation model that will be the reference model, in a way that will underpin biomedical discoveries that are made by AI models and AI agents.

[0:15:49.4] HC: So, in going beyond pathology, you know, you’ve talked a little bit about how you think about what comes next with the different scales. Is there a next step that you can talk about or how do you think about what comes beyond pathology?

[0:16:01.6] ZM: We do have a next step. We are actually in the process of putting together our first multimodal foundation model. In terms of which modalities we’re going to be connecting, I should probably not go into too much detail, but what I can say is that when you want to go multimodal, what you need is not only the data that feeds, let’s say, the two separate modalities on their own.

So, you can’t just have the raw data for histology and the raw data for your second modality. You also need what’s called paired data, which is going to be data where you have, for, let’s say, the same patient, information across both modalities at the same time, and that data is obviously much harder to get access to. A lot of our choices are based in particular on what data exists and what technology exists to collect paired data.
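The distinction Zelda draws between raw per-modality data and paired data can be sketched in a few lines; the patient IDs, file names, and values below are purely hypothetical stand-ins:

```python
# Each modality on its own may be plentiful, keyed by a patient identifier.
histology = {"p01": "slide_p01.tiff", "p02": "slide_p02.tiff", "p03": "slide_p03.tiff"}
expression = {"p02": [0.1, 3.2, 0.7], "p03": [1.4, 0.0, 2.9], "p04": [0.5, 0.5, 0.5]}

print(len(histology), len(expression))  # 3 records per modality

# But the paired subset, where the SAME patient appears in both modalities,
# is the intersection of the two keysets, and it is typically much smaller.
paired_ids = sorted(histology.keys() & expression.keys())
paired = [(pid, histology[pid], expression[pid]) for pid in paired_ids]
print(paired_ids)  # ['p02', 'p03']: only two patients have both modalities
```

Only the paired rows can teach a model how the two modalities relate for the same patient, which is why the availability of paired data ends up driving modality choices.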

[0:16:53.8] HC: That sounds like a whole new level of complexity but I’m excited to see what comes there.

[0:16:58.9] ZM: It’s definitely, definitely been a challenge, and honestly, as a scientist, the challenge that comes from multimodality is really a fascinating one. So, it’s been a lot of fun scientifically for me to be tackling this kind of problem.

[0:17:13.1] HC: Are there any lessons you’ve learned in developing foundation models that could be applied more broadly to other types of data?

[0:17:18.5] ZM: Well, are you thinking about other types of data in biology or more broadly? Although, now that I think about it –

[0:17:24.8] HC: Yeah, any lessons that could be applicable to our listeners here regardless of what type of data they’re working with?

[0:17:30.6] ZM: Well, I’d say this is a little bit of, you know, my soapbox in general, but when you’re working with data, you want to understand what kind of structure you’re trying to capture in the data. In our case, it means understanding what there is in biology that is signal, how we think we can capture it, how we can build a model’s architecture specifically to work with the structure that exists in biology.

To think about the structure you have in biology, let’s, you know, think very simply about protein sequences. These are chains of amino acids, so you can think of a protein as a very, very long word, essentially, but the way that the positions of certain letters are going to influence others is not going to be the same as the way that language works, for example. You have to take that into account when you’re building a foundation model.

In a lot of ways, you can just, you know, take an off-the-shelf model and kind of hope that the structure will appear from the data naively, but that will never be as good as what you can get if you build something more bespoke. And then again, you know, this is something I’ve said before, but I’ll say it again: infrastructure is so, so important. Building in the resiliency to the problems that come from training something as large as a foundation model is incredibly important.

When you’re training a foundation model, you will have hardware failure. You rely so much on hardware that it’s no longer, you know, a possibility, it’s a certainty. Building from the ground up while knowing that this will happen, so that you actually take that into account as you build up your code base and as you decide which hardware to run on, you can’t afford not to do it.
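The checkpoint-and-resume resiliency described here can be sketched as a minimal loop; this is a hypothetical illustration, not Bioptimus infrastructure. The idea is to persist state atomically at every step so that a crash loses at most the work since the last checkpoint:

```python
import json
import os
import tempfile

def train(total_steps, ckpt_path, crash_at=None):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss": 100.0}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated hardware failure")
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real optimizer update
        # Write atomically: never leave a half-written checkpoint behind.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(10, ckpt, crash_at=6)  # the run dies mid-training...
except RuntimeError:
    pass
final = train(10, ckpt)  # ...and resumes where it left off
print(final["step"])  # 10
```

Real training loops checkpoint far less often than every step and must also capture optimizer and data-loader state, but the atomic-write-then-resume pattern is the same.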

[0:19:09.7] HC: So, there’s a great deal of engineering challenges that go into this.

[0:19:13.5] ZM: Oh yes.

[0:19:14.6] HC: And finally, where do you see the impact of Bioptimus in three to five years?

[0:19:18.5] ZM: Well, I like to think, you know, and of course, I could be wrong, that in the next couple of years, what we’re going to see is foundation models, and in particular Bioptimus’ foundation models, become a reference for biology tasks, right? So, underpinning in a lot of ways all future AI-driven biological discoveries. We’re starting out in the biomedical space with pharmaceuticals and other biomedicine.

But we have already seen that we’re getting interest from other sectors as well. If I were to think of specific applications that listeners might be aware of or that might be of interest, one thing I would really love to see is Bioptimus’ foundation models used to create things like AI virtual cells, digital twins to help pharma run trials more quickly and accurately.

And honestly, in cost-effective ways as well, to understand which drugs will have what kind of effect on different types of people. And, you know, eventually, the end goal is really to have other AI companies and pharmaceutical companies integrate the foundation models that we’re building into their AI engines, and use our foundation models in all the modeling projects that target downstream tasks in medicine.

[0:20:35.9] HC: This has been great. Zelda, I appreciate your insights today, I think this would be valuable to many listeners. Where can people find out more about you online?

[0:20:43.6] ZM: Well, we have a website, Bioptimus.com, and you can find a lot of information there.

[0:20:48.1] HC: Perfect, thanks for joining me today.

[0:20:50.5] ZM: Thank you for having me.

[0:20:51.3] HC: Alright everyone, thanks for listening. I’m Heather Couture and I hope you join me again next time for Impact AI.

[END OF INTERVIEW]

[0:21:01.3] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend, and if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]