AI is transforming drug discovery by making biological data more accessible and actionable, bridging the gap between complex sequencing data and real-world therapeutic breakthroughs. As Rick Schneider puts it, it's all about leveraging powerful models to “build use cases that matter and bring value.”

In this episode of Impact AI, we hear from the CEO and Co-founder of Helical to find out how bio-foundation models are transforming pharmaceutical research. Rick shares how Helical’s AI platform enables drug discovery by leveraging biological sequencing data without requiring companies to build their own models from scratch. He also reveals the challenges of working with high-dimensional biological data, the power of model specialization for specific therapeutic areas, and the growing role of open-source AI in healthcare innovation.

Whether you're in biotech, AI, or simply curious about the future of medicine, this episode offers invaluable insights into how AI is shaping the next generation of drug discovery. Tune in today!


Key Points:
  • Introducing Rick, his engineering background, and Helical’s mission.
  • The challenges of leveraging biological foundation models for drug discovery.
  • Understanding biological sequencing data and its complexities.
  • Key technical challenges: messy datasets, long-range dependencies, and model architecture.
  • How Helix, Helical’s mRNA foundation model, competes with industry leaders.
  • Three key factors in building biological foundation models: data, compute, and talent.
  • The shift from narrow AI to general-purpose AI in pharma.
  • Benchmarking and evaluating foundation models for different use cases.
  • Commercializing Helical’s platform through partnerships with pharma companies.
  • Insight into the role of open-source AI in advancing biological research.
  • The future of biological foundation models: scaling up for greater impact.
  • Rick’s vision for Helical as the backbone of in silico pharma labs.

Quotes:

“The question is, how do I leverage [powerful biological foundation models] and build use cases that matter and bring value? Helical is building a therapeutic-area-agnostic AI platform that is empowering single-cell RNA and DNA bio foundation models for drug discovery.” — Rick Schneider

“In bio, you can still innovate on the architecture side and not simply [with] the scale of the models. It's not simply by throwing more compute at the models that you get to the very best outcomes.” — Rick Schneider

“Be okay with being different in your approach and accept [that you will] be contrarian to certain things.” — Rick Schneider


Links:

Rick Schneider on LinkedIn
Helical
Helical on GitHub
Helical on Hugging Face
Introducing Helix-mRNA-v0
Helix-mRNA
Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics


Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1-hour strategy session now to advance your project.


Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. This episode is part of a mini-series about foundation models. Really, I should say domain-specific foundation models. Following the trends of language processing, domain-specific foundation models are enabling new possibilities for a variety of applications with different types of data, not just text or images. In this series, I hope to shed light on this paradigm shift, including why it’s important, what the challenges are, how it impacts your business, and where this trend is heading. Enjoy.

[INTERVIEW]

[0:00:49] HC: Today, I’m joined by Rick Schneider, CEO and Co-Founder of Helical, to talk about a platform for biological foundation models. Rick, welcome to the show.

[0:01:00] RS: Thanks for having me.

[0:01:01] HC: Rick, could you share a bit about your background and how that led you to create Helical?

[0:01:05] RS: Yeah, sure. Look, I’m actually an engineer by training. I always wanted to be a doctor. I was actually about to go to medical school with my co-founders, Maxime and Mathieu. And last minute, Maxime and I decided to do engineering instead, while Mathieu went ahead and became a cardiologist. That’s a bit on my background. I quickly transitioned into data science and AI, worked in big tech and at a German scale-up company, doing a lot of different things: product, go-to-market. Yeah, I truly enjoy doing all of those things. The entrepreneurial drive was always there. And for me, it was obvious to do something in the biology space, and I came across those bio foundation models. We will definitely talk more about those in a second.

And yeah, together with Mathieu and Maxime, we saw the huge potential these models have and the impact they will have on the pharma world to do things such as target identification, biomarker discovery, and many other things. And yeah, we saw the models, we saw the potential they have, but we saw how tough it is to use those models. And so we decided to go down the path of building a platform for those models.

[0:02:08] HC: Tell me more about what Helical does and why it’s important.

[0:02:12] RS: Yeah, sure. I mean, there are more and more bio foundation models coming out. Almost weekly now, you see new models that are very good and always bigger and better. And the biggest challenge lies more in the post-training world. How do I take this model and specialize it to my own data? How do I specialize it to a specific task with things such as fine-tuning, which is well known in the AI world but can be complicated to understand on the computational biology side? The question is really, how do I leverage those powerful models and actually build use cases that matter and bring value?

And Helical here is building a therapeutic-area-agnostic AI platform that is really empowering those single-cell RNA and DNA bio foundation models for drug discovery. We focus on use cases such as target identification, biomarker discovery, patient stratification, you name it. Really, end-to-end drug discovery pipelines, enabling companies to leverage those powerful models. Opportunistically, we also build our own bio models. We built Helix, an mRNA foundation model. We do that when there’s no good open-source alternative; then we go down that path.

[0:03:18] HC: You mentioned biological data. What does that mean? What types of data are you working with, and what are these models built upon? And how do you go about getting that data?

[0:03:28] RS: Yeah, it’s a good question. We mainly focus on sequencing data, such as DNA and RNA data, and also on bulk and single-cell expression data. And the future of those models actually lies in multimodality, right? I mean, how do I combine all of those different modalities together? How do I create abstraction layers on top of them?

Our approach is quite unique. We don’t buy or build our own data, at least not at this stage. We are really agnostic to that, and we bridge the gap from existing data to use cases. This means that our customers, big pharma companies, have generated very specific data on a specific scientific question they want to answer, and we leverage that data to specialize an existing foundation model, or their own foundation models, to this very specific data and use case. We really bridge that gap without ourselves needing to spend large amounts of budget on data. That being said, we aggregate all types of open-source data that is out there. And that’s also what we did for the pre-training of Helix, right? We simply aggregated all sources of mRNA data so that we were able to create quite large datasets in the end.

[0:04:38] HC: For those who aren’t as familiar with biological data, these types of data you mentioned, these are high-dimensional data sets, I imagine?

[0:04:46] RS: Yeah. I mean, they are simply huge in size, right? I mean, 3 billion nucleotides for a DNA sequence. And also, it’s quite messy. Because when those experiments are being generated, a lot depends on environmental factors as well, right? In what batch has the sequence been produced? And hence, you have a lot of complexities that you need to take care of that you would not have with simple text data. The quality, I would say, varies a lot across different datasets. And you definitely need to do a lot of pre-processing to harmonize that data before you can actually use it for training. So the gap is definitely larger with this messier, high-dimensional biological data versus text data.

[0:05:29] HC: What are some other types of challenges you run into in dealing with these datasets and building models based off of them?

[0:05:35] RS: Yeah. I mean, in the end, when you train a foundation model on text, it’s quite straightforward, right? You want to choose the right granularity for what you call a token. What is your model actually trying to predict? Well, in text, it’s easy, right? It’s a word. If you want to predict something, what you’re going to do is predict the next word. Well, in biology, it’s a bit more complex than that. How do you choose the right granularity for those tokens? Is it at the nucleotide level, the single letter of a sequence? How do you choose that resolution?
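The granularity choice Rick describes can be sketched in a few lines. The function names and the tiny vocabulary below are illustrative, not Helical’s actual tokenizer: the same RNA sequence can be tokenized one nucleotide at a time (finest resolution, longest token sequences) or as overlapping k-mers (coarser resolution, shorter sequences).

```python
def nucleotide_tokens(seq):
    """One token per letter: finest resolution, longest sequences."""
    vocab = {"A": 0, "C": 1, "G": 2, "U": 3}
    return [vocab[base] for base in seq]

def kmer_tokens(seq, k=3):
    """One token per overlapping k-mer: coarser, shorter sequences."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "AUGGCU"
print(nucleotide_tokens(seq))  # [0, 3, 2, 2, 1, 3]
print(kmer_tokens(seq))        # ['AUG', 'UGG', 'GGC', 'GCU']
```

The trade-off is exactly the one Rick raises: finer tokens preserve single-letter changes (important when one mutated nucleotide matters) but make already-long sequences even longer for the model.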

Then the second thing is that data is long-range. In the end, DNA data, again, 3 billion nucleotides. Something that happens at the very beginning of the sequence or a change in one of those nucleotides at the very beginning could impact something at the very end of the sequence. All of a sudden, your model needs to be capable of reading through a book with 3 billion words if I take the comparison of LLMs. And that’s super complex. And for training models, you need to take that into account.

I think that’s what we did when we trained Helix. We chose what you call a hybrid architecture: a mix of a transformer-based model, which is the typical foundation model you see in the ChatGPTs and so on, and something called a state space model. Those are newer models that we’re seeing more and more. They are particularly efficient in long-context tasks, and that works really well for this biological data.

And this mix of both approaches is actually quite interesting because you need lower budgets, because those state space models scale really well. It’s quite cheap to apply them to longer contexts. And in the end, by mixing both, we were actually very successful. Helix today is beating the existing foundation models in the messenger RNA space, such as Sanofi’s model and Johnson & Johnson’s model. Those were the big ones that were out there.
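The long-context efficiency Rick attributes to state space models comes from processing a sequence with a fixed-size recurrent state, so cost grows linearly with length, versus the quadratic pairwise attention matrix of a transformer. A toy NumPy sketch of a linear state space recurrence (not Helix’s actual architecture; all matrices here are random stand-ins):

```python
import numpy as np

def ssm_layer(x, d_state=4, seed=0):
    """Linear state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    rng = np.random.default_rng(seed)
    A = 0.9 * np.eye(d_state)        # stable, decaying state transition
    B = rng.normal(size=(d_state, 1))
    C = rng.normal(size=(1, d_state))
    h = np.zeros((d_state, 1))
    ys = []
    for x_t in x:                    # one linear pass over the sequence
        h = A @ h + B * x_t          # state carries long-range context
        ys.append(float(C @ h))
    return np.array(ys)

y = ssm_layer(np.ones(1000))         # a 1,000-token toy "sequence"
print(y.shape)                       # (1000,)
```

Doubling the sequence length doubles the work here, whereas full attention would quadruple it, which is why this family of models is attractive for sequences on the scale of genomes.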

And so even with very low budgets, we were able to compete with those much larger models, given the fact that we used this efficient combination. And actually, the paper got accepted to ICML, which is a top AI conference. Very cool to see that, in bio, you can really still innovate on the architecture side and not simply on the scale of the models. It’s not simply by throwing more compute at the models that you get to the very best outcomes.

[0:07:55] HC: It sounds like some of the research has been done on LLMs that may have formed the basis for what you have, but you really needed to understand the data and how to adapt these and do some things differently in order to really make it work on biological data.

[0:08:10] RS: Yeah, exactly. I mean, the architectures are the same, right? You’re not going to reinvent the wheel, in the sense that a transformer is a transformer, a state space model is a state space model, and they work particularly well for predicting sequences. It makes a lot of sense to stick with them. Obviously, you could go ahead and invent new architectures that work even better for biology, and I’m sure it will go this way. But for the sake of efficiency, yes, you keep the same architectures and you really innovate a lot on how you make that sequencing data actually work for those models, it being not text but biological sequencing data. And that is quite a challenge. And I think, as I mentioned, there are many design choices you need to make. How precise should the model be? What resolution should you use? How do you make it long-range? All of those things in biology are a lot more important and a lot more challenging to answer.

[0:09:02] HC: What does it take to build one of these foundation models with biological data? You’ve mentioned a number of the pieces here. But putting it all together, what does it take? What quantity of data? Different things like that?

[0:09:15] RS: Yeah. I mean, in the end, it’s really the same across the entire foundation model game. And it comes down to three things: data, compute, and talent. Do you have the right data to train it, and do you have enough of it? Now we’re looking at datasets of 50 to 100 million cells, so things start being interesting. The Chan Zuckerberg Initiative is working on the billion cell project. Those datasets are going to explode, and it’s actually led by the open-source world, which is quite interesting. That is really for building the foundation model itself. Data becomes way more important towards the prediction level. When you really want to take a model and predict something very specific, there we speak about very rare data samples, where often companies have maybe 20, 30 samples lying around. But that is really the question of how do I specialize the model.

But back to the building the foundation model, the second thing is compute. Obviously, it’s better to have a lot of that or become inventive on the models you use. We went for, as I’ve mentioned, this combination of the state space model and the transformer, making it way, way, way more efficient. And hence, with quite limited budgets, we had quite good results.

But I think the third one is the most important one still, and I don’t think that will change anytime soon. It’s talent. Can you attract the right people to work on those foundation models? And for biology, you really only have a handful of people that can do those things, because they need to sit at the intersection of AI, and really be aware of the latest things happening in AI, architectures and so on, but then have the full biological understanding. Because otherwise, you do not build something meaningful. You will simply take some biological data and then throw the latest AI techniques at it. But this won’t work. It’s all about those choices I’ve just mentioned. And you can only make them if you are really well-informed about the intersection of biology and AI, and you know how both impact each other. I think that is where we at Helical are particularly strong, and we really double down on getting the best people sitting at this intersection. For me, this has been absolutely key in the experience of building Helix.

[0:11:25] HC: What are some of the challenges you’ve encountered in building a foundation model?

[0:11:29] RS: I mean, honestly, it goes back to what we just said. How do you get that talent? How do you get that compute? How do you get the data? How do you aggregate it? One of the biggest hands-on, operational challenges was really aggregating the datasets across all the different open-source atlases that are out there. It just took a tremendous amount of time to get that right, get everything aligned to the same format, and get it ready for training. In the end, I would say that’s where the majority of our time was spent.

Architecture choices. How do you choose the right architecture for this? This is obviously also a big challenge. Because once you run or launch a training round, well, you have to wait for a week or two to see if it has worked, right? The choices you make have really strong impacts. And that’s very much true for all types of foundation models. And you often have to wait days, weeks, or months to see if your choices were correct. Obviously, you have some metrics to evaluate if the training is going well. But in the end, it’s only once the training is done that you see the true result that you have been able to generate. And that is very challenging, I have to say. A lot of uncertainty.

[0:12:38] HC: I’ve seen foundation models provide and assist projects in a variety of different ways. In your experience, and with biology data in particular, how do foundation models help you solve problems? Why is this such an important piece for drug discovery?

[0:12:54] RS: Yeah. I mean, I think the big game changer of foundation models is the fact that they are bridging the gap from what was known as narrow AI, AI trained for very specific problems, to more general-purpose AI, foundation models. And they differ dramatically from those traditional AI models, where the algorithm in the end needed labeled data to learn to predict something very specific. If you take that to the world of LLMs, imagine you want to predict whether a review that a customer leaves is positive or negative.

Well, traditionally, with narrow AI, you need a labeled dataset where humans actually look at all those reviews and tell you if they are positive or negative, and then the model learns that relationship. But you could only do that for one specific task. If there’s a new task you want to do, well, then you have to start from scratch and do new labeling. With foundation models, this is completely different. ChatGPT, the Mistrals, and so on, they are not specifically trained on one task. They read so much text that they develop an inherent understanding of it, and they can then automatically predict from that reading whether a review is positive or negative.

And the bio foundation models use the same concept, basically. They are trained on raw biological sequencing data, DNA and RNA, for example, and they learn the meaning captured in those sequences. A bit like, let’s say, learning the language of life to a certain degree. And that’s a huge game changer for pharma companies, because labeled data is often a bottleneck. If you can come in with a model that has already seen tons of data and did not need the labels, well, then you have a super strong foundation for your prediction.

And what we actually do with our platform, which is completely model agnostic, is we leverage those pre-trained models and then we help pharma companies specialize them. How do they leverage the, I don’t know, 20 labeled data samples that they might have to create a true expert model for their very specific task at hand? And so we work with big pharma doing exactly this for use cases such as target identification, biomarker discovery, patient stratification, and it works really, really well. Because, again, we are completely model agnostic. We will always take the best model for the task and data at hand. And I think it’s that approach that yields those great results. Today, there’s not one model that wins it all, and I have a strong conviction there won’t be.
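The specialization pattern Rick describes, a pretrained foundation model providing embeddings and a tiny head fit on roughly 20 labeled samples, can be sketched as follows. This is a toy illustration, not Helical’s platform: random vectors stand in for the frozen embeddings a real bio foundation model would produce, and the “expert model” is a minimal nearest-centroid head.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 32
X = rng.normal(size=(20, emb_dim))   # 20 "frozen" embeddings (stand-ins)
X[:10] += 2.0                        # shift class 0 so the toy is separable
y = np.array([0] * 10 + [1] * 10)    # the ~20 labels a pharma team might have

# Nearest-centroid head: about the simplest head you can fit on 20 samples.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    """Assign x to the class with the closest centroid."""
    return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

preds = [predict(x) for x in X]
print(sum(p == t for p, t in zip(preds, y)), "/ 20 correct")
```

The point of the pattern is that all the heavy lifting happened in pre-training; with good embeddings, even a head this simple can work on a few dozen samples, which is why labeled-data scarcity stops being the bottleneck.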

[0:15:20] HC: You’ve incorporated not just your own foundation model, but a variety of others into your platform. And you mentioned you take the winner. Does that mean for each experiment, you need to try out each foundation model and just pick the best, or is there an algorithmic way to identify which one to use, or how do you approach this?

[0:15:41] RS: That’s a very good question. We have put tons of effort into building an evaluation and benchmarking framework where we automatically evaluate all models that we integrate into our platform against all the tasks that we consider biologically relevant. And we do that across all the different therapeutic areas where we have data. This means that, in the end, you’re able to evaluate a model for its capacity to solve a biologically relevant task in a given therapeutic area. Let’s say oncology, for example. And this approach allows us to very quickly choose the winners for each use case. Because once you know that a model is good at biological tasks similar to the one you want to solve for your very specific type of data, well, then you’re good to go. You know that this model will win. You can obviously replicate that on your own data as well and then check if that’s still the case, because there’s a lot of variability in biological data. And often, models might perform or behave differently on unseen data than expected. So we often recommend rerunning the benchmarking on local data, on pharma companies’ data, as well.
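The selection loop Rick outlines reduces to a per-task leaderboard: score every candidate model on every biologically relevant benchmark, then pick the winner per task. The task names, model names, and scores below are made up for illustration.

```python
def pick_winners(scores):
    """scores: {task: {model: metric}} -> {task: best-scoring model}."""
    return {task: max(by_model, key=by_model.get)
            for task, by_model in scores.items()}

scores = {
    "cell_type_annotation": {"model_a": 0.81, "model_b": 0.88},
    "perturbation_response": {"model_a": 0.74, "model_b": 0.69},
}
print(pick_winners(scores))
# {'cell_type_annotation': 'model_b', 'perturbation_response': 'model_a'}
```

Note that different models win different tasks here, which is the practical argument for staying model agnostic rather than betting on a single "best" foundation model.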

[0:16:45] HC: How do you plan to commercialize your platform?

[0:16:47] RS: Yeah. We already commercialized it. In the end, it’s licensing out the platform. And we really partner with pharma companies directly, where they get access to the platform and obviously to us and our computational biologists, who work on very specific use cases with them. This way, they can leverage the platform across the company, across the different drug discovery programs at the different stages, to really cover all types of use cases, basically.

[0:17:14] HC: What role does open source play with your technology?

[0:17:18] RS: Honestly, I think it plays a huge role, not only for our technology but for biology generally speaking. Obviously, we build part of our platform on open-source models. And, by the way, our model library is completely open source as well. I mean, you can go on our GitHub and simply work with our platform, where you get all the models that are out there in one library, making them very easy to use. There was obviously the need for us to give back to the community.

But then, I think what you will see in biology, it’s just the beginning, really. We know so many labs and companies working on incredible models as we speak, actually, and many of them will be open source. Take the Arc Institute, for example. They just released Evo 2. That’s a 40-billion-parameter model. For those who don’t know what that size means, it’s simply the largest open-source model ever released, not only in biology. This includes all the text models that are out there. They are all smaller than what the Arc Institute has released with Evo 2. And it works super well. It has been trained on only one human reference genome and, obviously, many other species. And it still predicts mutations better than specialized models.

And also, in our case for Helix, we trained it and then we open sourced it. And this works super well. Today, Helix has more than 30,000 downloads. It’s used by researchers. It’s used by companies. And for us, it creates a lot of notoriety. It helps us to show how well those models work. And obviously, our platform then leveraged Helix to really build powerful use cases on top of it. And I think that is where our value proposition lies. We think that models themselves will be a commodity.

[0:18:59] HC: What does the future of foundation models for biology look like?

[0:19:02] RS: Yeah. I think, as I’ve mentioned, it’s really just the beginning. I think you will see more and more coming out. They get bigger, they get better. And this is what will happen. I think you will especially see the models getting larger in size, something that in text is changing a little bit. Because in the end, small models have the advantage of being more efficient at inference.

Well, in biology, you don’t care so much, right? If a task is solved in two days, but that task is basically finding a new drug, then it could even have taken a year and it would still be 10 times faster than what’s happening today, right? In the end, big models will still be able to capture more knowledge. I think the models will get bigger and bigger, which makes the challenge of using them larger and larger as well.

[0:19:48] HC: Is there any advice you could offer to other leaders of AI-powered start-ups?

[0:19:53] RS: That’s a good question. I think, generally speaking, what has worked well for us is really being okay with being different in your approach and accepting that you will be contrarian on certain things. In our case, it was really the choice of an AI platform versus a techbio with our own pipeline. And it worked super well because we had a strong conviction in our worldview, and our worldview happened to come true by us being a bit different and contrarian at the beginning. That was obviously challenging early on, because not everybody saw what we saw. And by the way, you do not control the timeline of when your vision comes true. But if it does, you’re well-positioned to take advantage of that. Yeah, be different and accept being contrarian.

[0:20:37] HC: And finally, where do you see the impact of Helical in three to five years?

[0:20:40] RS: We will continue to be on a growth trajectory. And I am convinced that we will be the company that powers all the in silico labs of the entire pharma industry. We really make it possible for pharma companies to leverage all existing models and do all of their experiments in silico through those models, and that is what we want to power.

[0:21:01] HC: This has been great. Rick, I appreciate your insights today. I think this will be valuable to many listeners. Where can people find out more about you online?

[0:21:08] RS: Yeah. I mean, you can just go to our website, helical-ai.com, or go to our GitHub and just look for Helical AI. There, you will see many example notebooks where you can try out those bio foundation models. It’s very accessible and very easy to use. And from there, you can get started for free with our open-source package.

[0:21:27] HC: Perfect. I’ll link to all of that in the show notes. Thanks for joining me today.

[0:21:31] RS: Thank you so much for having me.

[0:21:33] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you join me again next time for Impact AI.

[OUTRO]

[0:21:42] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share it with a friend. And if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]