Building on the trends in language processing, domain-specific foundation models are unlocking new possibilities. In the realm of drug discovery, Jason Rolfe is spearheading innovation at the intersection of AI and pharmaceuticals. As the Co-Founder and CTO of Variational AI, Jason leads a platform designed to generate novel small molecule structures that accelerate drug development. In this episode, he delves into how Variational AI uses foundation models to predict and optimize small molecules, overcoming the immense complexity of drug discovery by leveraging vast datasets and sophisticated computational techniques. He also addresses the key challenges of modeling molecular potency and why traditional machine-learning approaches often fall short. For anyone curious about AI's impact on healthcare, this conversation offers a fascinating look into cutting-edge innovations set to reshape the pharmaceutical industry. Tune in to find out how the types of breakthroughs we discuss in this episode could revolutionize drug development, bring new therapeutics to market across disease areas, and positively impact lives!


Key Points:
  • An overview of Jason’s background and how it led him to create Variational AI.
  • What Variational AI does for the small molecule domain for drug discovery.
  • How they use foundation models to predict and enhance the design of small molecules.
  • Defining small molecules, their appeal, and an overview of Variational AI's data sets.
  • What goes into training Variational AI's foundation model.
  • The computational infrastructure and algorithms necessary to process this data.
  • Challenges of predicting molecular potency against disease-related protein targets.
  • Various ways that Variational AI’s foundation model underpins everything they do.
  • Evaluating progress: balancing predictive success with experimental validation.
  • Lessons from developing foundation models that could apply to other data types.
  • Jason’s funding and research-focused advice for leaders of AI-powered startups.
  • The transformative impact of Variational AI’s technology on drug development.

Quotes:

“Rather than forming individual models for specific drug targets, we're creating a joint model over hundreds, eventually thousands of drug targets.” — Jason Rolfe

“Data quality is essential. In particular, if you're drawing from multiple different data sources, frequently, those sources aren't commensurable.” — Jason Rolfe

“If you don't have a proven track record where people are already throwing money at you, it is very challenging to try to bring a new technology from the drawing board into commercial application using venture funding.” — Jason Rolfe

“Whenever you're developing a new technology or product, you need to test early and often. Some of your intuitions will be good. Most of your intuitions will be a waste of time. The more quickly you can distinguish between those two classes, the more efficiently you can move toward success.” — Jason Rolfe


Links:

Variational AI
Variational AI Blog
Jason Rolfe on LinkedIn


Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.


Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. This episode is part of a mini-series about foundation models. Really, I should say domain-specific foundation models. Following the trends of language processing, domain-specific foundation models are enabling new possibilities for a variety of applications with different types of data, not just text or images. In this series, I hope to shed light on this paradigm shift, including why it’s important, what the challenges are, how it impacts your business, and where this trend is heading. Enjoy.

[INTERVIEW]

[00:00:49] HC: Today, I’m joined by guest Jason Rolfe, Co-Founder and CTO of Variational AI, to talk about a foundation model for small molecules. Jason, welcome to the show.

[00:00:49] JR: Thank you very much for having me, Heather.

[00:01:00] HC: Jason, could you share a bit about your background and how that led you to create Variational AI?

[00:01:05] JR: Sure. I’ve been working on generative machine learning since about 2006. I initially started working in some older domains of generative machine learning, like loopy belief propagation. Since then, I’ve worked on some unconventional computational substrates. I looked at machine learning as it could potentially be manifested in the brain, as well as developing machine learning algorithms that are compatible with quantum computers. Specifically, adiabatic quantum annealers.

About 7 years ago, I began to notice that a really powerful potential application domain of machine learning and, specifically, generative learning, is drug discovery. You have this problem where there’s a wide variety of different proteins in the body which mediate disease. For instance, they’re over-expressed or under-expressed, or overactive or underactive, and you need to either inhibit or activate them in order to return the body to something closer to healthy homeostasis.

Searching over the space of potential small molecules that could mediate these interactions, giving rise to the inhibition or activation of these proteins that will resolve the disease state, is an extremely difficult problem. It looks a lot like the sort of problem that’s solved by Stable Diffusion, for instance, where you provide a textual description and you ask the system to generate an image that matches that textual description.

In the drug discovery domain, you have a description of the sort of activation and inhibition of proteins in the body that you need in order to construct an effective drug. And you need to search over the space of small molecules in order to find one that’s going to realize this effect. This seemed like a really powerful potential application of the general technique of generative learning. And in order to fill what seemed like a vacuum, a need, we created Variational AI.

[00:03:03] HC: And so, what does Variational AI do?

[00:03:05] JR: We’re focused primarily on the prediction and optimization of molecular properties for the purpose of drug discovery. Again, diseases are often mediated by proteins which are either too active or not active enough. So, you want to either inhibit or activate them to return them to something closer to the healthy state. You need to find small molecules that do that. The space of small molecules over which you want to search is roughly 10 to the 60. This is a truly astronomical number. More than the number of water molecules in all the oceans of the world.
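For readers who want to check the scale Jason describes, the comparison works out with a few lines of arithmetic. The ocean-mass constant below is a rough reference figure pulled in for illustration, not something from the episode:

```python
# Rough comparison of drug-like chemical space (~1e60 molecules, a commonly
# cited estimate) against the number of water molecules in Earth's oceans.
OCEAN_MASS_G = 1.4e24      # approximate total mass of the oceans, in grams
WATER_MOLAR_MASS = 18.0    # grams per mole of H2O
AVOGADRO = 6.022e23        # molecules per mole

water_molecules = OCEAN_MASS_G / WATER_MOLAR_MASS * AVOGADRO  # roughly 5e46
chemical_space = 1e60

print(f"water molecules in the oceans: ~{water_molecules:.1e}")
print(f"chemical space is ~{chemical_space / water_molecules:.0e} times larger")
```

So the search space isn't merely larger than the number of water molecules in the oceans; it exceeds it by some thirteen orders of magnitude.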

There’s no way that you’re going to just enumerate a search space and try every molecule in it. If you did, you’d be looking at roughly 0% of the possible small molecules. It’s just as if you were trying to find an image that satisfied a particular textual description: if you tried to enumerate all the possible natural images, there’s a roughly infinite number of them. You never do that through an explicit search. You need to construct a latent search space and then optimize over it directly. This is exactly what Stable Diffusion does, and this is what we do for the small molecule domain for drug discovery.

[00:04:20] HC: Would you tell me a bit more about how you develop these models and how you structure it? I believe foundation models is part of the solution here.

[00:04:28] JR: Yeah. When we say foundation models, we mean that we’re training on basically all of the available data for a large variety of drug targets. Rather than forming individual models for specific drug targets, we’re creating a joint model over hundreds, eventually thousands of drug targets. This is analogous in the image domain to instead of making a detector for horses, and a separate detector for cats, and a separate detector for cars, you want a joint model over all possible image classes. This works really well on, for instance, the ImageNet data set, which has roughly a thousand examples for each of those thousand classes. Whereas if you were training individual models on just one image class, the performance would be much worse.

We’re trying to make a foundation model for drug discovery over the space of small molecules, in which we create a joint model over all of these different targets and target classes, leveraging generative transfer learning to bring all of that data together and obtain the best performance on each individual target.

[00:05:38] HC: What does the data you’re working with look like? How do you describe small molecules for the model?

[00:05:44] JR: For the benefit of the audience, when I say small molecule, this is in contrast to something like antibodies or proteins. Antibodies and proteins have huge numbers of atoms in them. Thousands or maybe tens of thousands for a large protein. When we’re talking about small molecules, these are the sort of conventional pharmaceuticals. Small molecules are especially desirable because they can be taken orally. Whereas, generally, with anything that’s protein-based, your digestive tract is designed to break up proteins and turn them into the basic building blocks of your body.

If you just swallow a protein, it gets digested. It doesn’t function as a drug. Small molecules are often orally bioactive. So, you can just take them as a pill, which is obviously very desirable. Our data sets contain information on both what small molecules are synthesizable as well as their pharmacological properties.

In terms of synthesizability, the organic chemistry reactions that exist and that are reasonably reliable and efficient are only capable of synthesizing a small fraction of the theoretically possible small molecules. In particular, a lot of the small molecules that exist in nature, in the body or in other organisms, these are formed through enzymatic pathways that are very difficult to replicate in a test tube. We’re trying to focus on those small molecules that are easily and efficiently synthesizable.

Part of the data set is giving the system huge numbers of examples of what makes a compound synthesizable. And the system is actually extremely effective at learning that and constraining itself to synthesizable chemical space. The other main domain of information that we need to incorporate into the model is the pharmacological properties: the degree to which a molecule inhibits all of the potential drug targets, as well as the properties governing its distribution through and elimination from the body. Obviously, if your drug doesn’t reach the site of action, it’s not going to do what you want. And if it piles up in the body to toxic levels, there are obvious problems there.

In terms of the potency information, the degree to which the molecules inhibit individual targets, you actually need to know about a huge range of different targets. Because in addition to having the desired effect at the specific target of interest, either activation or inhibition, you don’t want to influence any of the other proteins in the body since that will lead to side effects, which are the primary reason that drugs fail in clinical trials.

We assemble a large data set both on synthesizability as well as all of these pharmacological properties. And our foundation model is basically a joint generative model, a multimodal model encompassing both the molecular structure of these synthesizable molecules, how the atoms are arranged, as well as those pharmacological properties.

[00:08:45] HC: Where do you get all this training data in order to build the model?

[00:08:48] JR: There are a variety of sources that you could turn to. When you look at the academic literature, people attempt to make models based upon publicly available data deposited in databases like PubChem or ChEMBL. While these databases can be pretty extensive, they also tend to be extremely noisy, both because the experiments to evaluate these pharmacological properties are pretty finicky and intrinsically noisy, and because a lot of the publicly available data is collected as cheaply as possible. As a result, there’s a lot of noise in it. A lot of false positives. A lot of false negatives.

We in-license data commercially from commercial vendors and then we apply to it an extensive suite of proprietary cleaning and filtration techniques in order to scrub it as much as possible and get down to a core of data that’s actually reliable and usable.

[00:09:50] HC: We’ve talked a lot about the data that goes into building this foundation model. But what else does it take? What kind of computational infrastructure? What type of algorithms? Is there any kind of scale on – scale of the size of compute and data that go into this that would be worth talking about?

[00:10:08] JR: Yep. Compute and algorithms are definitely the right things to be talking about. I’d actually like to focus first on algorithms because I think that’s the part of the story that’s often glossed over. The algorithms have been extremely refined for application domains that are prominent in the academic literature, for which there exist large, well-curated standard data sets which can then drive research by setting a fixed benchmark that people can beat in order to get their papers published.

These sorts of data sets and, as a result, these sorts of models exist in domains like images and natural language. As a result, there are algorithms that are extremely tightly refined for these application domains. The algorithms that work best for natural language or that work best for images do not work well both from a theoretical perspective and empirically on small molecules.

One of the key distinctions here is that small molecules are neither linear chains of elements like the words in a sentence or a paragraph nor are they square grids of elements like the pixels in an image. Instead, they’re graphs with loops, and the sort of architecture that you’d want to use to process a graph with loops is intrinsically different than the sort of architecture that you would apply to either a chain or a square grid.
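To make the chain-versus-grid-versus-graph distinction concrete, here is a minimal sketch of a single message-passing update over a graph containing a ring, the basic operation underlying graph-structured architectures. The graph, the scalar features, and the averaging rule are toy assumptions for illustration, not anything from Variational AI's system:

```python
# Minimal message-passing step on a graph with a loop: nodes 0-1-2 form a
# 3-ring, node 3 hangs off node 2. Each node's new feature is its own
# feature averaged with its neighbours' aggregate -- a stand-in for one
# layer of a graph neural network. Neither a chain nor a grid architecture
# expresses this neighbourhood structure naturally.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # adjacency list
features = {0: 1.0, 1: 0.0, 2: 0.0, 3: 4.0}           # toy scalar per node

def message_pass(graph, features):
    """One round: mean-aggregate neighbour features, mix with the node's own."""
    updated = {}
    for node, neighbours in graph.items():
        agg = sum(features[n] for n in neighbours) / len(neighbours)
        updated[node] = 0.5 * features[node] + 0.5 * agg
    return updated

after_one_round = message_pass(graph, features)
print(after_one_round)
```

Stacking several such rounds lets information flow around the ring, which is exactly the kind of propagation a sequence or image architecture has no native mechanism for.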

There are a couple of algorithmic types that intuitively seem very well suited, the most prominent of these being graph neural networks. Graph neural networks have achieved reasonable success in some domains, but they’re surprisingly ineffective in predicting the potency of small molecules, which, as I hope we’ll discuss in more detail later, is an extremely difficult problem. Much less straightforward than you’d imagine.

There’s definitely an aspect of developing novel algorithms specifically designed for the problem domain: both the fact that you have these graphs with loops as the fundamental structure that you’re operating on, as well as the fact that you’re trying to predict and then optimize the potency of these small molecules, embodied as graphs of atoms and their bonds, against specific protein targets. And this is a sort of data that is quite unlike image categories, for instance. This is definitely one aspect.

The other aspect you asked about was the amount of compute. And here, you’re actually in a fairly fortunate position. The amount of data that’s available for these drug discovery problems, while relatively large, meets the minimum threshold required to apply these strong deep learning and generative modeling techniques, yet it’s significantly less than you have in the natural language domain, for instance. You can build a pretty effective model on a single GPU rather than having to parallelize your model across tens or hundreds of GPUs, which makes the problem more accessible to a smaller startup like Variational AI, in contrast to a giant company like Google.

[00:13:24] HC: You’ve alluded to some of the challenges already in building this foundation model. Are there other challenges? And which of these is kind of the greatest one that you’ve encountered?

[00:13:33] JR: I think I want to follow up on my answer to the last question and just talk about some of the difficulties of predicting the experimental potency of these small molecules against the various protein targets that mediate disease. Because that is a key aspect of why this is such a difficult problem. In order for a small molecule to activate or inhibit a protein, you need to take into account the fact that the small molecule and the protein are really floppy so they exist in an ensemble of different states. The degree to which the protein wraps itself around the small molecule is extremely impactful in governing how tightly they interact and how big the impact is.

You need to take into account entropy. You want a large collection of good states rather than just one really strong binding pose. And there’s water all around and the water is constantly jiggling and forms extremely strong interactions with the small molecules and the proteins. And that needs to be taken into account in a first principles approach.

People try these first principles approaches using techniques like molecular dynamics and binding free energy evaluations. But they’re not especially accurate. And they take a day of GPU time to evaluate a single small molecule against a single protein target. It’s not possible to scale these up to the sizes required to do the sort of drug discovery that’s actually required.

One of the challenges of building the foundation model, of solving this problem with machine learning, is that this is such a difficult problem, particularly from first principles. The foundation model takes a more phenomenological approach instead. The analogy I would draw is in the image domain. You could imagine trying to figure out what a cat is by modeling physics, and chemistry, and biology, and building up a bottom-up notion of what makes a cat a cat. That is not how machine learning works.

Machine learning works directly on the pixels and you give it some labels, and on that basis, it’s able to discover the relationship between the pixels and the category labels. And we believe and we are seeing that that is an extremely effective approach for small molecule drug discovery. And that’s exactly where the foundation model is fitting in. The foundation model is realizing this phenomenological mapping between molecular structure and the potency and other pharmacological properties rather than trying to do everything through first principles.

[00:16:08] HC: How are you currently using your foundation model? And how do you plan to use it in the future?

[00:16:12] JR: The foundation model is the basis of everything that we do in terms of our commercial interactions. Customers come to us with the pharmacological properties of the drug that they want. They want to inhibit one particular protein. They want to avoid interacting with a selection of other proteins. There are particular constraints on the way in which the molecule enters the body, is distributed through it, and is then eliminated.

We use the foundation model to search and optimize over a latent space, a hidden-space representation of all synthesizable drug-like molecules, in order to find those points in that search space that are predicted to have exactly the properties that the customer wants. And then we project back from that search space into the space of molecular structures in a manner that’s very analogous to the way Stable Diffusion works. But in a system that has been designed from the ground up to deal with small molecules and their pharmacological properties rather than images.
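The search-then-decode loop described here can be caricatured in a few lines. Everything below is an illustrative assumption: a hand-written "predicted potency" surface stands in for the learned predictor, a trivial string decoder stands in for structure generation, and stochastic hill climbing stands in for the actual optimizer:

```python
import random

def predicted_potency(z):
    """Toy smooth score over a 2-D latent space, peaked at (0.3, -0.7)."""
    x, y = z
    return -((x - 0.3) ** 2 + (y + 0.7) ** 2)

def decode(z):
    """Placeholder for projecting a latent point back to a molecular structure."""
    return f"candidate({z[0]:.2f},{z[1]:.2f})"

def hill_climb(score, z, step=0.1, iters=500, seed=0):
    """Simple stochastic hill climbing: keep any random nudge that improves."""
    rng = random.Random(seed)
    best, best_score = z, score(z)
    for _ in range(iters):
        cand = (best[0] + rng.uniform(-step, step),
                best[1] + rng.uniform(-step, step))
        if score(cand) > best_score:
            best, best_score = cand, score(cand)
    return best, best_score

z_opt, s_opt = hill_climb(predicted_potency, (0.0, 0.0))
print(decode(z_opt), s_opt)
```

The point of the sketch is the shape of the workflow, not the components: optimize in the continuous latent space, then decode the winning point, rather than enumerating discrete structures.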

[00:17:13] HC: When you’re evaluating your foundation model and evaluating the progress you’ve made, how do you know whether it is good? How do you know whether one version of it can solve problems better than a previous version?

[00:17:25] JR: This is actually a really tricky question in the domain of drug discovery. In machine learning in general, the way you do this sort of thing is you’d create a train test split by randomly selecting, let’s say, 10% of your data points to hold out from the training data and then evaluate on as the test set. If you have solely a generative model, you’d look at the log-likelihood of those held-out elements of the test set. If there’s also some sort of discriminative aspect, you’d want to check that the properties of those held-out test elements are correctly predicted.

In the drug discovery domain, doing this sort of random train test split doesn’t work at all. When medicinal chemists are trying to discover a new drug, they use a sort of intuitive optimization approach in which, starting from the previous best molecule that’s been found, they make a bunch of small perturbations, see which perturbation leads to improved properties, try to sort of intuit a relationship there, a trend, and then follow that trend line. They do this iteratively over and over and over again, which is part of the reason why drug discovery takes years and this stage costs many millions of dollars.

As a result, the data that they generate comprises sequences of very closely related molecules. And structurally closely related molecules often have very similar pharmacological properties. If you make a random train test split, you end up splitting these series (they’re called SAR, structure-activity relationship, series) across the train and test sets. You’ll have structurally extremely similar molecules, which might differ by just a few atoms, in the train set versus the test set.

If you merely memorize the train set and then when queried on the test set, pick the properties of the most similar element of the train set, you’re already basically at the limits of accuracy of these models. But this doesn’t capture the actual drug discovery task where you want to move into new regions of chemical space on which you don’t have any experimental data.

If you create a bad train test split, your models are going to look like they perform wonderfully on these retrospective tests. And then when you actually try to use them, they’re going to fall flat on their face. And this has actually been the experience in industry using these models more or less.

When we evaluate our models, we are extremely careful to construct train test splits that reflect the use case that we care about where the elements of the test set are sufficiently dissimilar from any element of the train set. So, you’re actually measuring the critical generalization that is necessary for the system to be commercially and industrially useful.
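The splitting discipline described above can be sketched directly: treat fingerprints as feature sets, group molecules whose Tanimoto similarity exceeds a threshold, and assign whole groups to one side of the split so that no test molecule has a near-duplicate in training. The fingerprints, threshold, and greedy clustering below are illustrative assumptions, not Variational AI's actual procedure:

```python
def tanimoto(a, b):
    """Similarity of two fingerprint sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster_split(fingerprints, threshold=0.6, test_fraction=0.3):
    """Greedily cluster similar molecules, then assign whole clusters to test
    (up to the budget) so close analogues never straddle the split."""
    clusters = []
    for i, fp in enumerate(fingerprints):
        for cl in clusters:
            if any(tanimoto(fp, fingerprints[j]) >= threshold for j in cl):
                cl.append(i)
                break
        else:
            clusters.append([i])
    test, budget = [], int(test_fraction * len(fingerprints))
    for cl in clusters:
        if len(test) + len(cl) <= budget:
            test.extend(cl)
    train = [i for i in range(len(fingerprints)) if i not in set(test)]
    return train, test

# Four "molecules": 0 and 1 are near-duplicates (a SAR pair); 2 and 3 differ.
fps = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"E", "F"}, {"G", "H", "I"}]
train, test = cluster_split(fps)
```

A random split could easily put molecule 0 in training and its near-twin 1 in the test set, making memorization look like generalization; the cluster-level assignment keeps the pair on one side.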

[00:20:16] HC: That gives you a way to measure whether your model is good and whether you’re making progress. But in figuring out what to do next, how to improve your model further, how do you identify that with foundation models?

[00:20:28] JR: Obviously, we look at the output of the model on the test set. And on the train set, we look at training curves to see if there’s any sort of instability, for instance. But, actually, the technique that we find often gives the greatest insights is using our models for optimization: searching over this latent or hidden space for points that have particular predicted properties, and then looking at what molecules those points correspond to.

And this probes the weaknesses of the model. It drills in towards the exact points where the model is, for instance, overconfident and is making a structurally impossible prediction. We can then go back into our data. We can go back into the structure of the model. We can look at the overall dynamics of the model in order to try to understand why it’s choosing these obviously bad molecules. But by doing the optimization, rather than just looking at the test set, this really pushes the model to the limits. It finds exactly the places where it is most likely to be making inaccurate predictions, which enables us to make the most impactful changes to the model.

[00:21:42] HC: Are there any lessons you’ve learned in developing foundation models that could be applied more broadly to other data types?

[00:21:49] JR: Definitely, we found that data quality is essential. And in particular, if you’re drawing from multiple different data sources, frequently those sources aren’t commensurable. The experimental methodology that’s applied in one data source, from one study, is different than the experimental methodology applied in a different source, in a different study. And this leads to systematic differences between the labels on those data points, which need to be accommodated and accounted for. Otherwise, you end up injecting overwhelming amounts of noise into the model.

You can easily imagine that similar sorts of things could occur even in the visual domain. If the cameras that are used or the lighting that’s used differs systematically between data sources, this has an enormous impact on the pixels in the image. Even if to a human observer, it doesn’t seem as if there is an obvious difference in the content, the subject of the image.

The other thing that we’ve really focused on, which I think is applicable across many other domains, is that the machine learning architecture needs to be specifically designed for the problem that you’re trying to solve. If you’re not trying to solve a natural language problem, if you’re not trying to solve an image problem, then probably you’re going to need to modify the architecture in order to take that into account.

[00:23:24] HC: Thinking more broadly about your role as a founder, is there any advice you could offer to leaders of AI-powered startups?

[00:23:31] JR: One of the things that I did not realize going in but becomes evident as you try to interact in this sort of space is that venture capitalists aren’t really positioned to evaluate novel technologies that are still in development. If you have a technology where you can demonstrate a proof of concept where you’re ready to start interacting with initial customers and you can show that initial market traction right off the bat, that’s the level at which VCs are able to analyze and are willing to make a bet.

Coming from the more technical side, I and many other people, I’m sure, are used to spending a lot of our time trying to evaluate the merits of a new technology, of a new technique, and trying to project its application and its impact down the line. And we are each personally making a bet of the form: using a modified version of the following algorithm, I think I will be able to solve the following problem.

And sometimes you’re right. Sometimes you’re wrong. Sometimes you’re in between. That’s not the sort of prediction and the sort of bet that venture capital is designed to make. It’s designed to make business bets rather than fundamental technology bets. It seems to me like in the cases where VCs actually are making a fundamental technology bet, what they’re really doing is betting on the founder and the founders’ team rather than betting on the technology. If you don’t have a proven track record where people are already throwing money at you, it is very challenging to try to bring a new technology from the drawing board into commercial application using venture funding.

Another thing that I’d highlight, for people who aren’t coming so much from the academic machine learning side, is that machine learning really is its own branch of science. There are certainly many possibilities to create a new company using proven basic technology from machine learning, merely applying it in a new business context or with a new business model without really stretching the capabilities of the technology in any fundamental way. But if you are trying to develop a truly new capability, this requires people who have experience doing machine learning research. Not people who have merely applied existing technologies. And the set of people who are able to do that is significantly smaller than the set of people who are great at engineering solutions using well-developed technologies.

Related to that, something that has long struck me, in terms of the disconnect between what you see in the machine learning literature versus its impact on startups and the commercial and industrial ecosystem, is the sheer amount of noise in the academic literature. There are hundreds, thousands of papers published every year. Not hundreds or thousands. Maybe tens of thousands. Each introduces some sort of novel technique, and very few of these stand the test of time.

If you look at the current state-of-the-art solutions, the set of basic techniques that allow them to achieve their truly exceptional results is tens of fundamental components. And, of course, hyper-parameter tuning and tweaking on top of that. But in trying to pull from the academic literature in order to create a new product, you need to be very cautious and judicious in what aspects you take. The proven, reliable things are probably going to be a much better investment of your time than the hot new paper that was just published in [inaudible 00:27:20].

And the final thing that I’d say is that, whenever you’re developing a new technology or product, you need to test early and often. Because some of your intuitions will be good. Most of your intuitions will be a waste of time. More effort than they’re worth. And the more quickly you can distinguish between those two classes, the more efficiently you can move towards success.

[00:27:45] HC: And, finally, where do you see the impact of Variational AI in three to five years?

[00:27:49] JR: In three to five years, we hope to be radically accelerating early-stage drug discovery for both small biopharma companies as well as the large pharmas. Interacting throughout the entire spectrum, from the initial identification of small molecules for new targets that have pretty good potency and selectivity but are still far from an actual drug candidate, through the entire process of refining those molecules until you’re ready to start animal testing and, eventually, human clinical trials. Eventually, we’d perhaps even explore developing our own drugs for out-licensing to large drug companies for clinical trials.

[00:28:31] HC: This has been great. Jason, I appreciate your insights today. I think this will be valuable to many listeners. Where can people find out more about you online?

[00:28:39] JR: They can go to our website, variational.ai. Or they can check out our blog, variationalai.substack.com.

[00:28:47] HC: Perfect. Thanks for joining me today.

[00:28:49] JR: It was great speaking with you, Heather. Thank you very much for having me.

[00:28:51] HC: All right, everyone. Thanks for listening. I’m Heather Couture. I hope you join me again next time for Impact AI.

[OUTRO]

[00:29:01] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. And if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]