In the traditional paradigm, it can take up to ten years for a drug to come to market. For this episode, I am joined by guest Aaron Morris, Co-founder and CEO of PostEra, to talk about using AI to accelerate medicinal chemistry and bring cures to patients faster than ever before.

Aaron breaks down the medicinal chemistry process and explains how PostEra applies machine learning to drug discovery. The data landscape within drug discovery is particularly challenging and today, we learn about PostEra’s approach to gathering data, the data sets they build from, and how they find new uses for project-specific data. Hear about the importance of model interpretability and how to get a competitive advantage as an AI-powered startup.


Key Points:
  • Aaron Morris’ background in mathematics and how it led to the creation of PostEra.
  • The scientific disciplines involved in developing a drug.
  • PostEra’s focus: building the world’s most advanced ML platform for medicinal chemistry.
  • Aaron explains the process of medicinal chemistry.
  • How PostEra applies machine learning to the drug discovery process.
  • The challenging data landscape within drug discovery and the data sets PostEra builds from.
  • PostEra’s approach to gathering data, and how they use it.
  • The challenge of finding new uses for project-specific data.
  • How PostEra validates its models.
  • Why PostEra makes its models less black box and how they go about it.
  • The importance of model interpretability and how PostEra develops interpretable ML.
  • Aaron’s advice for other leaders of AI-powered startups.
  • His vision for PostEra’s impact in the next three to five years.

Quotes:

“Though being reasonably competent on the machine learning side, I had a very, very steep learning curve when it came to getting up to speed with drug discovery chemistry and the applications of AI in that domain.” — Aaron Morris

“Drug discovery is going from biology to chemistry to medicine and PostEra squarely focuses, at least for now, on the chemistry angle. Our main focus is to build the world’s most advanced machine learning platform for what is referred to as medicinal chemistry.” — Aaron Morris

“PostEra is really the first AI company to pioneer machine learning across all three stages of how to design a molecule, how to make the molecule, and how to select the optimal set of molecules to test.” — Aaron Morris

“There is a lot of project-specific data that gets generated, and often what that means for PostEra is we’re having to be very inventive about how we try to get the most out of data even if it is not relevant.” — Aaron Morris

“If you want to build defensibility as a company, you have to have more than just innovations on model architecture.” — Aaron Morris

“Your typical drug today is taking anywhere between eight to ten years to come to market and obviously, we want to really accelerate that.” — Aaron Morris


Links:

Aaron Morris on LinkedIn
Aaron Morris on Twitter
PostEra
PostEra on Twitter


Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.
Foundation Model Assessment – Foundation models are popping up everywhere – do you need one for your proprietary image dataset? Get a clear perspective on whether you can benefit from a domain-specific foundation model.


Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.

[INTERVIEW]

[0:00:34.1] HC: Today, I’m joined by guest Aaron Morris, Co-founder and CEO of PostEra, to talk about accelerating medicinal chemistry. Aaron, welcome to the show.

[0:00:42.6] AM: Thanks, it’s really good to be here.

[0:00:44.3] HC: Aaron, could you share a bit about your background and how that led you to create PostEra?

[0:00:48.3] AM: Yeah, absolutely, Heather. So I come from a mathematics background. I have always loved mathematics, and after graduating from university, well, firstly, I went into investment banking. I was actually using mathematics and coding and, ultimately, machine learning algorithms to try and optimize trading strategies in the stock market, which has got absolutely nothing to do with biotech or with developing drugs.

However, I continued to maintain a good friendship with a long-term friend of mine, Dr. Alpha Lee, who was a mathematician with me at Oxford in the UK. He went on to pioneer a lot of the advances in AI for drug discovery throughout 2016, ‘17, and ‘18. In 2019, we sat down together and I felt that I could help him take the academic advances his group had made at Cambridge and form a company around them.

So though being reasonably competent on the machine learning side, I had a very, very steep learning curve when it came to getting up to speed with drug discovery chemistry and the applications of AI in that domain.

[0:02:01.2] HC: So, what does PostEra do and why is this important in developing better treatments?

[0:02:06.5] AM: Sure, I guess I like to frame it as a kind of high-level transition between scientific disciplines, and then focus on exactly which discipline PostEra works in. So at a very high level, developing a drug starts out in biology. You are seeking to identify what is the biological mechanism that is broken in the human body that is causing a given disease.

That is a biological problem. You then transition from biology into chemistry, which is how do you develop a chemical solution, we call it a pill typically, and you know, finding a way to not only bring about a cure for the disease but also ideally, leave no harm to the patient as well and so you’re balancing the safety of the chemistry with the efficacy of the chemistry.

And then finally, it becomes a medicine problem, where you are seeking to identify the correct patients who are appropriate for the clinical trial and seeking to measure the actual health benefits that your drug should provide.

So drug discovery is going from biology to chemistry to medicine, and PostEra squarely focuses, at least for now, on the chemistry angle. Our main focus is to build the world’s most advanced machine learning platform for what is referred to as medicinal chemistry.

[0:03:32.8] HC: So what is medicinal chemistry specifically? Can you drill down a bit more there?

[0:03:37.6] AM: Yeah. So medicinal chemistry is the process of developing often novel chemical matter. So the chemical structures that we’re all familiar with from chemistry classes in school, you’ve got a bunch of carbons and other atoms of various types, like fluorines, and the idea is you try and develop the chemical compound to satisfy a series of properties.

Those properties can be a list of 10 to 15 of them that are typically broken into two camps, the first camp being referred to as pharmacodynamics, which is how does the drug affect the body, and then pharmacokinetics, which is how does the body affect the drug. So it’s really this multiparameter optimization to get a chemical compound that satisfies these 10, 12, 15 properties that you need for the FDA to sign off for you to run a clinical trial.

So that is the process of medicinal chemistry: finding a molecule that is potent in the body, finding a molecule that is safe in the body, and ultimately, within the word “safe,” there are a lot of much more finely grained constraints that you have to satisfy.

[0:04:50.8] HC: So the next big question is, how do you apply machine learning to medicinal chemistry, where do you start?

[0:04:57.2] AM: Well, the medicinal chemistry process can be summarized as a cycle of designing compounds, making compounds, and then testing compounds. Then you get the data back from your experiments and then you design some new ones, you make them and you test them.

So for PostEra, we’ve really focused on applying machine learning at each of these three stages. In fact, traditionally within the AI for drug discovery space, the main emphasis was really on the first stage: how do we get algorithms to dream up molecules? Generative chemistry, think about ChatGPT but for molecules. Although this is an important problem, it’s often not the rate-limiting step in the actual whole cycle time.

In fact, the rate-limiting step is often just making the molecule in a lab, and a lot of the frustration with the early AI for drug discovery approaches was that the algorithms would just generate molecules which were near infeasible to ever practically make in a lab. So PostEra is really the first AI company to pioneer machine learning across all three stages of how to design a molecule, how to make the molecule, and how to select the optimal set of molecules to test.

So that you’re getting the most informative data back at each stage, leveraging a lot of work in active learning, which you can imagine is a very fruitful avenue in this drug discovery cycle.
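
As a rough illustration of the "select the optimal set of molecules to test" step, the sketch below shows one common active learning heuristic, uncertainty sampling with an ensemble. It is an assumption for illustration only and not PostEra's method: the function names are hypothetical, each "molecule" is reduced to a single number, and each "model" is a toy function.

```python
from statistics import mean, pstdev


def ensemble_predict(models, molecule):
    """Return (mean, spread) of the ensemble's predictions for one molecule."""
    preds = [model(molecule) for model in models]
    return mean(preds), pstdev(preds)


def select_batch(models, candidates, batch_size):
    """Pick the candidates the ensemble disagrees on most (uncertainty sampling)."""
    scored = [(ensemble_predict(models, c)[1], c) for c in candidates]
    # Sort by spread only, largest disagreement first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:batch_size]]


# Hypothetical stand-ins: a molecule is one numeric descriptor and each model
# is a slightly different linear function of it.
models = [lambda x: 1.0 * x, lambda x: 1.2 * x, lambda x: 0.8 * x]
candidates = [0.5, 2.0, 4.0, 1.0]

# The two candidates with the widest ensemble disagreement go to the lab next.
chosen = select_batch(models, candidates, batch_size=2)
print(chosen)
```

In a real design-make-test cycle the measured results for the chosen batch would be added to the training data and the models refit before the next round.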

[0:06:21.7] HC: So, what kind of data are you working with and applying machine learning? What does that data look like?

[0:06:27.5] AM: I think the data sources within science in general, and certainly drug discovery, do look quite a bit different from the kind of traditional computer vision and NLP approaches and data sets that you have in other industries. So firstly, the data sets are typically much, much smaller. You are rarely dealing with “big data,” except maybe for some related pretraining tasks. Secondly, the data is inherently much noisier.

Just the nature of science and the complexity of experiments means that you often have variability within your assays, within your experiments. So you’re dealing with low data often, you’re dealing with noisy data very often, and to make things even worse, if you think about what the unit cost is to generate a new data point in each industry, that unit cost in my industry is very, very high.

If I want to get a new picture of a cat or a dog to train my computer vision classification model, it’s not too hard to find another picture of a cat or a dog, but it is in fact very expensive to generate new experimental data. So the data landscape within drug discovery, I would say, is very challenging, but that data can consist of a variety of things. It can consist of things like, what are the properties of certain existing compounds?

And then a machine learning model, for example a graph neural network, is able to extract the salient features of a chemical structure and understand how that translates to a given chemical property, such as whether it’s going to be soluble in the bloodstream, whether it’s going to be metabolized too quickly or too slowly, or what the permeability is going to be like. So these often are the data sets that we deal with.

There are then additional data sets which refer to synthesis, something we care about deeply at PostEra. What is the probability that this given reaction is actually going to work? So these are the different types of data sets that we’re often building off at PostEra.

[0:08:31.4] HC: So especially given that it is so difficult to go and get new data, how do you go about gathering data, and do you need to annotate it as well?

[0:08:39.2] AM: So there’s certainly a plethora of approaches, and we’re unashamedly adopting pretty much most of them. The first is, yes, there certainly is public data and commercial data that you can access. The problem is it’s often very, very noisy and generated from a plethora of different sources.

Now, that doesn’t necessarily mean that you need to throw the data away. Often, what it can do is effectively help warm-start your machine learning model by at least embedding some level of chemical understanding as a pretraining task, before you then fine-tune the model on what you might call more reliable and campaign-specific data.

Now, in that case, you can generate your own data. We do that at PostEra. That is expensive and it takes time. So alternatively, you can actually try and get access to other people’s proprietary data. This is difficult to do, but it is something PostEra has done, where we have struck deals with a pharma company where, effectively, we get access to that proprietary data in return for some services that we offer.

So I would say it’s a mix of getting the best out of what is in the public and commercial domain, generating your own proprietary internal data, which grows over time, and then third, seeking to strike partnerships where you can access larger data sources that maybe more established pharmaceutical companies have.
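
The warm-start idea in this answer can be sketched with a deliberately tiny model. The example below is an assumption for illustration, not PostEra's pipeline: the model is a single weight, the "public" and "project" data sets are made-up numbers, and `fit` and `mse` are hypothetical helper names. It pretrains on the larger noisy set, then fine-tunes for a few steps on the small clean set, and compares that against starting from scratch.

```python
def fit(data, weight, steps, lr=0.05):
    """Gradient descent on mean squared error for the model y = weight * x."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def mse(data, weight):
    """Mean squared error of y = weight * x on the given (x, y) pairs."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


# Made-up data: a larger, noisy "public" set (slope near 2.0) and a small,
# clean "project" set (slope 2.2). Both are assumptions for the sketch.
public = [(1.0, 2.3), (2.0, 3.7), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 11.9)]
project = [(1.0, 2.2), (2.0, 4.4)]

# Pretrain on the noisy set, then fine-tune briefly on the project data.
pretrained = fit(public, weight=0.0, steps=200, lr=0.01)
warm = fit(project, weight=pretrained, steps=3)

# Baseline: the same three fine-tuning steps from an untrained weight.
cold = fit(project, weight=0.0, steps=3)

print(mse(project, warm), mse(project, cold))
```

With the same small budget of fine-tuning steps, the warm-started weight ends up closer to the project data than the cold-started one, which is the point of pretraining on imperfect data rather than discarding it.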

[0:10:01.7] HC: So once you’ve obtained this data, does it need to be annotated for your purposes? What is it you’re trying to predict from it and is that something that needs to be annotated?

[0:10:10.4] AM: I would say less annotation but more cleaning. There is a lot of nuance around how an experiment is run and, therefore, whether you really want to combine that data with another similar experimental data point within the same model output, or whether you want to try and run a more multitask approach, where you treat the outputs as two separate heads on your deep neural net but allow the weights to be shared.

So you’re trying to share some combined learnings. These are the types of cleaning and decisions that you need to try and make upfront. The things you’re trying to predict are typically that set of properties: “Is this molecule going to be soluble? Is this molecule going to be potent? Is this molecule going to get metabolized by the liver too quickly?”

Is this molecule going to interact with receptors in the body that you don’t want it to interact with, such as one referred to as hERG, which can generally lead to cardiac arrest? You can also, in these models, think about when we’re designing these synthetic routes. You’re effectively designing a recipe for a molecule, and there, your machine learning task is to predict, will this reaction work?

If I combine A and B, will I get C coming out of the end of the reaction? So ultimately, it’s when you combine all these things together in a more multiparameter approach that you really begin to accelerate drug discovery.
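
The "two separate heads with shared weights" idea mentioned in this answer can be sketched in a few lines. This is a minimal illustration under assumed names and toy numbers, not PostEra's architecture: `shared_trunk`, `head`, `multitask_loss`, and the assay names are all hypothetical. A shared layer feeds one linear head per assay, and the loss skips any assay that was not measured for a given molecule.

```python
def shared_trunk(features, trunk_weights):
    """Shared layer: one ReLU unit per weight row, reused by every head."""
    return [max(0.0, sum(w * x for w, x in zip(row, features)))
            for row in trunk_weights]


def head(hidden, head_weights):
    """Task-specific linear readout on top of the shared representation."""
    return sum(w * h for w, h in zip(head_weights, hidden))


def multitask_loss(features, labels, trunk_weights, heads):
    """Mean squared error over only the tasks that have a measured label."""
    hidden = shared_trunk(features, trunk_weights)
    errors = [(head(hidden, heads[task]) - y) ** 2
              for task, y in labels.items() if y is not None]
    return sum(errors) / len(errors) if errors else 0.0


# Toy weights and inputs, chosen only to make the arithmetic easy to follow.
trunk_weights = [[1.0, 0.0], [0.0, 1.0]]
heads = {"assay_a": [1.0, 1.0], "assay_b": [2.0, 0.0]}
features = [1.0, 2.0]

# Only assay_a was measured for this molecule, so assay_b is masked out.
loss_one = multitask_loss(features, {"assay_a": 2.0, "assay_b": None},
                          trunk_weights, heads)

# Both assays measured: both heads contribute through the same trunk.
loss_both = multitask_loss(features, {"assay_a": 2.0, "assay_b": 2.0},
                           trunk_weights, heads)
print(loss_one, loss_both)
```

Keeping similar experiments as separate heads lets them share what the trunk learns without forcing their raw values onto one output scale.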

[0:11:37.7] HC: So you have mentioned a number of challenges already: the small size of your data sets, the noisy nature of the data, and that it is difficult to go and collect more data. Are there other common challenges that you’re dealing with?

[0:11:49.9] AM: I think often the challenge is that some of the data that you will generate for a given drug you’re working on will be totally useless for the next drug that you try to work on. So there is a lot of I would say, project-specific data that gets generated, and often what that means for PostEra is we’re having to be very inventive about how we try and get the most out of data even if it is not relevant.

So an example might be, let’s say you are working on a cure for COVID, and effectively the measure of potency there is, “Will my drug reduce the amount of virus in the human body?” and you generate a lot of data, as PostEra has, on how to kill viruses. Now, if you then transition to the next drug discovery project and you are trying to deal with something like Parkinson’s or schizophrenia, a human-based target rather than an external target like a virus, that data set of potency is really not helpful anymore because you’re dealing with a completely different biological mechanism that you are trying to target.

So again, this is one of the challenges in terms of trying to get the most out of very project-specific data.

[0:13:18.0] HC: How do you go about validating your models?

[0:13:20.8] AM: At the end of the day, we have to rely on real data from experiments. There are benchmarks out there that the academic community has constructed, and we find them valuable and we use them. In fact, we’ve created and contributed our own benchmarks to the academic literature. However, at the end of the day, at our core, what we care about is advancing a drug toward patients.

So the true validation is: our model predicted that this molecule would be more soluble than the last molecule, so we are going to have to run a solubility test now. The hope, obviously, is that the use of machine learning means you need to run significantly fewer experiments and reduce the redundancy rate, but the way we validate our models is we run the experiment, we get the data back, and we iterate.

[0:14:14.7] HC: How do you approach making your models less black box? This is something that I saw mentioned on your website but I guess first of all, what was the purpose of making things less black box and how do you go about it?

[0:14:25.4] AM: Yeah, this is a great question because, again, if you think about our domain compared to examples in other industries, for example text autocompletion, when you’re in Gmail and Gmail suggests a completion of your sentence, the user isn’t usually thinking about why the model has suggested that autocomplete. They just basically, you know, hit tab and choose to use it.

Whereas in my industry, the technology that we develop is being put into the hands of chemists who are responsible for a pretty large budget of money that they’re going to have to spend on experiments. So they really want to know why my model has selected A over B, just because of the cost of failure. Subsequently, that means for us, model interpretability is a core part of how we think you get adoption by your traditional scientists and chemists in using AI techniques, and I can talk to you about how we go about developing interpretable ML. But the core reason why we care about it at all is because our end users, the chemists and the scientists who are using our technology, really want to get the most conviction and value out of the model, and explainability is a big part of that.

[0:15:48.8] HC: So it’s really about that trust and the trust probably comes from providing the biochemical explanation for why this makes sense, is that fair to say?

[0:15:58.8] AM: Yeah. It’s to the extent that we can highlight to the scientists and the chemists at PostEra what salient feature of a chemical structure has resulted in the model giving a certain prediction. So for example, if you use PostEra’s technology internally, whenever we are designing molecules, we always [inaudible 0:16:24.5] ensure that there is a synthetic recipe, there is a way to make that molecule.

Often our chemists will want to score hundreds of thousands of molecules in any given go, and rather than simply just ranking them top to bottom like, “Hey, this is really easy to make and this guy at the bottom is nearly impossible to make,” what we’ve been able to do is identify which parts of the molecule are the specific hotspots that cause this particular molecule to be hard to make. That, again, is just a visual inspection that our model can spit out, and it allows the chemist to say, “Ah, okay. Yes, I understand why the algorithm’s initial ranking is particularly high or low.”

[0:17:09.7] HC: So how do you go about creating that interpretability? Is this similar to pixel attribution in the computer vision domain or is it something else?

[0:17:18.7] AM: So what you can do, for example, within our domain: we tend to treat chemistry as a graph structure by default. There are multiple other ways that we can featurize chemistry, but one of the ways that graph convolutional networks work is they are looking for context outside of the given atoms in the molecule.

So rather than just seeing oxygen or fluorine or carbon or hydrogen, it sees the broader context of how that atom is placed. Then, effectively, when you are passing it through the neural net, you can identify, by knocking out nodes within the neural net, which parts of the actual graph structure are the most relevant and pertinent for the actual prediction at the end.

So these graph convolutional neural nets have been advanced to try and grapple with chemistry, which is invariant in a lot of ways. If you rotate it, if you translate it, it shouldn’t change the output. So you have to be very careful about how you featurize the graph structure, but once you’ve got a good featurization, you can then just try and isolate which substructures within the molecule are driving the type of prediction you’re getting at the end, and then you can just highlight that to a chemist.
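
One simple way to picture the "knocking out nodes" idea is occlusion on a toy graph model. The sketch below is an assumption for illustration and not PostEra's model: `predict` is a hypothetical stand-in that does one round of neighbour aggregation and a sum readout, atoms carry a single made-up feature, and the attribution for each atom is just how much the output changes when that atom is masked.

```python
def predict(atom_features, bonds, knocked_out=None):
    """One round of neighbour aggregation followed by a sum readout.

    atom_features: one float per atom (hypothetical 1-D features).
    bonds: list of (i, j) atom index pairs.
    knocked_out: index of an atom to mask, or None for the intact molecule.
    """
    n = len(atom_features)
    feats = [0.0 if i == knocked_out else f
             for i, f in enumerate(atom_features)]
    neighbours = {i: [] for i in range(n)}
    for i, j in bonds:
        if knocked_out in (i, j):
            continue  # a masked atom neither sends nor receives messages
        neighbours[i].append(j)
        neighbours[j].append(i)
    # Each atom's state is its own feature plus the sum of its neighbours'.
    updated = [feats[i] + sum(feats[j] for j in neighbours[i])
               for i in range(n)]
    return sum(updated)


def knockout_attribution(atom_features, bonds):
    """Score each atom by how much the prediction drops when it is masked."""
    base = predict(atom_features, bonds)
    return [base - predict(atom_features, bonds, knocked_out=i)
            for i in range(len(atom_features))]


# A three-atom chain 0-1-2 with made-up features.
atom_features = [1.0, 2.0, 3.0]
bonds = [(0, 1), (1, 2)]

attributions = knockout_attribution(atom_features, bonds)
print(attributions)
```

Here the middle atom gets the largest score because masking it also cuts the messages between its two neighbours, which mirrors the point about the model seeing each atom in its broader context. The sum readout does not depend on atom ordering or coordinates, so renumbering or rotating the molecule leaves the output unchanged.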

[0:18:41.5] HC: It sounds like in a lot of ways it is similar to computer vision. You represent your structure as a graph instead of as pixels in an image, but you’re working to get information from the model about which parts of the input, which parts of the graph structure or, in the case of computer vision, which pixels, were most associated with your output. Is that right?

[0:19:03.4] AM: Yeah, I think in an analogous way, in traditional computer vision you might get an output that is classified as a cat, and then you can highlight the pointy ears rather than floppy ears or whatever it is. Yeah, I think it is a very fair analogy with what we’re trying to do in chemistry.

[0:19:20.4] HC: Is there any advice you could offer to other leaders of AI-powered startups?

[0:19:24.8] AM: I think for us, when it comes to competitive advantage, Heather, though PostEra is broadly recognized as being on the cutting edge of AI for chemistry, I think if you want to build defensibility as a company, you have to have more than just innovations on model architecture, as much as that is valuable and important and a huge part of what we do.

For me, I really think about AI as not just the innovations in machine learning but the combination of that with the really sophisticated engineering effort that goes into scaling these models up because no chemist cares if you can assess one or two molecules, we need hundreds of millions of molecules to be assessed, which is a huge engineering component that goes into making good machine learning.

There is obviously proprietary data that will protect your company from other people simply tinkering around in PyTorch long enough until they match your architecture, and then there are also these partnerships with domain experts, which again I think is really unique in the science industry. I think it contrasts with the large language model race that you are seeing now, the assumption there being that you have access to sufficient compute.

Google and Facebook and OpenAI and whoever else is working on these large language models, if everybody is training on the corpus of the Internet, you should all eventually converge on similar models. So you have to have something beyond that, whether it’s proprietary data or domain expertise or even just very high-class engineering, that differentiates you and gives you a competitive edge as an AI company in your field.

[0:21:03.8] HC: Finally, where do you see the impact of PostEra in three to five years?

[0:21:07.7] AM: Well, for us in three to five years, we really hope that we have shown a genuine proof case of bringing a drug to patients, or maybe at that stage it will likely be in late-stage clinical trials and be able to show the impact of AI on the timeline, on the quality of the drug that we have developed. Again, just for reference, your typical drug today is taking anywhere between kind of eight to ten years to come to market and obviously, we want to really accelerate that.

Even if you can cut several years off that timeline, it is a huge benefit to patients, and so that is our vision. Our vision is to really pioneer what a modern 21st-century AI-first biotech looks like and demonstrate its utility in the field by bringing cures to patients and putting drugs in clinical trials in a much faster and more efficient manner than the traditional paradigm.

[0:21:58.6] HC: This has been great, Aaron. Your team at PostEra is doing some really interesting work for drug discovery. I expect that the insights you shared will be valuable to other AI companies. Where can people find out more about you online?

[0:22:10.1] AM: Well, we do have an active website at postera.ai. There are the traditional LinkedIn and Twitter pages as well. We’re not cool enough to have a TikTok or Snapchat, so just check out LinkedIn and Twitter. We do have a newsletter as well, it comes out every couple of months. Yeah, I am always very open and happy if people reach out to ask further questions.

[0:22:30.9] HC: Perfect. Thanks for joining me today.

[0:22:33.2] AM: It’s great, thank you, Heather.

[0:22:34.4] HC: All right everyone, thanks for listening. I’m Heather Couture and I hope you join me again next time for Impact AI.

[END OF INTERVIEW]

[0:22:45.2] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share it with a friend. And if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]