In today’s episode, I sit down with David Healey, the Vice President of Data Science at Enveda Biosciences, to discuss searching for new therapeutics in nature. Enveda Biosciences is a cutting-edge biotech company revolutionizing drug discovery processes using automation and machine learning. It has a unique approach involving mapping the vast unknown chemical space in nature to identify potential therapeutics. David is a data scientist with a knack for machine learning in life sciences. He has expertise in deep neural networks, computer vision, natural language, and graph models, including a solid background in drug discovery, cheminformatics, metabolomics, and experimental biology.
In our conversation, we talk about the role of machine learning in drug discovery and the importance of developing treatments. We discuss using big data for drug discovery, the challenges and opportunities of the field, the hurdles of working with mass spectrometry data, and Enveda Biosciences’s approach to research. Hear how Enveda Biosciences finds the best talent, why drug discovery is an exciting field, and much more.
- David’s background leading up to his role at Enveda Biosciences.
- What Enveda Biosciences focuses on and their approach to drug discovery.
- Learn about mass spectrometry, tandem mass spectrometry, and chromatography.
- The role of machine learning in biosciences and how it is used with mass spectrometry.
- How Enveda Biosciences applies machine learning differently.
- He explains the challenges encountered when working with mass spectrometry data.
- Find out the value of large language models and other advances in the field.
- We unpack the niche nature of the work Enveda Biosciences is doing.
- Overview of the different types of experts that are working at Enveda Biosciences.
- David shares what recruiting approaches have been most successful for the company.
- Advice that David has for other AI-powered startups.
- He tells us about the impact he wants Enveda Biosciences to have in the future.
[00:00:03] HC: Welcome to Impact AI. Brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.
[00:00:33] HC: Today, I’m joined by guest David Healey, Vice President of Data Science at Enveda Biosciences, to talk about searching for new therapeutics in nature. David, welcome to the show.
[00:00:44] DH: Thanks, Heather. I’m happy to be here.
[00:00:45] HC: David, could you share a bit about your background and how that led you to Enveda?
[00:00:49] DH: Yeah, absolutely. So, by background, I was trained as a biochemist and a biologist. I came up in the lab and worked with a lot of microbes and a lot of lab techniques. During my Ph.D. is when I really became interested in joining computation with lab work. So, my Ph.D. was at MIT in the biophysics group there. It was around 2014, and ImageNet was very big, as you know, and there was starting to be a lot of hype around machine learning (ML) and a lot of people thinking about how to use neural networks in particular. I really sort of got caught up in that at the end of my Ph.D. And after my Ph.D., I joined the sort of wave of people that were joining tech to do applied ML and to really, really test the limits of what that technology could do.
So, I was in – I spent a number of years in the tech industry doing that. At the time, it was not biology, and there was not actually at the time, a lot of demand for joining biology with machine learning, in particular. That all kind of changed for me when Recursion Pharmaceuticals started. Recursion Pharmaceuticals is one of the first kind of cohorts of tech bio companies that started with the premise of let’s gather a ton, a ton of data, specifically to use it for machine learning, and specifically, to tailor that data for the algorithms and see what we can get out of that, for drug discovery purposes.
I joined Recursion, very early, as one of their earliest data scientists. Over the next three and a half years or so, got a really front-seat view of kind of the rise of thinking about how to use data, big data, and machine learning to better drug discovery, and to find therapeutics faster. I was there for three years. I came away with a lot of impressions about the importance of being able to control the data generation process, what is important in drug discovery, and the problems that really need to be solved there.
After that experience, I spent some time sort of looking for what the next big application is. I taught a survey course at the university in deep learning for biotech, and I spent a lot of time consulting and talking to VCs and founders of companies, and looking for sort of the next big thing. The next big place that machine learning could really make a big difference in drug discovery.
Around that time is when I became reintroduced to Viswa. Viswa is the founder and CEO of Enveda. He had been at Recursion. I’d known him at Recursion in the early days, and he had left to start a company around looking for drugs in nature. I started talking to him a lot about that, advising, and consulting. And he had this framework of, let’s be very systematic about it. Let’s use mass spectrometry in particular, to figure out what the active molecules are in nature, and catalog them all. That really spoke to me. When I got the data, I just saw that it was such a good application for machine learning. In particular, it really seemed like a once-in-a-lifetime kind of opportunity to make a big difference, right at the crux of something that’s very important to all drug discovery. So, I joined Enveda. That was a little over two years ago, to build out the data science and machine learning team there.
[00:04:42] HC: So, I guess that focus on finding drugs in nature and using mass spectrometry to do so, is the focus of Enveda. What does Enveda do? Is that the main focus or are there other aspects to it? Why is this important in developing better treatments?
[00:04:58] DH: Yes, fundamentally, what we’re doing at Enveda is looking for active molecules in nature. What that involves is trying to learn what the molecules are that nature produces, and what they do. Our first application is drug discovery for that. Nature is a fantastic source of drugs. A lot of the drugs that are made, almost a third to a half of the drugs that are approved by the FDA each year, either are themselves natural molecules or were derived from natural molecules. So, it’s always been a big source of therapeutics.
And that makes a lot of sense because nature has spent thousands and millions of years evolving its chemistry to interact with biology. So, it’s a very good sort of starting substrate. But on the other hand, most of nature’s chemistry is still unknown. It has not, historically, been easy to tell what molecules are even in nature, what their structures are, and definitely not what their functions are.
So, what you have is this huge sort of like frontier, where all these drugs that have been discovered in the past, things like metformin, or aspirin, CBD, artemisinin, just all these drugs have come from basically a tiny, tiny, tiny sliver of the known space of the chemistry that nature takes. There are hundreds of thousands of species of plants alone, not to mention bacteria, not to mention fungus, mammals, sea sponges, and things like these, and they’re all unique, and they’re all making their own sort of battery of molecules to do things in nature. Historically, it’s been very difficult to access that, to figure out what those are. It’s not like gene sequencing, for example, where now, if I have a plant, or I have a tree or something, it’s fairly simple to kind of sequence the entire genome. I mean, that’s been developed over the last several decades.
But we’re at that point now, where you could kind of know what all the genes are. You can kind of know what all the RNA (mRNA) are and what the proteins are. But the chemistry, and by the chemistry, I mean, like the small molecule, organic molecules, those are sometimes called metabolites. Those are really unknown still. We know that there are tens of thousands of them in every sort of biological sample. But we don’t know what they are, for the most part. Mass spectrometry is kind of the way to get to that. It’s the best way to get at sort of understanding the composition, chemically, of nature. So, that’s where we’ve been focusing.
[00:07:58] HC: For those who aren’t familiar, what is mass spectrometry?
[00:08:00] DH: Yes. So, mass spectrometry is a way to measure the mass of things, basically, fundamentally, really small things. What happens is, usually, it’s coupled with what’s called chromatography, which is that usually you’ll take a sample, and the sample may have 20,000 different molecules in it, say. And the chromatography sort of separates them out a little bit along a gradient that has to do with the chemistry of the molecules.
So, you might say this molecule is very polar. It will come into the machine early, and this molecule is nonpolar, so it will come into the machine late. There’s this sort of separation that comes before the machine that helps you sort of distinguish the molecules from each other. And then, in the machine, what the mass spectrometer does is ionize each molecule and then measure the mass of it in some way. You’ll be able to say, “Okay, I see there are 10,000 things here, and I know what their masses are. This one has a mass of 200 daltons. This one has 300 daltons.”
And then the second thing it will do, and this is called tandem mass spectrometry, is it will take where it has detected molecules, it will isolate that little window of parameters, and it will send those molecules in to be fragmented. The molecules will be pushed through a gas, they’ll break all apart into pieces, and then the machine will measure the mass of the pieces. So, basically, this is the data that you have to tell what the molecules are: you get a readout that says, “This is a list of the masses of the fragments that came off of this thing, and this is their relative abundance.”
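As a rough illustration of the readout described above, a fragmentation spectrum can be held as a list of fragment masses with intensities, scaled so the tallest peak reads 100. This is a minimal sketch: the `Peak` class, the `relative_abundance` helper, and all the numbers are hypothetical, and in practice instruments report mass-to-charge ratio (m/z) rather than mass itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # fragment mass-to-charge ratio, in daltons per charge
    intensity: float  # detector signal for that fragment


def relative_abundance(peaks):
    """Return peaks sorted by m/z with intensities scaled so the tallest is 100."""
    if not peaks:
        return []
    base = max(p.intensity for p in peaks)
    ordered = sorted(peaks, key=lambda p: p.mz)
    return [Peak(p.mz, 100.0 * p.intensity / base) for p in ordered]


# Hypothetical raw fragment readout for one isolated molecule (made-up values).
raw = [Peak(145.05, 1200.0), Peak(91.05, 300.0), Peak(255.10, 600.0)]
spectrum = relative_abundance(raw)

for p in spectrum:
    print(f"{p.mz:8.2f}  {p.intensity:6.1f}")
```

Everything downstream in the conversation, from library matching to language-model-style training, starts from a list like this one.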
[00:09:49] HC: So, data, I suspect, as you’ve been talking about, is probably the opportunity for machine learning. So, what role does machine learning play in your technology? How do you use it with mass spectrometry data?
[00:09:59] DH: Right. This is really an exciting thing. Historically, it’s been very difficult to interpret those. I mean, this is one reason why most of nature’s chemistry is unknown: these fragmentation spectra are just lists of fragments and intensities. But there are patterns to them, and there’s structure to them. And the structure kind of resembles natural language, in that a specific peak or a specific mass, the meaning of that thing, in terms of what structure it actually represents, is only meaningful in the context of the other peaks that are there.
We’re using machine learning to sort of interpret the language of the mass spectrometry and in particular, to treat it like a natural language problem or like a machine translation problem. We use those same kinds of transformer models and self-attention to learn what the meaning of the mass spec fragments is and how to interpret that into information that is useful for understanding whether it would be a good drug or not. So, that could be – primarily, that’s the chemical structure. What we want to do is know the chemical structure of the molecules, ultimately.
But that could also be relevant chemical information for medicinal chemistry. Like how polar is the molecule? How many rings does it have? What kind of atoms are in it? That sort of thing. And all of that is kind of learnable from the mass spec fragmentation. It’s a very new way of thinking about interpreting mass spectrometry. Historically, people have treated those spectra as kind of like a fingerprint in a fingerprint database, where what you’re looking for is to identify something that is already in a reference library somewhere. But we think that machine learning, in particular NLP models and the general machine learning that has been popularized very recently, that same technology can learn to interpret the language of chemistry and to output it in the language of chemistry. What we do is we put the mass spectra in as an input or condition. And as an output, we ask the machine to write the molecule’s structure, also as language, what’s called a SMILES string, which is a string of tokens that represents the structure of a molecule. So, we’re treating it as a sort of machine translation problem.
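To make the "string of tokens" idea concrete, here is a small sketch that splits a SMILES string into tokens a translation-style model could emit one at a time. The regular expression is a simplified assumption, not Enveda’s tokenizer: it keeps bracketed atoms and the two-letter halogens together and treats every other character as its own token. Aspirin, mentioned earlier in the conversation, serves as the example.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration):
# bracket atoms first, then two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize_smiles(aspirin)
print(tokens)

# Bracket atoms and two-letter elements stay whole.
print(tokenize_smiles("C[NH3+]Cl"))
```

Joining the tokens back together reproduces the original string, which is what lets a model write a structure token by token the way a translation model writes a sentence.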
[00:12:25] HC: So, this is, I suspect, different than your typical supervised learning problem because you might not know what that structure is that you’re trying to output or?
[00:12:36] DH: Yes. That’s a great question. It’s largely unsupervised, but part of it is supervised. The reason why part of it can be supervised is that mass spectrometry has been around for a long time, and there have been a lot of – over decades, a lot of people have gathered mass spectra from known molecule structures and put those into databases. You can also generate more of those, which is what we do in-house, which is you can buy purified molecules, and run them through the mass spectrometer, and see what the spectrum is. Then you have a labeled pair. You have the spectrum, and you have the structure of the molecule.
A problem in the world, though, is that that has only really been done for about 25,000 natural products in total. And there are probably hundreds of millions of molecules in nature. So, the space that has been labeled that way is very small. We do have to get a lot of benefit from sort of self-supervised training, or weak labels, or that sort of thing where we’re taking mass spectra – there are databases of mass spectra now, where nobody knows what those spectra represent exactly in terms of the chemical structures, but there are billions of spectra.
If you think about that, similarly to the kinds of language models that have been developed, where if you’re translating between two languages, the first thing you might want to start with is just feeding every instance of that language, every sentence you can find in that language, through a language model, and learn the grammar and the structure of the language first. And then, you use the supervised learning, the labeled data, to learn the mapping between them.
So, we do a very similar thing as well, which is we can take the billions of spectra, unlabeled spectra that exist, and use very similar kinds of self-supervision, and weak supervision, to learn the structure of the spectra. And we can also potentially take all of the billions of known molecule structures and do the same things, or try to learn the language of the chemical structures and then use that relatively small number of labeled data to map between them.
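One common form of self-supervision borrowed from language modeling is to hide part of the input and ask the model to fill it back in. The sketch below builds such a training example from an unlabeled spectrum. It is an assumption about what "similar kinds of self-supervision" could look like, not a description of Enveda’s pipeline, and `mask_peaks`, `MASK`, and the peak values are hypothetical.

```python
import random

MASK = None  # placeholder that stands in for a hidden peak


def mask_peaks(peaks, mask_fraction=0.15, seed=0):
    """Hide a random subset of peaks; return (masked_input, targets_by_index)."""
    if not peaks:
        return [], {}
    rng = random.Random(seed)
    # Always hide at least one peak, never more than the spectrum contains.
    n_mask = min(len(peaks), max(1, round(mask_fraction * len(peaks))))
    hidden = set(rng.sample(range(len(peaks)), n_mask))
    masked = [MASK if i in hidden else p for i, p in enumerate(peaks)]
    targets = {i: peaks[i] for i in hidden}
    return masked, targets


# Hypothetical unlabeled spectrum as (m/z, relative abundance) pairs.
unlabeled = [(65.04, 10.0), (77.04, 15.0), (91.05, 25.0), (105.07, 5.0),
             (119.05, 40.0), (145.05, 100.0), (201.09, 30.0), (255.10, 50.0)]

masked_input, targets = mask_peaks(unlabeled, mask_fraction=0.25)
print(masked_input)
print(targets)
```

No structure labels are needed to build these examples, which is why billions of unlabeled spectra are useful before the small labeled set comes into play.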
[00:14:53] HC: What kinds of challenges do you encounter in working with mass spectrometry data?
[00:14:58] DH: Yes. Mass spectrometry data is super interesting. It’s challenging in many ways. One way that it’s challenging is that it’s very discontinuous. So, if I take some molecule, some drug maybe, and I look at its mass spectrum, and then if I just change one little thing, one atom of that molecule, maybe, or put like a slightly different functional group somewhere, all of a sudden, that changes the shape of the molecule, potentially, that changes the energetics of the molecule, the energetics of the bonds. So, when it goes through the fragmentation process, it may come out with completely different fragments. That has historically been a very difficult challenge with working with mass spectrometry, which is that the similarity between mass spectra does not correspond very well to the similarity between the chemical structures that generate them. So, we do a lot of work on learning better representations of the spectra so that we can compare them with each other, in a way that would better approximate the actual similarity of the molecules. So, that’s one way.
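The mismatch described above is easy to see with the traditional fingerprint-style comparison, where spectra are binned by mass and compared by cosine similarity. This is a minimal sketch with made-up numbers: the second spectrum imagines a close analog whose fragments all shifted by about 14 daltons, which is an illustrative assumption and not measured data.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) pairs into fixed-width mass bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra: the analog differs by one small group, but every
# fragment lands in a different bin.
molecule = bin_spectrum([(91.05, 25.0), (145.05, 100.0), (255.10, 50.0)])
analog = bin_spectrum([(105.07, 25.0), (159.07, 100.0), (269.12, 50.0)])

print(cosine_similarity(molecule, molecule))
print(cosine_similarity(molecule, analog))
```

With no shared bins, the score for the analog drops to zero even though the imagined structures are nearly identical, which is the gap that learned representations are meant to close.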
Scale is another thing. Mass spec has come a long way, even in just the last few years. There’s a lot more data being gathered, and a lot of the algorithms that have historically been used to process them are just not scalable enough to use when you’re trying to profile all of nature, for example. Or hundreds of thousands of organisms that each have tens of thousands of molecules in them. So, we’ve done a lot of work on being able to make this kind of data scalable, rework the algorithms, make it queryable, joinable, figure out whether the molecules are the same or different across different samples, that sort of thing.
In learning natural language models, and adapting those to mass spectra, that’s challenging as well, because mass spectra are not a language. They’re not ordered, like language. So, the same kinds of positional encodings don’t really make sense in the same way. Also, it’s not really tokenized in the same way as language. Masses can be very precise. These machines can tell you the mass of a molecule down to a thousandth or sometimes a ten-thousandth of the mass of a neutron. And those small, small, very precise numbers are very meaningful. You don’t really find that in natural language. So, adapting the language models to interpret mass spectra is also a challenge.
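One way to picture the tokenization problem is to encode each peak by its measured mass instead of by its position in a sequence. The sketch below maps a mass to sine and cosine features at wavelengths ranging from a thousandth of a dalton up to a thousand daltons, so that both coarse and very fine differences show up. This is an assumption chosen for illustration, not something described in the interview, and `encode_mass` and its parameters are hypothetical.

```python
import math


def encode_mass(mass, dim=8, min_wavelength=0.001, max_wavelength=1000.0):
    """Map a mass to `dim` sine/cosine features (dim is assumed to be even)."""
    n_freq = dim // 2
    features = []
    for i in range(n_freq):
        # Wavelengths are spaced geometrically from finest to coarsest.
        frac = i / (n_freq - 1) if n_freq > 1 else 0.0
        wavelength = min_wavelength * (max_wavelength / min_wavelength) ** frac
        angle = 2.0 * math.pi * mass / wavelength
        features.extend((math.sin(angle), math.cos(angle)))
    return features


# Two hypothetical fragment masses that differ by one ten-thousandth of a dalton.
a = encode_mass(145.0501)
b = encode_mass(145.0502)
print(max(abs(x - y) for x, y in zip(a, b)))
```

Because the features depend only on the mass value, shuffling the peaks in a spectrum does not change how any one of them is encoded, which fits data that has no natural word order.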
[00:17:36] HC: It sounds like there’s a number of challenges there, and nuances that you really have to understand the data in order to know what to do with it.
[00:17:42] DH: Yeah, exactly.
HC: Are there any specific technological advancements that made it possible to build this type of model now, when it wouldn’t have been feasible even a few years ago? Maybe the advancements in large language models, or maybe there’s something else that has made this possible?
[00:17:58] DH: Yes, 100%. One hundred percent in large language models (LLMs). This really is a great way to interpret the mass spectrum because you can learn the same kinds of contextual embeddings that language has. So, being able to piggyback off of all of the work that has been done over the last few years in natural language processing has been fantastic. The other thing is the machines. The machines have gotten a lot better at being able to very precisely measure a lot of molecules and distinguish them from each other. That’s really important for our particular case, because in our case, we’re taking very complex mixtures, and each mixture may have tens of thousands of molecules in it. And you need the machine to be able to distinguish those from each other. And that is an advance that has really taken off just in the last few years with a measurement of what’s called ion mobility, that allows you to sort of make inferences not just about the mass, but about the kind of shape, the three-dimensional shape of the molecule.
So, those two things have really enabled this. A third thing is the advent of large databases of mass spectra, unlabeled spectra. That didn’t exist just a few years ago. Very recently, the community of people who use this kind of data has started compiling large databases that contain billions of mass spectra. And that’s what you need, right? That’s what you need to get these big, big language models and transformers to really realize their potential, is just an enormous amount of data, and that is a very recent development as well.
[00:19:42] HC: Yeah. I think you see that in a number of fields, especially when there’s some kind of unique data modality. If there’s a way to create public datasets so that any of those individuals or groups working on it have more to work with, that’s one of the things that can transform the capabilities of machine learning on that data.
[00:20:02] DH: Yeah, it really can.
[00:20:03] HC: So, I imagine that your background in biochemistry makes you a very good machine learning and data science person to be tackling this. But what about other machine learning engineers and developers that are working on your team? If they don’t have a background in biochemistry, how are they able to adapt? How are they able to interact with those who do know more about the data, and how it is collected, and what it means?
[00:20:29] DH: Oh, that is a great question. Yeah, this is a really, really niche field. It’s one of the most niche fields that I’ve ever encountered in the sense that the number of people who are familiar with both mass spectrometry, especially mass spectra for small molecules, and machine learning is just very, very small. We do seek out those people very actively. But you’re right that most people will have one or the other. So, we built a very interdisciplinary team, and we have people with strong backgrounds in machine learning that have come from big, big tech companies and worked on natural language and machine translation at Amazon, for example. We also have people who are very top in their field at understanding the mass spectra themselves.
It’s a challenge, but we build the team to sort of force those people to have very close contact with each other. The other thing is that we have a lot of medicinal chemists. We’re a drug discovery company. So, we have a whole arm of the company that is developing the drugs for the clinic. And those people are responsible for choosing, actually, which molecules we go after. So, we’ve had to develop a very tight feedback loop where the machine learning team can develop models, and interpret them. And then, we can actually go into the plants, pick specific examples, and have the chemists isolate them and verify their structure via what’s called NMR, which is a fairly labor-intensive process. But you can do that for individual molecules.
So, that’s another way in which we’ve sort of brought the domain experts, the people that know the chemistry and the mass spectrometry, together with the domain experts in machine learning, and data science, and mass spectrometry data science, in particular.
[00:22:29] HC: Hiring for machine learning itself is quite challenging right now due to the high demand for professionals in this field. What approaches to recruiting and onboarding have been most successful for your team?
[00:22:40] DH: Boy, that is a great question. That is a great point. As I mentioned before, one thing we do is sort of actively look for the people who are overlapping between mass spectrometry, natural products, and machine learning. Because it’s such a niche field, it’s relatively easy to find those people. So, we do that. The other thing is that in terms of attracting ML talent, it really has helped to have an extremely unique use case. A lot of the people that we recruit have been scientists before, and then joined tech, and are still interested in science and finding cases like this, where it’s structured like natural language.
So, if you’re a natural language processing (NLP) expert, you already kind of know the things that you need to do, and you should try with the data. But it’s on a completely different application, and the application is very high impact, right? This is striking at the heart of drug discovery in a way that could be very influential, and you’re working directly on things that turn into new drugs. Not very many people are doing it. So, that’s also appealing, I think, to machine learning researchers, which is that you can have an impact in the field in terms of being one of the very few people to sort of develop the key algorithms for the first time. It’s just a very interesting problem, fundamentally, from a data science perspective. I would say that we’ve benefited from highlighting those aspects of the problem.
[00:24:18] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[00:24:22] DH: Yes. I mean, I think to some extent, the advice is going to be similar across a lot of different kinds of fields. One piece of advice is that startups are so reliant on the first few hires, right? So, really being deliberate about getting the best talent in the door at the very beginning, I think, is really crucial.
Another piece of advice might be, there’s an interesting thing about ML research, I think, which is that it’s very easy, two years later, to end up in this kind of path-dependent place where you are not quite sure why you’re doing things the way you’re doing them, and it probably was a decision that somebody made a long time ago or an experiment that was run that had some results. A good piece of advice that I would give is to make sure to document the decision-making along the research path, the path to getting the machine learning where it is, so that it’s very clear what decisions were made and how you made them.
[00:25:38] HC: Yes. I guess, especially with a relatively high turnover of machine learning talent in this field, some of them only stay in the same role for a year or 18 months. So, two or three years down the line, there might not be somebody to remember why that experiment two years ago turned out that way and why that influenced the way you do things now.
[00:25:59] DH: Yeah, 100%. I think in machine learning, and in research, anyway, R&D, we’ve lagged behind software development, for example. It’s relatively easy to sort of trace the path of software with Git and various tools. I think there are a lot fewer tools and a lot fewer sorts of established best practices for having that same history with a research program.
[00:26:24] HC: It’s newer, so the tools are still developing, and some of the tools are being used, but they’re not being used consistently enough yet.
[00:26:34] DH: Yeah, that’s right.
[00:26:34] HC: Yes, I definitely agree with that. Finally, where do you see the impact of Enveda in three to five years?
[00:26:40] DH: We’re at the place now at Enveda, where we’ve really developed a lot of capabilities, and we’re right at the point of massively upscaling in terms of numbers. Our goal is to kind of understand the map of the chemistry of nature as a whole, and that’s a very ambitious goal.
So, over the next three to five years, we would hope to kind of have a handle on a large swath of nature. And by handle, I mean, not necessarily knowing the exact chemical structure of every single molecule in every single organism. But sort of know what it looks like in terms of the kinds of families that the organisms have. And then, we also pair this with a large automated high-throughput screening of bioassays.
So, we’d like to in the next three to five years, kind of have like a general sense, across at least the plant kingdom, of what kinds of families of molecules there are, which organisms are making them, and kind of what they do, to have that profile. As well, I think, from a drug discovery standpoint – drug discovery timelines are very long. So, in three to five years, you’re not going to get a drug into a clinic; that takes ten years. But we would like to have multiple programs. And we’re well on our way, I think, to this, having multiple new natural products in clinical trials.
[00:28:18] HC: That’s exciting. I look forward to following you and where things go with Enveda. This has been great. David, your team at Enveda is doing some really interesting work in drug discovery. I expect that the insights you’ve shared will be valuable to other AI companies. Where can people find out more about you online?
[00:28:32] DH: Yeah, thank you. Online, we are at envedabio.com. That’s our website, and you can find links to recent papers, and blog posts describing our approaches, as well as descriptions of the pipelines and the diseases we’re working on there.
[00:28:50] HC: Perfect. Thanks for joining me today.
[00:28:52] DH: Yes, thanks so much. It’s been a pleasure.
[00:28:54] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you join me again next time for Impact AI.
[00:29:03] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend, and if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.