New research and technology are radically transforming cancer treatment, and, today, we find out how. I am joined by Genialis CEO and Co-Founder, Rafael Rosengarten, to discuss his company’s mission to “outsmart cancer.” Genialis is revolutionizing cancer care by developing AI models that decode the biology behind different types of cancer and identify the most effective therapies for individual patients.
In this episode, we discover how Genialis’ innovative approach of turning RNA sequencing data into tumor phenotype classification is remolding the landscape of precision medicine. Rafael explains the company’s methods of handling the high dimensionality and sparseness of sequencing data while addressing bias issues, filling us in on why they use shallower artificial intelligence architectures for algorithm training and more. Join us as we explore the cutting-edge world of personalized cancer treatments that are shaping the future of oncology.
- Genialis CEO and Co-Founder, Rafael Rosengarten’s background; what led him to Genialis.
- How Genialis applies machine learning to help patients find personalized cancer treatments.
- Their collaboration with drug and diagnostics companies to deploy their models.
- How the models use RNA sequencing data to predict and classify tumor phenotypes.
- The challenges encountered when training models with sequencing data.
- Rafael defines sequencing data.
- Why RNA sequencing for clinical applications is considered cutting-edge.
- Genialis’ methods for handling the high dimensionality and sparseness of sequencing data.
- The various sources of bias and how they have addressed these issues.
- Why they use shallower artificial intelligence architectures for algorithm training.
- How the FDA’s regulatory process affects how Genialis develops and validates its models.
- The benefits the Genialis team has seen from publishing research articles.
- How they measure the impact of their technology.
- Rafael’s advice to other leaders of AI-powered startups.
- The hyper-commoditization of AI technologies.
- Rafael predicts the future impact of Genialis and shares his goals for the company.
“[Genialis applies] machine learning to try to help patients find the best drugs for their disease, to help realize the promise of precision medicine.” — Rafael Rosengarten
“The models are learning the fundamental biological nature of the disease. From that, we can extrapolate what the best intervention will be.” — Rafael Rosengarten
“Not all genes have detectable expression at once. Certainly, not all genes are going to be informative. We've built really beautiful software that allows us to aggregate these kinds of sequencing data, to process them in a very uniform way.” — Rafael Rosengarten
“It really is a pan-cancer model, even though it was trained on a data set that was just gastric cancer. And it works on RNA sequencing of all different chemistries, even though it was trained on microarray” — Rafael Rosengarten
“The key with algorithm training, of course, is to try to avoid what's known as overfitting.” — Rafael Rosengarten
“Every phenotype that our model predicts whether it's phenotype A, B, C, or D, has a different therapeutic hypothesis.” — Rafael Rosengarten
“AI technologies right now are becoming hyper-commoditized.” — Rafael Rosengarten
“It is still possible for small companies to come up with really innovative algorithms — but for the most part, it really matters how you deploy these technologies.” — Rafael Rosengarten
Rafael Rosengarten on LinkedIn
Rafael Rosengarten on Twitter
Genialis on LinkedIn
Genialis on Twitter
Talking Precision Medicine Podcast
[0:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine learning powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.
[0:00:33] HC: Today, I’m joined by guest, Rafael Rosengarten, CEO and co-founder of Genialis, to talk about targeted cancer treatments. Rafael, welcome to the show.
[0:00:44] RR: Thanks for having me, Heather.
[0:00:46] HC: Rafael, could you share a bit about your background and how that led you to create Genialis?
[0:00:50] RR: Sure, I’d be delighted to. My background is as a life scientist. I actually joke and say that I’m a former jellyfishologist, because in graduate school I studied evolutionary genetics and was really interested in marine critters and marine systems. Really, I just wanted to scuba dive. But throughout my scientific career and my academic training, I kept getting more and more interested in mechanisms: the why and the how of how biology worked. That led me more towards genetics and genes and molecules. Then eventually, I gave up scuba diving for doing molecular biology in a lab with clear plastic tubes.
I moved even deeper into things like data analysis, really crunching data to understand how biological systems worked. This led me to spend some time in synthetic biology where we were designing biological systems. There’s a saying that if you can design it, and it works, then you really understood it. It also led me to start to learn more about data science. Now, I’m not a data scientist practitioner but I did spend quite some time training around the edges so that I can have meaningful conversations with data science collaborators.
When I was a postdoctoral fellow at Baylor College of Medicine in Houston, Texas, back in the sort of 2011 to ‘15 period, I met a handful of really great data science collaborators from a Slovenian lab group. One of these Slovenian data scientists ended up becoming my co-founder. He had actually just started Genialis on the strength of some technology from his lab, and was looking for a US co-founder from a life science background who could help point the technology towards important problems. That’s kind of the origin story of the company, back in the 2015, ‘16 time period.
[0:02:35] HC: What does Genialis do, and why is this important in treating cancer?
[0:02:40] RR: That’s a great question. Again, the challenge was to figure out what is the most impactful problem we can solve with these emerging machine learning technologies that we were developing. That’s what Genialis does. We apply machine learning to try to help patients find the best drugs for their disease, to help really realize the promise of precision medicine. The way we do this is by building models that understand fundamental disease biology. And you mentioned cancer, we focus primarily on oncology. Not exclusively, but most of our work is there. We build models that understand the driver biologies of different cancers or behind different therapeutic opportunities.
We use these models to predict, from patients’ sequencing data, from data that derive from individual patients’ tumors, what kind of drug, or what drug specifically, is most likely to benefit that patient. We deploy our models in a couple of ways. One is with drug companies that are developing new medicines. Another is with diagnostics companies that are building diagnostic tests that would help clinicians decide on a particular intervention.
[0:03:49] HC: These are supervised models, the input is sequencing data, and the output is something along the lines of – is this treatment appropriate or not for this individual?
[0:03:59] RR: Something like that. The input is sequencing data. The model will learn the relationship and the interaction between some number of genes. We like to use RNA sequencing data, rather than DNA seq, at least for now. We think it’s a more informative analyte. It gets you closer to the phenotype and provides a lot more variety of information. The models will output a classification. Does this patient’s tumor belong to phenotype A, B, C, or D, for example.
Based on that tumor phenotype, we then have a hypothesis for each phenotype of what therapy or class of therapy would be most useful. The model is not learning directly patient response to a particular drug. The models are learning the fundamental biological nature of the disease. From that, we can extrapolate what the best intervention will be.
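Illustrative sketch (not from the conversation): the two-step logic Rafael describes, where a model first assigns a tumor to a phenotype and a separate, human-curated hypothesis then links that phenotype to a therapy class, might look roughly like the following. The phenotype labels, the raw scores, and the therapy names are all hypothetical placeholders; nothing here reflects Genialis’ actual models.

```python
import math

# Hypothetical mapping from phenotype to therapy hypothesis (placeholders only).
THERAPY_HYPOTHESIS = {
    "A": "therapy class X",
    "B": "therapy class Y",
    "C": "therapy class Z",
    "D": "therapy class W",
}

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_tumor(scores, labels=("A", "B", "C", "D")):
    """Return the most probable phenotype and its probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

def suggest_therapy(scores):
    """Step 1: classify the tumor. Step 2: look up the hypothesis for that class."""
    phenotype, confidence = classify_tumor(scores)
    return phenotype, confidence, THERAPY_HYPOTHESIS[phenotype]

# Made-up model scores for one tumor sample.
phenotype, confidence, therapy = suggest_therapy([0.2, 2.1, -0.5, 0.3])
print(phenotype, round(confidence, 2), therapy)
```

The point of the separation is that the classifier never sees drug response: the lookup table is where the therapeutic hypothesis lives, so it can be tested or revised without retraining the model.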
[0:04:53] HC: These biological phenotypes, are these things that oncologists and other specialists have previously come up with and determined to be important in determining treatment, or where did they come from?
[0:05:06] RR: Sometimes yes and sometimes no. In one of our lead models, the four phenotypes the model predicts have been described in the literature in various ways. Not always using the same terminology, but it’s appreciated, for example, that some tumors are immune hot, and some tumors are immune cold. It’s appreciated that some tumors are highly angiogenic, meaning they have lots of blood vessels in them, and some do not. So these kinds of phenotypes are known. But the way the model determines them, as, you know, kind of the consequence of intersecting biologies, is actually pretty unique.
In other cases, we are coming up with categories or phenotypic groupings that require us to dig in and figure out, how would we describe this phenotype? Because it’s not obvious, or it’s not intuitive, or it hasn’t already been sort of captured in the literature?
[0:05:59] HC: What kinds of challenges do you encounter in working with the sequencing data and training models based off of it?
[0:06:05] RR: The biggest challenge right up front is pulling together the most appropriate datasets, right? So we need datasets that we use for feature discovery. That’s the kind of beginning part where we ask the questions: what biologies are most important to model for a given problem, and what molecules, what genes are going to best represent those biologies. We have to have a bunch of data that we use for exploring those questions, typically in a very data-driven way using bioinformatics and other systems biology tools.
Then once we’ve come up with some candidate gene signatures that represent our biologies of interest, then we need to do the machine learning, and there, obviously, you need training data, right? So pulling together a coherent training data set that simultaneously is as free of bias as possible. So in other words, it represents your intent-to-treat population around the world as best as possible. But also is of sufficient size to train your model. But also, where the patients have as few kinds of confounders in their clinical history as possible.
It’s actually a rather tall challenge to pull together a substantially-sized patient data set where the patients come from all walks of life and corners of the world, but also have enough in common that the data are meaningful together. We’ve built a lot of technology, a lot of our kind of secret sauce is around how we do that, how we harmonize that kind of training dataset. Then, once we’ve trained a model, we need independent data to validate the model. This is typically where our partnerships come in. We work directly with drug companies that are doing drug development, so we can actually validate a biological model on sort of first-in-patient data. That’s really meaningful because there won’t be enough first-in-patient data to train a model on. If the drug’s not on the market yet, maybe only 15, or 100, or 150 patients have even seen that drug yet. But we can use those precious clinical data for validation, and test our various phenotype-to-therapy hypotheses.
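Illustrative sketch (not from the conversation): the three roles for data mentioned above (feature discovery, training, independent validation) only work if no patient leaks from one role into another. Below is a minimal way to enforce that, assuming a hypothetical `sample_to_patient` mapping and arbitrary split fractions. In the workflow Rafael describes, the validation data typically come from a partner’s separate clinical cohort rather than from a random split, so this only demonstrates the no-overlap property.

```python
import random

def split_by_patient(sample_to_patient, fractions=(0.3, 0.5, 0.2), seed=0):
    """Assign whole patients (never individual samples) to one of three roles."""
    patients = sorted(set(sample_to_patient.values()))
    random.Random(seed).shuffle(patients)  # seeded, so the split is reproducible

    cut1 = int(len(patients) * fractions[0])
    cut2 = cut1 + int(len(patients) * fractions[1])

    role_of = {}
    for i, patient in enumerate(patients):
        if i < cut1:
            role_of[patient] = "discovery"
        elif i < cut2:
            role_of[patient] = "training"
        else:
            role_of[patient] = "validation"

    splits = {"discovery": [], "training": [], "validation": []}
    for sample, patient in sample_to_patient.items():
        splits[role_of[patient]].append(sample)
    return splits

# Hypothetical cohort: 10 patients with 2 tumor samples each.
samples = {f"S{p}_{k}": f"P{p}" for p in range(10) for k in range(2)}
splits = split_by_patient(samples)
for role, ids in splits.items():
    print(role, len(ids))
```

Splitting on patient ID rather than sample ID matters whenever one patient contributes several samples; otherwise near-duplicate tumors can end up on both sides of the split and inflate validation performance.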
[0:08:07] HC: The sequencing data itself, could you – for those who aren’t familiar with this type of data, could you elaborate more on what that looks like, high dimensional, what are its characteristics?
[0:08:18] RR: Yes. Just to take a big step back, when we talk about sequencing data, this is the kind of data that arises when we set out to, for example, sequence the first human genome, which was undertaken in the late nineties and early two-thousands, and completed between 2001 and 2003. But this technology has expanded dramatically. So we can sequence DNA, and that’s the genetic code. We can also sequence RNA, which is the messages derived from that code that actually tell the cell what to make and what to do. We can sequence lots of other kinds of molecules as well. In Genialis’ hands, we really like to work with the RNA. Again, these are the messages. We believe the RNA holds a lot of information. These are very high-dimensional data. So depending on how you count it, there’s somewhere between 20,000 and 50,000 genes in the genome, right? So we’re expecting to get, you know, measurements from all of those, and across however many patients, or preclinical samples, or what have you. The data are often a bit sparse; not all genes have detectable expression at once. Certainly, not all genes are going to be informative. We’ve built really beautiful software that allows us to aggregate these kinds of sequencing data, to process them in a very uniform way, to layer on what we call metadata, so the clinical information or the experimental information that’s crucial for understanding where those data come from, and to harmonize data that come from lots of different places. In other words, to pull data from different sources together in a way that they’re useful together.
We have that chunk of technology that lets us really work at scale with high-throughput sequencing data. Again, we today like to use RNA. RNA sequencing itself is at least a 15-year-old technology. Bulk RNA sequencing is not cutting-edge unto itself. But bulk RNA sequencing for clinical applications, where you have to go into a regulated environment, have FDA approval, and make sure that things are super consistent and robust, that’s actually pretty cutting-edge, and we’ve had quite some success there.
[0:10:20] HC: With this data being so high dimensional and sparse, and many of the other challenges you mentioned there, are there any specific techniques that you found that are helpful in dealing with the dimensionality of the data, and the sparsity?
[0:10:35] RR: Yes. We have a series of, I guess you’d call them filters, that we apply to data when we want to go from, again, a dataset that has 20,000 to 50,000 genes down to one that has somewhere closer to 20 to 50 genes, right? So we’re looking to reduce the number of genes that we’re inputting into a machine learning model by something like a factor of a thousand. Part of that can be done with really standard bioinformatics, looking just at minimum expression value thresholds, the distribution of expression across samples, et cetera. Some of that can also be done based on whether these are genes that actually show variance from one setting to another. In other words, between tumors and normal tissue, or between different patients.
But we’ve invented some really cool methods for doing feature selection and feature reduction that explicitly address issues like bias, right? When I talk about bias, what I mean here really is just differences between any two datasets that arise for reasons other than the one you’re trying to measure, right? There may be differences between two datasets that have nothing to do with differences in the cancers, but rather differences in how those data were collected, how the patient samples were stored, what sequencing machine or sequencing platform was used. Biases can arise from gender, they can arise from genetic or ethnic background, and so forth.
We’ve invented some cool technologies that allow us to walk across these different, what we call axes of bias, these different sources of what will ultimately be noise in your model, and select only those gene features that are rigorous and robust in sort of all of the settings in which we need them to be. This has been super helpful. It’s allowed us, for example, to build a model with a collaborator that was trained on gastric cancer data. Not only gastric cancer data, but microarray data, so an older form of data generation, and it has since been validated to work in ovarian cancer, colorectal cancer, melanoma, et cetera. We have [inaudible 0:12:42] lung cancer, and we’re testing it on breast cancer.
It really is a pan-cancer model, even though it was trained on a data set that was just gastric cancer. And it works on RNA sequencing of all different chemistries, even though it was trained on microarray. The reason why we think, in large part, is because we took a really hard look at those sources of bias when we were doing feature selection.
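Illustrative sketch (not from the conversation): the kinds of filters described above can be approximated with a few lines of NumPy. The first and third filters (minimum expression, variance) are the “standard bioinformatics” steps Rafael mentions. The middle filter, which keeps only genes whose tumor-versus-normal difference points the same way in every cohort, is a simple stand-in chosen for this example and is not Genialis’ proprietary method. The thresholds, the toy matrix, and the choice of cohort as the single axis of bias are all assumptions.

```python
import numpy as np

def filter_genes(expr, cohort, label, min_expr=1.0, min_detected=0.5, top_k=3):
    """
    expr:   samples x genes matrix of log-scale expression values
    cohort: per-sample cohort or platform id (one hypothetical axis of bias)
    label:  per-sample group (0 = normal, 1 = tumor)
    Returns the column indices of genes that pass all three filters.
    """
    expr = np.asarray(expr, dtype=float)
    cohort = np.asarray(cohort)
    label = np.asarray(label)

    # Filter 1: gene must be detectably expressed in enough samples.
    detected = (expr >= min_expr).mean(axis=0) >= min_detected

    # Filter 2: tumor-vs-normal difference must have the same nonzero sign in every cohort.
    signs = []
    for c in np.unique(cohort):
        in_c = cohort == c
        diff = expr[in_c & (label == 1)].mean(axis=0) - expr[in_c & (label == 0)].mean(axis=0)
        signs.append(np.sign(diff))
    signs = np.array(signs)
    consistent = np.all(signs == signs[0], axis=0) & (signs[0] != 0)

    # Filter 3: among the survivors, keep the top_k most variable genes.
    keep = np.flatnonzero(detected & consistent)
    order = np.argsort(expr[:, keep].var(axis=0))[::-1]
    return np.sort(keep[order[:top_k]])

# Toy data: 8 samples x 5 genes, two cohorts, two normal and two tumor samples each.
expr = [
    [2.0, 3.0, 0.0, 5.0, 4.0],  # cohort A, normal
    [2.2, 3.1, 0.0, 5.1, 4.0],  # cohort A, normal
    [6.0, 6.0, 0.0, 3.0, 4.0],  # cohort A, tumor
    [6.1, 6.2, 1.5, 3.1, 4.0],  # cohort A, tumor
    [2.1, 6.0, 0.0, 5.2, 4.0],  # cohort B, normal
    [1.9, 6.1, 0.0, 4.9, 4.0],  # cohort B, normal
    [5.9, 3.0, 0.0, 2.9, 4.0],  # cohort B, tumor
    [6.2, 3.2, 0.0, 3.0, 4.0],  # cohort B, tumor
]
cohort = ["A", "A", "A", "A", "B", "B", "B", "B"]
label = [0, 0, 1, 1, 0, 0, 1, 1]

selected = filter_genes(expr, cohort, label)
print(selected.tolist())
```

In the toy matrix, gene 1 flips direction between cohorts (a batch-dependent signal), gene 2 is rarely detected, and gene 4 never changes, so only genes 0 and 3 survive. Real pipelines would work on tens of thousands of genes and several axes of bias at once.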
[0:13:03] HC: You’ve mentioned two criteria with respect to bias. One is being careful about the patients you select. Two is being careful about the features you select. In mitigating bias, are these data aspects the main things that you do to tackle bias, or is there anything with respect to the algorithm the way you train it, that might also help them in mitigating bias?
[0:13:25] RR: Yes, that’s a really great question. The key with algorithm training, of course, is to try to avoid what’s known as overfitting. Overfitting is where a model learns the signal in the training data, but with the consequence of learning training-data-specific signal that doesn’t generalize to other data. One thing we can do is we can be really thoughtful about the type of algorithm we select. Right now, on the tip of everyone’s tongue in the AI world are things like these large language models, foundational models, generative AI. These are super cool technologies. Undoubtedly, these will be transformational in healthcare. But they also tend to be very deep, and they tend to be fairly opaque black boxes.
What we’ve found when we’re talking about training models for clinical applications, where there are a few things that are true, one is the datasets tend to be smaller, and noisy because of the clinical heterogeneity of the patients. Two is, there are a lot of interested stakeholders who want to know how the model works, regulators want to know how the model works, physicians want to know how the model works. We are often better served by choosing a simpler, shallower artificial intelligence architecture, a shallow artificial neural network, rather than a deep neural network, for example. It sounds way less cool, and less sexy, but it works. The models actually learn really interesting patterns in the data that a human couldn’t learn on their own, and we can explain them.
This is one way to help minimize the sort of propagation of bias: avoiding overfitting by choosing a simple learning architecture that matches the kind of data you’ve got available. Then, going to great pains to do the independent validation, and some explanatory work, right? Building tools that allow you to unbox the black box, and really look inside, and understand what the model’s doing.
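Illustrative sketch (not from the conversation): one rough way to see why a shallow network suits small clinical datasets is to count trainable parameters against the number of patients available. The layer widths and the cohort size below are hypothetical, chosen only to echo the scales mentioned in the interview (tens of selected genes versus tens of thousands, four phenotypes).

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_patients = 300  # hypothetical size of a clinical training cohort

# Shallow: 50 selected genes -> one hidden layer of 8 units -> 4 phenotypes.
shallow = count_parameters([50, 8, 4])

# Deep: 20,000 genes -> three hidden layers -> 4 phenotypes.
deep = count_parameters([20000, 1024, 512, 256, 4])

print(f"shallow: {shallow} parameters, {shallow / n_patients:.1f} per patient")
print(f"deep:    {deep} parameters, {deep / n_patients:.0f} per patient")
```

Parameter count is only a crude proxy for capacity, and regularization changes the picture, but a model with tens of thousands of free parameters per patient has far more room to memorize cohort-specific noise than one with one or two.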
[0:15:18] HC: You’ve mentioned the validating your models generally there, and you’ve also mentioned regulation. How does the regulatory process affect how you develop, and how you validate your models?
[0:15:29] RR: Great question. The FDA has drafted something called the good machine learning framework. This is a set of guidelines that, if you really know the field, are fairly common sense, but it’s important to keep them in mind. We have these printed on a giant poster in our office, right? Everything we do operates within this framework of good machine learning practice, which requires, of course, keeping your validation data sets completely separate from your training data, keeping your training data separate from your discovery data, and so forth. But the way that we build these models, again, to predict phenotype where there’s more than one phenotypic class that can be output, has a neat little advantage to it. It allows us to do hypothesis testing.
Again, every phenotype that our model predicts, whether it’s phenotype A, B, C, or D, has a different therapeutic hypothesis. We believe that phenotype A is going to benefit most from drug X, for example, phenotype B from drug Y, phenotype C from drug Z. We go out and find datasets that allow us to directly test those hypotheses. Here, we’re not just looking at model performance, but we’re also looking to understand that the model actually learned the therapeutic biology. I think that’s a key differentiator to how we do things.
Now, in terms of the regulatory context, we found the FDA is super excited about this stuff. The FDA absolutely recognizes the importance and potential value of machine learning in clinical applications. They have taken what I think is an appropriately cautious stance to how these kinds of tools need to be proved out. At the end of the day, if you’re building a diagnostic device using AI, or if you’re designing and developing a drug discovered by AI, you still have to meet the same kinds of burdens that a traditional diagnostic device or drug has to meet from the regulator’s eye. It has to work. So we think about building out our validation schema to show that the models at the very least will do no harm, and at best actually provide patient benefit.
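Illustrative sketch (not from the conversation): a hypothesis such as “phenotype A benefits most from drug X” can be checked on an independent cohort with a standard one-sided Fisher’s exact test on a 2x2 table of responders and non-responders. The interview does not say which statistical test Genialis uses, so the choice of test is an assumption, and the counts below are invented for the example.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """
    One-sided Fisher's exact test on the 2x2 table
        [[a, b],   predicted phenotype A: responders, non-responders
         [c, d]]   all other phenotypes:  responders, non-responders
    Returns the probability of seeing at least `a` responders in the
    phenotype-A row if response were unrelated to phenotype (margins fixed).
    """
    row1 = a + b          # patients predicted as phenotype A
    col1 = a + c          # responders overall
    n = a + b + c + d     # all patients treated with the drug
    upper = min(row1, col1)
    favorable = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return favorable / comb(n, col1)

# Invented validation cohort of 25 patients treated with drug X:
# phenotype A: 8 responders, 2 non-responders; others: 3 responders, 12 non-responders.
p_value = fisher_exact_greater(8, 2, 3, 12)
print(f"one-sided p = {p_value:.4f}")
```

An exact test is a natural fit here because, as noted earlier in the conversation, validation cohorts for drugs still in development may contain only tens of patients.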
[0:17:31] HC: Your team has published a number of research articles. What benefits have you seen from publishing your work?
[0:17:38] RR: I mean, the main thing is credibility. Everyone can say that they’re doing AI, everyone can say that they’re building tools, or assets that are going to be transformational in healthcare, but you kind of have to put it out there. So peer review is a trusted process. Peer-reviewed articles are great. But also, you know, pushing the work forward to let the world kind of see how it goes. The biggest impact for us has been being able to, you know, go out and tell the story. The paper we just had come out in Frontiers in Oncology is actually the culmination of several years of work. We have published those findings and more in conference posters for the last three years. Those kinds of public disclosures are absolutely key to get engagement from the community.
A big thing for us is we want collaborators, we want drug companies, and diagnostic companies, and clinical centers, and cancer centers to read this and say, “Hey, we want to try that in our shop. Let’s get together and do some retrospective studies on data that we’ve got available.” If it works well, we can go forward with building new models that are specific to our programs.
[0:18:41] HC: Thinking more broadly about your goals at Genialis, how do you measure the impact of your technology to make sure you’re achieving what you set out to do?
[0:18:50] RR: Ultimately, we want to see impact on patients. This is actually why Genialis in the early days decided to focus on developing clinical applications, diagnostic type tools, initially, rather than starting with drug discovery. It was simply a question of how many years will it take before we can start saving lives. We anticipate one of the models we built being integrated into a pivotal phase three clinical trial that’s supposed to start enrolling this year. We’re really excited to see that. So, that’s kind of the long tail. We have to be patient, and wait to see what clinical benefit actually comes.
But in the meantime, we see that our models are being used and adopted by drug companies that are developing promising new medicines. If we can help them design clinical trials that are faster to enroll, cheaper to enroll, that are more likely to meet their endpoints, and ultimately more likely to get really good medicines to those patients who are going to benefit, that’s a huge impact. Even what I call incremental success, seeing a successful phase two clinical trial of a drug that we’re supporting with the biomarkers, is a huge feather in our cap.
[0:19:59] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[0:20:05] RR: That’s a really great question. I would say, start with your purpose, right? I’m saying, this is a do as I say, not as we did, necessarily. Genialis originally started with a suite of technologies that we were super excited about and kind of went searching for the purpose. We found one, and we’ve adopted it wholeheartedly. But I do believe it’s ultimately better to start with the purpose. The truth is, AI technologies right now are becoming hyper-commoditized. You know, the technology giants, and even upstarts like OpenAI, which is not a giant, but it has giant impact. They’ve made publicly available some of the most powerful algorithms the world has ever imagined.
It is still possible for small companies to come up with really innovative algorithms. We see it with companies like Insilico Medicine, which did a lot of the pioneering work on generative AI in the healthcare context. But for the most part, it really matters how you deploy these technologies. I think you’re more likely to become what they call an exponential organization, or an organization that has huge scalable potential, if your focus is on, you know, using resources that are generally available in novel ways, right? Instead of bogging down on trying to squeeze that last bit of accuracy out of a new model framework, start with what we’ve got and apply it to your purpose.
[0:21:25] HC: Finally, where do you see the impact of Genialis in three to five years?
[0:21:28] RR: In three years, Genialis is going to have a number of new predictive models on the market for some of the most important emerging therapeutic areas in oncology. In five years, though, I’m hoping to unveil something that’s going to be really, really game-changing. We’re working on what amounts to a comprehensive phenotypic landscape of cancer. My goal, and maybe it will be five years, it might be seven or ten. But my goal in five years is to have a working version of this, such that any cancer patient can have their tumor sequenced. It might be RNA, but it might be a different analyte by then. It doesn’t really matter. We can map it to the phenotype space.
But any cancer patient can have their tumor evaluated by a single assay, and be placed confidently in this comprehensive landscape, a map of cancer, where we know exactly what the driver biologies are, and we can associate those with the most likely therapies to work. I think what we’re going to discover is that for some patients, maybe for many patients, there is not an existing therapy that’s going to be best for them. And for those patients, because we’ve got a really detailed map of their disease biology, we can also really make an impact on new drug discovery, because we’ll know exactly what biologies we need to go after with the next generation of drugs.
[0:22:45] HC: This has been great, Rafael. Your team at Genialis is doing some really interesting work for precision medicine. I expect the insights you shared will be valuable to other AI companies. Where can people find out more about you online?
[0:22:58] RR: Our website, obviously, www.genialis.com. Follow us on LinkedIn, we’re putting out a lot of new content and thought leadership. We also have a podcast. So if you’re into listening to podcasts, Talking Precision Medicine can be found on all major streaming services, also on our website. I do encourage everyone to go check it out. We talk a lot on our website and in other posts about what we call our people-first approach. This is really a dual philosophy of how to build our business and how to build our company. Putting people and humans at the center of the AI paradigm, but also putting our people, and our team at the center of how we work. You can learn all about that on our website.
[0:23:39] HC: Perfect. Thanks for joining me today.
[0:23:40] RR: Thank you, Heather. It’s been a pleasure.
[0:23:43] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you’ll join me again next time for Impact AI.
[0:23:52] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe, and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.