Accelerating Materials Development with Greg Mulholland from Citrine Informatics

Sustainability is finally getting the attention it deserves as the global drive to reduce our carbon emissions gets more frantic each day. Thankfully, the progression of AI has accelerated the way materials and chemical manufacturers can go about their business in an environmentally friendly and sustainable manner.

Today I am joined by Greg Mulholland, the Co-Founder and CEO of Citrine Informatics, a technology company that is focused on accelerating the development of the next generation of materials and chemicals. We discuss the role of machine learning in Citrine’s technology, the challenges they are forced to overcome regarding their data sets, the model accuracy and explainability balance, and how Greg and his team validate their models. There is no doubt that Citrine’s work is vital for the global sustainability effort, and our guest explains his company’s collaborative programs, how publishing research articles has boosted Citrine’s profile, what this AI-powered business hopes to achieve in the next five years, and so much more!

Key Points:

Introducing Greg Mulholland, his professional background, and how he ended up at Citrine.
Greg explains what Citrine does and why this work is important for sustainability.
The role of machine learning in Citrine’s technology.
Taking a closer look at Citrine’s data sets and the data challenges that they encounter.
The techniques that Greg and his team use to successfully handle small data sets.
Examining the balance between model accuracy and explainability.
How he validates his models.
An explanation of Citrine’s collaborative program with external researchers.
The benefits of publishing research articles.
Greg’s advice to other leaders of AI-powered startups.
His vision for Citrine’s impact and influence over the next five years.

Quotes:

“I trained as an electrical engineer and got into material science because I believed that material science was really an important technology set of disciplines that we needed, to solve the world's most pressing environmental challenges.” — Greg Mulholland

“We started the company 10 years ago now; we've been able to show that machine learning and artificial intelligence, among other things, can be used to really accelerate the future of the materials and chemicals industry. It was the vision all along, but it really required a lot of technology development and we're really proud of how far we've come.” — Greg Mulholland

“The scientists in our community are brilliant people.” — Greg Mulholland

“Explainability is important. Accuracy is also important. Neither is dominant over the other. It turns out, a less accurate model that is more explainable can often help unlock new thinking in a scientist's mind, that then unlocks the next-generation product.” — Greg Mulholland

“Publishing what we do as a starter for more conversations; I think it helps us attract good talent. It helps people understand that we're doing cutting-edge research and continue to invest in driving forward the field. I take it as a little bit of a feather in our cap and a source of pride that we get to help the world move along into this new era of AI.” — Greg Mulholland

“We've seen companies remove toxic chemicals from important products much more quickly than they could have otherwise. We've seen companies reduce their energy consumption. We've seen companies reduce costs and reduce carbon input. Those are all really exciting to me.” — Greg Mulholland

Links:

Greg Mulholland on LinkedIn
Greg Mulholland on Twitter
Citrine Informatics

Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.

Transcript:

[INTRODUCTION]

[0:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven machine learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.

[INTERVIEW]

[0:00:34] HC: Today, I’m joined by guest, Greg Mulholland, Co-Founder and CEO of Citrine Informatics, to talk about developing materials and chemicals. Greg, welcome to the show.

[0:00:44] GM: Thanks, Heather. It’s a real pleasure to be here.

[0:00:45] HC: Greg, could you share a bit about your background and how that led you to create Citrine?

[0:00:49] GM: Sure. I trained as an electrical engineer and got into material science because I believed that material science was really an important technology set of disciplines that we needed to solve the world’s most pressing environmental challenges. When I got into the industry, what I realized is that while I had been using computers in every area of my life, I wasn’t using any advanced analytical technology at the time, this was in the early aughts, to learn new things about my materials.

I was still doing things in the same way that Newton or Edison would have done them. Very manual, very intuition-driven. It felt to me like, time was ripe for a revolution in the chemicals and materials industry, that we started to make better decisions using these modern technologies. At the time, I didn’t fully know what was possible, but ever since then, we started the company 10 years ago now, we’ve been able to show that machine learning and artificial intelligence, among other things, can be used to really accelerate the future of the materials and chemicals industry. It was the vision all along, but it really required a lot of technology development, and we’re really proud of how far we’ve come.

[0:01:57] HC: What all does Citrine do, and why is it important for sustainability?

[0:02:02] GM: What I sometimes like to say is that in the chemicals and materials industry, we’ve done all of the easy stuff, and actually, we’ve done a lot of hard stuff, too. If there were obvious ways to find the next generation of bioplastic, or biodegradable other materials, recyclable metals, and plastics, better chemicals that we can use from everything from agriculture to the creams and lotions we put on our skin, these are all areas that we’re needing to rapidly advance and become both more sustainable and more efficacious. How do we target things more appropriately?

What Citrine does is we use fundamental chemistry and physics, alongside material science and chemistry data mine, with this very special class of machine learning that we have developed, to accelerate the development of those materials. Someone can come to us and I’ll use a baking example, say you’re making cupcakes. You want your cupcakes to be moister than they usually are, much softer and have that really nice taste. Our system would take all of your historical cupcake data and all of the physics we understand about the baking process, and it would recommend new recipes, combinations of ingredients, the order in which you mix those ingredients, the baking process you use, and all of the other processing, rolling, or whatever you might do along the way, to make the perfect cupcake.

We do this across the materials and chemicals industry, so including things like, I mentioned lotions and creams, things like soaps, things like adhesives, the alloys, batteries, just really across the spectrum of materials and chemicals, and almost always, we’re looking to improve performance, reduce cost, and reduce toxicity and carbon impact, because that triple bottom line is really the calling card of most companies these days, developing new products.

[0:03:45] HC: You mentioned machine learning earlier on, but what role does it play in your technology?

[0:03:50] GM: Yeah, it’s interesting, because one of the big things in our industry is that if you look at data across the globe, there is a lot of it, right? We see these LLMs that are being trained on the entire internet that can answer questions and all these cool things. In the materials and chemicals industry, we don’t have the benefit of free, huge volumes of data. Every experiment costs money, and to give you a sense of what the cost can be, a single adhesive, or paint, or something, a material that you blend together primarily in a soft material can cost about 10,000 bucks to test one of those. Plus, or minus a little bit, but that’s the ballpark.

In the world of alloys and batteries and these hard materials that require a lot of processing, it can be millions of dollars to test a single material thoroughly. Our machine learning is there to take the small data that we have alongside this physics and basically, be the glue that allows us to connect what we understand about materials to what the data is telling us. Because materials are a really interesting, interesting domain. I think a lot of people think about chemistry as atoms and molecules. But when we talk about materials and chemicals, we talk about the whole stack, all the way to the thing you’re using.

What I sometimes like to tell people is the atoms that are in the plane wing are not what keep the plane wing and the plane in the air. It’s the atoms and the crystals they form and the different textures within those crystals and there’s all these different length scales. Our machine learning, the role it plays, to answer your question very explicitly, is that it connects those length scales. We understand each of the length scales quite well. But the machine learning is the glue that brings them together.

[0:05:33] HC: The data you’re working with with these machine learning models is related to the materials and chemicals. But what does that data look like?

[0:05:41] GM: Yeah. Well, before we arrive very often, it’s not. The parallels to what Newton and Edison did really go very deep in our industry. The history, or the way we find data a lot of times, I mean, sometimes it’s written in narrative form. It’s literally closer to a New York Times cooking recipe than it is to some structured data file. Because we run into data along the way that’s well structured, but it’s very often a smattering of spreadsheets across an organization that needs to be stitched together into what a computer scientist, or a data scientist would look at as relational, or interconnected data.

The kinds of data we typically look at falls into four categories. The first is what we refer to as composition. In the plastics world, it’s what molecules do you have, or how are those molecules structured. In the alloys world, it’s what bucket of various metals have you mixed together? It’s really the input ingredient space. The second is what we call processing, which is exactly what it sounds like. It turns out, if you blend something at a different temperature, or you process something at a higher pressure, or it turns out, I’ll use the cupcake example again, if you bake at a 1,000 degrees for 1 minute, or 30 minutes at 300 degrees, those two things are not actually at all the same and your cupcakes will taste very different, understanding that processing history of a material.

Then the third of four that we pay attention to is what we call structure, which is the outcome of the first two. What is the texture of the material? What are the grains that are forming? How do the various components of it overlap and interact? Then finally, the fourth thing, which is what we care about optimizing is the performance of the material. The performance often shows up in the form of properties. What is the density, or the weight? What is the strength? Is it a transparent material, or not? What is the viscosity, if it’s a liquid? What are the characteristics of the output?

What Citrine does is we are able to find very subtle relationships among the first three and correlate those to the fourth, and in using a set of optimizers that we built that understand chemistry quite well, we can very quickly drive to what is the recipe that is going to get you to your desired performance much more quickly than a human could using just their own brain.
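The four data categories Greg lists (composition, processing, structure, and performance) map naturally onto a simple record type. The following is a minimal sketch, not Citrine's actual data model: the class name `MaterialRecord`, the `features` method, and all field values are hypothetical and chosen only to illustrate how the first three categories become model inputs while the fourth is the target.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MaterialRecord:
    """One tested material, split into the four categories from the interview."""
    composition: Dict[str, float]   # ingredient -> fraction (the input ingredients)
    processing: Dict[str, float]    # process parameter -> value (e.g. temperature)
    structure: Dict[str, float] = field(default_factory=dict)   # measured structure
    properties: Dict[str, float] = field(default_factory=dict)  # performance to optimize

    def features(self) -> Dict[str, float]:
        """Flatten the first three categories into one set of model inputs."""
        merged: Dict[str, float] = {}
        for prefix, group in (("comp", self.composition),
                              ("proc", self.processing),
                              ("struct", self.structure)):
            for name, value in group.items():
                merged[f"{prefix}.{name}"] = value
        return merged


# Hypothetical cupcake record; every number is made up for illustration.
cupcake = MaterialRecord(
    composition={"flour": 0.45, "sugar": 0.30, "butter": 0.25},
    processing={"bake_temp_f": 350.0, "bake_minutes": 20.0},
    structure={"crumb_pore_mm": 1.2},
    properties={"moistness_score": 7.5},
)
print(cupcake.features())
```

Keeping `properties` out of `features()` mirrors the description above: the first three categories are correlated to the fourth, which is what the optimizer tries to improve.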

[0:07:54] HC: What kinds of challenges do you encounter in working with the various types of materials and chemical data?

[0:07:59] GM: It’s interesting. The scientists in our community are brilliant people. Usually, the data always has messiness to it, right? It lives in the real world and the real world is not a clean place. Generally, the data we work with is relatively well thought out, relatively clean and where it is documented, I mean, sometimes things are lost to history, but where it is documented, scientists are pretty rigorous people on average.

The trouble we tend to run into is that they are very – the data is very rare. We might go into a company and they might be trying to improve their overall performance by 10%, whatever that performance metric might be for their product. That’s usually a big leap. 10% is a lot. Then we’ll say, “Well, how much data do you have?” They’ll say, “Well, we have 60 examples of materials where we’ve made this material and tested it before.” Most machine learning, I mean, Heather, you know this, most machine learning would see 60 and think, “Well, probably not worth learning on.” Our system can thrive in that environment. That’s what we seek to do.

The challenges are really, getting the data into a single place and learning from it as effectively as possible. When we get data, it’s usually of reasonably high quality, or at least it’s in the correct ballpark. There’s not a lot of crazy, erroneous stuff. The data volume problem is the one we’re constantly battling. I think there are opportunities for us and for others to start making data more available in the industry.

[0:09:25] HC: Yeah. The other area where I see the lack of data as a consistent challenge is a lot of medical imaging applications. If you’re working towards clinical trials, fewer than 100 examples might be fairly common there.

[0:09:38] GM: It’s interesting in that space, and the same thing is true in chemistry, right? It’s not that the data doesn’t exist. I mean, there are medical imaging. I mean, while we’ve been talking, there have probably been tens of thousands of medical images taken in the US alone. It’s the business barriers and legal barriers to sharing that data. The same is true in the chemicals industry. If you take two major chemicals players that compete against each other, there is no world in which they are going to allow sharing of data between those two companies, even if the outcome would be better for both of them. It’s just not in their best interest to risk that data leakage. Obviously, HIPAA has a lot to say in the medical world. Yeah, it’s a similar set of problems. I think the answers are very domain specific.

[0:10:19] HC: Well, in your domain, how do you handle the small data sets? Are there specific techniques you’ve found to be most successful in your domain?

[0:10:27] GM: Yeah. There are a few different ways that we do it. The first is that we, sort of like medical imaging, I would say, even to a much stronger degree. In our world, we have the benefit of pursuing truth. If you look at everybody always, or a lot of people who use the Netflix recommender algorithm from long ago as their example. That’s a case where human preferences shift over time. Whereas in the end, materials and chemistry is trying to learn and exploit the laws of physics. As we do that, we can start to learn some underlying phenomena that actually, we don’t need to relearn from data every time.

Say, you’re a specialist chemist, and you’ve been working in a big chemical company making plastics for 20 years. You know a lot about that class of materials. You might know that if you use a higher molecular weight polymer, then it increases the hardness of the plastic. If you make it more transparent on average, it makes it flimsier. If you cook it at a hotter temperature, it becomes brittle. You might just know that stuff. You don’t need to look at the data. You’ve learned that over your 20 years.

The way we approach it is, is we actually use a hybrid form of machine learning that allows us to pre-weight and pre-inform the models, and it’s not even us. It’s the scientists who do the work, can pre-weight and pre-inform the models. Then the data comes over top to find new relationships that might have not been identified by the scientist and reinforce, or discourage the relationships the scientist has entered.

Rather than being this, I’ve gotten asked a lot recently in this LLM craze about will people’s jobs be taken? Certainly. There are certainly jobs that will be at risk. There is no question. I think in the scientific community, we see this more as a source of leverage and expansion. It’s a super calculator for a scientist. It’s not a replacer of scientific intuition, simply because we do not have the data to learn. If you had infinite data, you can learn all the science, but we will never have infinite data, and so we will always, or at least for the foreseeable future, as long as we are all alive, need people involved to take advantage of these tools with the data in hand.
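One common way to let expert knowledge "pre-inform" a model before small data arrives is a Bayesian prior. The sketch below is an illustration of that general idea only, assuming a one-parameter linear relationship with known noise; it is not a description of Citrine's hybrid method, and the function name `posterior_slope` and all numbers are hypothetical.

```python
def posterior_slope(xs, ys, prior_mean, prior_std, noise_std):
    """Conjugate normal update for the slope w in y = w * x + noise.

    The scientist's belief enters as a normal prior on w; the data then
    pulls the estimate toward what was actually measured.
    Returns (posterior_mean, posterior_std).
    """
    prior_precision = 1.0 / prior_std ** 2
    data_precision = sum(x * x for x in xs) / noise_std ** 2
    post_precision = prior_precision + data_precision
    weighted_sum = (prior_mean * prior_precision
                    + sum(x * y for x, y in zip(xs, ys)) / noise_std ** 2)
    return weighted_sum / post_precision, post_precision ** -0.5


# Hypothetical data: hardness change versus a scaled molecular weight.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.4, 2.1, 3.9, 4.6]

# Data-only least-squares slope (through the origin) for comparison.
ols_slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The expert believes the slope is about 1.5, give or take 0.5.
post_mean, post_std = posterior_slope(xs, ys, prior_mean=1.5,
                                      prior_std=0.5, noise_std=0.5)
print(ols_slope, post_mean, post_std)
```

The posterior mean is a precision-weighted average of the expert's guess and the data-only estimate, so with only four points the prior still matters, and its influence shrinks as more measurements come in.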

[0:12:40] HC: How do you think about the balance between model accuracy and explainability? Is one more important than the other for you, or do you have to strike a balance between them and training models?

[0:12:51] GM: Yeah, it’s interesting. When we first started, we had this mentality that accuracy is the most important thing. If you get it right, the answer is the answer and it’s proved out, and the value is there. What I think we’ve learned, actually, is that because it is this hybrid model where we have a human and a computer working in concert with one another, there’s – explainability is important. Accuracy is also important. Neither is dominant over the other. It turns out, a less accurate model that is more explainable can often help unlock new thinking in a scientist’s mind, that then unlocks the next generation product.

Very often, because we work in a small data limit, initial estimates of accuracy, if you’re not careful, and we are very careful, you can get into the overfitting regime very, very quickly. What we have found is, accuracy is valuable up to some threshold. Beyond that, being able to elucidate why the model is saying what it is saying, and making sure that a scientist can affect those things. I mean, in a lot of ways, it’s creating a conversation back and forth between the AI and the scientist.

In so doing, that’s how we see the best results emerge. When we get tested, sometimes we’ll have customers come and say, “Well, you didn’t get to an R squared of 0.99.” My response is, I can make that number whatever I want it to be. If we need to get to 0.99, any machine learning, or data scientist, machine learning engineer, or data scientist knows how to build a system that can check a box. But it’s not about checking a box. It’s about getting to a predictive enough real-world outcome that it becomes a tool that’s useful in the development of new products, not model accuracy for model accuracy’s sake.
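The point about an R squared being "whatever I want it to be" can be illustrated with a model that simply memorizes its training data. This is a minimal sketch on synthetic data, assuming 60 noisy points and a one-nearest-neighbor predictor; the helper names `r_squared` and `nearest_neighbor_predict` are hypothetical, and nothing here reflects Citrine's models.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def nearest_neighbor_predict(train_x, train_y, x):
    """Return the target of the closest training point (pure memorization)."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


# Synthetic small data set: 60 points, linear trend plus noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]

# Scored on its own training data, the memorizing model looks perfect.
train_pred = [nearest_neighbor_predict(xs, ys, x) for x in xs]
train_r2 = r_squared(ys, train_pred)

# Leave-one-out: each point is predicted without being in the training set.
loo_pred = []
for i in range(len(xs)):
    rest_x = xs[:i] + xs[i + 1:]
    rest_y = ys[:i] + ys[i + 1:]
    loo_pred.append(nearest_neighbor_predict(rest_x, rest_y, xs[i]))
loo_r2 = r_squared(ys, loo_pred)

print(train_r2, loo_r2)
```

The training score is perfect by construction, while the held-out score is lower, which is why a headline R squared says little about real-world predictive value in the small data regime.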

[0:14:34] HC: How do you go about validating your models? Is explainability part of that? Or what other means do you have to validate them, especially when you have a limited amount of data?

[0:14:43] GM: Yeah, so there’s good and bad. I mean, the first way to go about it is the explainability aspect. A scientist who’s been working in the field for a while looks at it and says, “Yeah. I mean, based on my experience and my own logic, you’re saying these correlations exist?” Yeah. Actually, mostly they do. I had a particular point of pride. We were working with a very famous scientist at Northwestern University who was – had developed a whole class of materials, which she was super famous for and is incredibly, incredibly talented. We built a machine learning model on her data. This is many years ago now.

I remember I got a call from her and she said, “You’ve identified this particular parameter in the data.” I think it was the polarizability of the crystal as really, really important. She said, “I never thought about that before, but it makes perfect sense.” For me, that was a real breakthrough moment, where we actually took an expert and helped them see their own field in a different way. It was a really positive experience. Then the way we really validate and the way our customers validate is to say, “Look, you can build models all day long, but we’re here to make chemicals. We’re here to make materials.”

Someone will go and they’ll pick the five, or 10 recipes that they like the most, or think are most successful, or using whatever criteria they might use. They’ll go make the thing and they test it in the real world and feed it back in. For me, that is the gold standard. There’s a lot of talk about predicting all these energetics of molecules and protein folding and all this stuff. A lot of times, that’s on simulated data.

Having it play out in the real world and actually work in an experimental capacity has always been our guiding light. I think, puts us at a big advantage over folks who are focused more on the theory, at least in terms of developing new products. Theory has a lot of value in a lot of places. For us, our guiding light is always that experimental result.
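The validation loop Greg describes (pick a handful of recommended recipes, make and test them, feed the results back in) can be sketched as a single function. This is an illustrative outline only: `run_validation_round`, `predict`, and `run_experiment` are hypothetical names, and the toy "lab" below is a stand-in formula, not real experimental data.

```python
def run_validation_round(candidates, predict, run_experiment, dataset, k=5):
    """Test the k best-predicted candidates and log prediction vs. measurement."""
    ranked = sorted(candidates, key=predict, reverse=True)
    for recipe in ranked[:k]:
        measured = run_experiment(recipe)  # stands in for a real lab test
        dataset.append((recipe, predict(recipe), measured))
    return dataset


# Toy setup: a recipe is just a bake temperature in degrees Fahrenheit.
candidates = list(range(300, 401, 10))


def predict(temp):
    """Model's belief: quality peaks at 350 (made-up formula)."""
    return -(temp - 350) ** 2


def run_experiment(temp):
    """Pretend lab result: quality actually peaks at 340 (made-up formula)."""
    return -(temp - 340) ** 2


results = run_validation_round(candidates, predict, run_experiment, [], k=3)
for recipe, predicted, measured in results:
    print(recipe, predicted, measured)
```

Each logged tuple pairs a prediction with a real measurement, so the gap between the two is both the validation signal and new training data for the next round.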

[0:16:31] HC: I saw that Citrine has a program for working with external researchers. How does this collaboration work?

[0:16:38] GM: We have two types of researchers we typically work with. The first is, as you can imagine, machine learning researchers in chemistry, folks who are doing groundbreaking work and we want to be close to them and learn from them and have them learn from us. We’re a very active part of the scientific dialogue in computer science and data science and in machine learning engineering, particularly at the small data limit. Those are generally one-on-one, quieter collaborations, simply because that’s very close to our core product.

We also have a relatively robust, extremely robust set of activities working with academics, where they are material science researchers, or chemical engineering researchers, or chemistry researchers, people who are not experts in machine learning, or artificial intelligence. Where the role we play is to bring this technology into their domain. We work with their students and the PIs at these universities to both secure funding and execute on programs. It’s our goal to – I mean, we use it for a number of reasons. One is to learn more about how people use these things and how it can be effective in new domains, but also, to show that these technologies can be used both in an industrial capacity and also in groundbreaking new research.

The person who leads that effort at our company, his name is James Saal. He has done a really just phenomenal job carrying forward our mission to accelerate the development and deployment of the next generation of materials and chemicals into the academic world. It’s really inspired, I think, a lot of students to bring these tools, whether ours or other ones, into the research enterprise, because the technologies that are going to save the planet 10 years from now, or 20 years from now, some of them are at companies today, but a lot of them are at universities today and are just waiting to be brought out. The faster we can get those out the door, the better off we are as a human society.

[0:18:28] HC: Your team has published a number of research articles. What benefits have you seen from publishing your work?

[0:18:34] GM: It falls into those two categories that I mentioned. The first is that there is certainly a benefit when we publish new machine learning technologies, or approaches or frameworks, what have you. Getting feedback from the community can be really powerful. What we find very often is we’ll publish one idea. For example, we published some very early work on uncertainty quantification, and particularly, the materials machine learning community has really taken that and run with it and developed all kinds of new, really cool uncertainty quantification frameworks. It feels that we were one of the seeds, certainly not the only one, but one of the seeds in creating that conversation.

I think it’s been healthy for the community to have folks like us who are focused in this space, but also very applied. Publishing what we do as a starter for more conversations, I think it helps us attract good talent. It helps people understand that we’re doing cutting-edge research and continue to invest in driving forward the field. Really, I take it as a – it’s a little bit of a feather in our cap and a source of pride that we get to help the world move along into this new era of AI.

On the other side, obviously, when we work with material science research for developing new materials, that’s a little different. We, a lot of times take somewhat less, I don’t want to say somewhat less credit. We participate fully. But the goal there is to get the work published and to help people understand what materials breakthroughs are there. To the extent machine learning helped, of course, we want to acknowledge that and show people how effective this has been as an approach. I think it’s really a one-two punch of hey, we discovered this really cool new battery material and it’s exciting to be part of that conversation. Also, showing that it’s a machine learning-enabled battery material discovery is a really powerful – it’s a powerful thing to be able to say.

That definitely, as we tend to work in sustainability-oriented domains, especially on the academic side, they’ll really across the board. A lot of folks on our team, I mean, and I connect with every single one of them. Many, many folks join our team, because of our commitment to sustainability and commitment to making the world better via materials and chemicals. How do we create healthier ways of living? It’s really a win all the way around. It’s scientific progress. It’s societal progress. It’s recruiting good talent to – and it’s boosting Citrine’s brand, which is a distant fourth priority, but still a nice thing to have.

[0:21:04] HC: Those are a number of great benefits. I definitely love to read the published work from a lot of the companies that are putting it out there. It’s great to have more advancements in this field. Is there any advice you could offer to other leaders of AI-powered startups?

[0:21:20] GM: This one may be a little bit controversial, but why not? I think we’re at a moment now where AI is, I don’t think – I know we are in a moment now where AI is a center of attention in a way that it hasn’t been for nearly a decade. There’s always been a lot of promise, but this LLM conversation and foundational model conversation, generative AI, all those sorts of things are really bringing this to the fore.

The risk, I think, that a lot of companies run is that they create themselves to be an AI company. I don’t know that that is the best way to go about it. My advice would be, focus on what value you want to deliver to the folks you work with, to your customers, to your partners. If you deliver value, whether it’s using AI or not, that is what creates a successful business. I think, probably for the next 18 to 24 months, roughly, just saying AI probably gets you enough attention that you can get some momentum from that.

I think, if the focus strays from just creating value for a customer, you’ll end up building something that doesn’t have staying power. I think it’s on every leader of an AI company or company that uses AI, which is how I think of Citrine, it’s incumbent on us to look at ourselves as creating something much bigger than AI, where AI is but one tool in the tool belt for delivering for our customers and against our mission. That’s even I, I will admit, have been just absolutely stunned by what these LLMs can do and excited by it and have done my fair share of toying around with them and intensive research into them. It’s okay to have those moments, and we should have those flights of fancy. But when it comes down to brass tacks decision making, I choose customer value over AI flash every single day.

[0:23:00] HC: Finally, where do you see the impact of Citrine in three to five years?

[0:23:04] GM: We’ve already started to see some of it. We’ve seen companies remove toxic chemicals from important products much more quickly than they could have otherwise. We’ve seen companies reduce their energy consumption. We’ve seen companies reduce cost and reduce carbon input. Those are all really exciting to me. I think we’re going to see more of that. What I’m really, really excited about and what I think the impact will be longer term is that for a long time, the way new products are designed has been, you have a list of materials, chemicals, whatever inputs you might use. If you’re designing a new phone, or keyboard, or something, you pick from that list.

For phones and keyboards, it’s probably fine. When you think about something like a car, or a plane, or these large high-energy consuming devices, it turns out if you can pick exactly the right material, and engineer exactly the right material into the brake rotor of a car and make it 1% lighter, that actually fundamentally affects the energy consumption of that car for its entire life. My goal for Citrine over the next three to five years, and hopefully even sooner, is full integration with three-dimensional design. How do we get a 3D design package, that car designer, when they’re designing the brake rotor to not just think of materials as a list, but to think of materials as one of the levers they can use to create the most efficient, most exciting product for their customers.

I think Citrine now has the technology and has been developing it for a long time to create materials as a degree of freedom in the product design process. If you just think about how that would change the industry, I mean, it makes materials and product companies connect in a way they never have before. I think it creates really exciting opportunities, both for new products and for a more sustainable future for the planet.

[0:24:51] HC: This has been great, Greg. Your team at Citrine is doing some really interesting work for material science. I expect that the insights you shared will be valuable to other AI companies. Where can people find out more about you online?

[0:25:01] GM: Absolutely. They can find us at citrine.io. We’re always looking for great people and always looking for great partners. Please, feel free to reach out, and excited to connect with your audience.

[0:25:13] HC: Perfect. Thanks for joining me today.

[0:25:15] GM: Absolutely. Thank you for having me, Heather. It’s been a real pleasure.

[0:25:17] HC: All right, everyone. Thanks for listening. I’m Heather Couture. I hope you join me again next time for Impact AI.

[END OF INTERVIEW]

[0:25:28] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.

[END]