Daniel fills us in on the issues Astraea aims to solve and the role of machine learning in its mission. We find out what makes satellite imagery unique (and uniquely challenging to work with) and how Astraea ensures that its models continue to meet customers’ needs over time. Daniel shares insight into the ML development process and advice for other leaders of AI-powered startups. Tune in to discover the balance between model accuracy and explainability, the importance of transparency when it comes to voluntary carbon markets, and more!
- Daniel Bailey’s background and how it led him to create Astraea.
- What Astraea does; the planetary problems it aims to solve.
- The role of machine learning in Astraea’s technology.
- The insights Astraea extracts from satellite data and the models they use to do so.
- What makes satellite imagery unique (and uniquely challenging to work with).
- How Astraea ensures their models continue to meet customers’ needs over time.
- The balance between model accuracy and explainability.
- Astraea’s ML development process.
- The first steps to solving the business case with ML.
- The importance of involving stakeholders in the development process.
- Daniel’s advice for other leaders of AI-powered startups.
- Why it’s critical to stay focused on the business needs.
- The training data required to meet global needs.
- Daniel’s vision for the future impact of Astraea.
[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven machine learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.
[00:00:33] HC: Today, I’m joined by guest Daniel Bailey, CEO and Co-Founder of Astraea to talk about leveraging geospatial data. Daniel, welcome to the show.
[00:00:42] DB: Thanks, Heather. Thanks for having me. Very excited to talk about geospatial. I never pass up an opportunity to talk about the intersection of business, geospatial, and AIML.
[00:00:51] HC: Daniel, could you share a bit about your background and how that led you to create Astraea?
[00:00:55] DB: Sure. So I’m going to date myself here. I’m a little bit old. In the nineties in college, I thought I wanted to be a doctor. I always wanted to do something to help people, very mission-driven. I decided I didn’t like spending my time in a lab 24/7. So I switched over to kind of math and stats and then found myself still having that need to help people and decided to join the army.
That really kind of introduced me to the space, late nineties. That was pretty heavy times being in your 20s in a sensitive compartmented information facility and being able to pick up the red phone and requisition satellites when they weren’t commercially available. So that kind of had me catch the bug of what the power and opportunity was and the amount of data that could be collected from these earth-observing platforms.
Kind of fast forward after eight years and a deployment in Iraq, where I got to live and help secure the first free national election in Iraq in the early 2000s, post 9/11. I realized just how much importance and impact it can have to make evidence-based decisions. So when we fast forward and the AIML revolution kind of came back around, I got an opportunity with a good friend of mine here in Charlottesville to kind of fuse my past with satellites with my ML background and start Astraea to try to solve some of these big challenges and problems we’re facing, from climate change to a lot of sustainable development, water, other things that we’re facing today as a civilization.
[00:02:31] HC: So what does Astraea do, and why is this important for solving many of these planetary problems that you mentioned?
[00:02:37] DB: We started in 2008 around the concept that more and more data was coming down from space. Space was commercialized in 1992 with the first satellite going up in 1999. The Obama administration in 2013, with its open government initiative, made troves and troves of satellite imagery from NASA available. Then the European Space Agency joined in on that to provide tons of satellite data that had largely been untapped to look for questions.
As you know, in AI we often work on found data, data that wasn’t collected for the purpose that we’re interrogating it for, to see if we can find hidden insights in it. So in 2008, we came together, started Astraea, and decided that we needed to build capabilities to make this data broadly accessible and available for people working to address these big challenging problems that we’re facing.
[00:03:33] HC: What role does machine learning play in your technology?
[00:03:37] DB: It’s really core, going back to the early days, all the data coming down. I’m sure you’ve looked at it. There’s more data, kind of the golden age of data, big data that led to this kind of golden age of AI. Now, increasingly, we can put sensors on everything. IoT started and now opening up to space, 14,000 satellites zooming around the earth collecting data. We’re kind of in this golden age of measurement. There’s more data than you can look at individually. You really have to have something like AIML to recognize those patterns and extract those valuable insights out of the data. So it’s absolutely crucial and core to everything that we do.
[00:04:18] HC: What kinds of insights do you extract? Maybe you have a couple examples of models that you train in order to extract some new insights from satellite data.
[00:04:27] DB: Yes. As you can imagine, I mean, there’s a lot of different satellite data types, and we have a principle that we really try to use the simplest model possible that is fit for purpose to answer the business needs. So we do every – we run models, everything from random forest to deep learning, computer vision models, and other models.
It’s important for us because we do things like monitor millions of acres of forest that has been put in conservation easements and others and the offset. You hear how today it’s – those are becoming more and more critical for helping avert and sink some of the carbon that we’ve built up over the last 30, 40 years with the fossil fuels. So we’re able to monitor those forests to make sure that people are abiding by those offset rules and not illegally harvesting the trees and helping protect the forest.
We’re also using it in other ways to better specify the grid as we’re working with a lot of renewable energy developers and the energy transition to help them find substations, electric grid lines that are poorly mapped and poorly understood. That opens up a number of parcels that can then be considered for putting these renewable power-generating assets on. Then, of course, as a sustainable focus, we all – in everything that we do, we make sure that our customers have access to the best information around critical habitats, water resources, and other supporting information that they need to make their decisions.
[00:06:01] HC: So getting a little bit more specific with some of the machine learning, you mentioned infrastructure. So I suspect that some of the models you’re training might just be to detect infrastructure. Would that be one example?
[00:06:14] DB: Yes, absolutely. We detect infrastructure, computer vision models, understanding, characterizing what a substation looks like. Where is it at, first of all, and what does it look like? What kind of capacity does it have on it? Could it actually handle additional load going into it for renewable energy developers?
[00:06:32] HC: What kinds of challenges do you encounter in working with satellite imagery?
[00:06:36] DB: Satellite imagery is a unique beast, for sure. Having been in the AIML space and done everything from marketing analytics to anti-fraud analytics and this type of work with satellite imagery, the dimensionality of the data is completely unique. It has the spatial component: where is it on the ground? Then it has the temporal component. You’re right there dealing with spatiotemporal data, which is challenging in modeling, as you know.
Then you add in the scientific data, that spectrum, the different bands, and other sensor types that it could be made of. So you very quickly get this data cube that is big, and a single image can be a gigabyte in size, just one look in time over one area. So you can imagine very quickly the amount of compute infrastructure and capabilities you need just to deal with the data is huge, as is the processing and prep to get it to a point where you can do anything with it.
Where we find ourselves with that, it’s kind of funny in our space, is that because that’s so much work, we, as technologists, basically have a tendency to be like, “Wow, look at all my hard work. We want to tell everybody about our hard work.” As you talked about, most of the business people we’re working with, they just want an answer. So we find ourselves sometimes kind of thinking about all the technology and speaking on the technology. We really have to think about not just how cool it is to be in space and satellite imagery and the data, but how do we turn that into the business problem and meeting the need of the customer that we have.
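[Editor’s sketch] To make the “data cube” arithmetic above concrete, here is a rough back-of-the-envelope calculation in Python. The function name, scene dimensions, band count, sample width, and revisit rate are illustrative assumptions, not figures from Astraea or from any particular sensor.

```python
def cube_size_bytes(width_px, height_px, bands, timesteps, bytes_per_sample=2):
    """Uncompressed size of a space x band x time data cube.

    All parameters are hypothetical inputs chosen for illustration.
    """
    return width_px * height_px * bands * timesteps * bytes_per_sample


# One look in time over one area: an assumed 10,000 x 10,000 pixel scene
# with 5 spectral bands stored as 16-bit (2-byte) samples.
single_scene = cube_size_bytes(10_000, 10_000, bands=5, timesteps=1)
print(single_scene / 1e9, "GB for a single scene")

# The same area revisited 73 times (an assumed revisit every 5 days for a year).
one_year = cube_size_bytes(10_000, 10_000, bands=5, timesteps=73)
print(one_year / 1e9, "GB for one year over the same area")
```

Under these assumptions a single scene comes to about one gigabyte, which matches the order of magnitude mentioned above, and a year of revisits over one area is already tens of gigabytes before any derived products are stored.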
[00:08:12] HC: So once you have a model and once it’s deployed and out there and doing its job for your customers, how do you ensure that it continues to work over time? What if something in the data changes? How do you catch that?
[00:08:26] DB: Yes. Making sure we understand how our customers are leveraging the model and what kind of impact is important, so they get that information. We do run a lot of traditional techniques. We do champion-challenger techniques so that when we have a model in production, we’re constantly looking for a better model and innovating on that capability. We are looking at new data feeds. Constantly, in space, you’re getting more and more sensors going up. So we have an opportunity to fuse more data. That creates a better model over time, and we can innovate on the model that we have in production at that time.
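[Editor’s sketch] The champion-challenger idea mentioned above can be illustrated with a minimal Python example: the production model (champion) is replaced only if a candidate (challenger) beats it on the same held-out labeled data by some margin. The class, function names, toy threshold models, and the `min_gain` margin are all hypothetical and are not a description of Astraea’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    """A named model wrapper; `predict` maps one feature value to a class label."""
    name: str
    predict: Callable[[float], int]


def accuracy(model: Candidate, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for x, y in zip(xs, ys) if model.predict(x) == y)
    return correct / len(ys)


def pick_champion(champion: Candidate, challenger: Candidate,
                  xs: Sequence[float], ys: Sequence[int],
                  min_gain: float = 0.02) -> Candidate:
    """Promote the challenger only if it beats the champion by at least min_gain."""
    gain = accuracy(challenger, xs, ys) - accuracy(champion, xs, ys)
    return challenger if gain >= min_gain else champion


# Toy held-out set: a single score per example and a binary ground-truth label.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]

champion = Candidate("threshold-0.3", lambda x: int(x > 0.3))
challenger = Candidate("threshold-0.5", lambda x: int(x > 0.5))

winner = pick_champion(champion, challenger, xs, ys)
print(winner.name)
```

In practice the comparison metric, the evaluation set, and the promotion margin would be chosen to reflect how customers actually use the model.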
[00:09:01] HC: How do you think about the balance between model accuracy and explainability? Are they both important for the types of models you’re developing? Or is one really more critical than the other?
[00:09:12] DB: That’s an interesting question. So in my time in anti-fraud modeling and others, explainability has become very important, and it is in ours, too. As you can imagine, we work in a couple of modes, and so accuracy is paramount. So making sure that we know where on the earth and what type of substation that is is important.
Explainability becomes important, but one of the advantages we have with working with satellite imagery is we have a really good picture. So when our model flags something, we can actually show the customer what it looks like on the ground at that time and why it is. So we get a little bit more. Even though we use those advanced techniques of convolutional neural nets and others, which are really often less explainable, we get the advantage that we are able to tie it to a specific instance in time that can be interrogated when our model flags it as a concern.
So this really comes in handy for our forest loss work that we do for the land offsetting. Because if a landowner has an event like that and they’re getting paid to offset their land, if there’s some type of illegal harvesting, having that image really helps assure that the accuracy and explainability are there for the model.
[00:10:24] HC: So just due to the visual nature of this being images, you can point to exactly what they’re looking at and explain predictions that way, whereas for some other modalities of data, it’s much harder to look at, and to get interpretability from that point of view.
[00:10:42] DB: Yes, absolutely. It’s evolving. Our space is evolving, some of the more challenging problems. That’s the theme as it is today and kind of monitoring those use cases. As we move into some of the new sensor types that are going up there, we’re able to do things like measure greenhouse gases and do carbon modeling, carbon sink, and emission modeling.
That’s a little more challenging, and that’s where explainability does really start to come into play and having ground truth data. Because by nature, we’re measuring from space, which is 600 kilometers up in the air or more, and you got a lot of complicating factors. So understanding who was emitting something, it has to be modeled, and you have to really work on the science. That relies on having better and better ground truth data to calibrate the sensors that are in space to what’s going on on the ground.
So right now, accuracy is kind of driving it. But explainability for the application of AI and ML to satellite imagery is definitely on the horizon and is going to catch up with the rest of the explainability discussion in AIML that we’re hearing today.
[00:11:49] HC: So going back to machine learning projects themselves, how does your team plan and develop a new ML product or feature? What steps do they take particularly early in the process?
[00:12:00] DB: Yes. I think it’s a – our big driving principle there is business need, making sure that we have a good understanding of what the business need is. It’s easy to get in love with the tech. By nature, we know AI and ML technology is cool to begin with and ChatGPT. You throw in some space in there too. Getting the data from space, it’s a technologist’s dream, in my opinion.
But without really focusing on the business need and then having that business case justification, you can kind of be at risk of creating a hammer and being like, “Hey, here’s a hammer. Look at my cool hammer. Where’s the nail?” We’ve seen that over and over in our space, where there’s a lot of publicity around counting cars in Walmart parking lots. That’s great. But when you actually do the business case, by the time you source the data and apply it, you’re looking at millions and millions of dollars to be able to count those cars. The value of that isn’t there.
So when we think about creating ML products and features within the product, we think about using the simplest approach first. If we can solve it [inaudible 00:13:12], we solve [inaudible 00:13:13]. If we need more complexity with a convolutional neural net, we’ll apply that. As you know, software isn’t cheap to build, and neither are AIML models. So we’re really looking to make sure that we can build features that drive value and do that as quickly as possible and then iterate.
[00:13:34] HC: So once you’ve got the business case figured out, what’s the next step in terms of figuring out how to solve it with machine learning? Is your team in the beginning maybe focused on exploring the data or getting that simple baseline solution up and running? Or are they digging into the research? What’s kind of the first step or something that I haven’t mentioned?
[00:13:57] DB: I’d say that you’ve pretty much categorized all of them. We always do a research project to begin with. We’re looking at a problem. We want to know what’s the state-of-the-art that’s out there. What’s been done before? That usually looks at different publications, spending a few days to maybe a week, really digging into there and understanding what’s been done before.
Once we get a sense of that, we come together. We specify the problem, and we establish our first iteration, which is, to your point, to get to a POC. We want to take as little data preparation as possible and the quickest modeling possible, and establish a baseline against the ground truth data to understand how well we can do with very little. If we see some response, right? We see some good – some reasonable results, some reasonable accuracy or whatever statistic we’re using, intersection over union or other statistic, then we’ll come back together and evaluate that and determine where we go next.
Usually, what we like to do when we build all this is work with our customers and make sure that we are in dialogue with them, that we’re not building in a vacuum. We’re working with some of our cultural customers. We want to make sure that we’re getting those results [inaudible 00:15:10] into their hands and understanding that they’re useful for them. That’s one thing that’s critical from all this is just the iterative approach to building out the model and then a very experimental-driven design process.
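[Editor’s sketch] Intersection over union (IoU), the baseline statistic mentioned above, compares a predicted mask with a ground-truth mask by dividing the number of pixels both mark as positive by the number of pixels either marks as positive. The following minimal Python version works on small binary masks stored as nested lists; the function name and the toy masks are illustrative assumptions.

```python
def iou(pred, truth):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.0.
    return intersection / union if union else 1.0


# Toy 2 x 3 masks: 1 marks a pixel labeled as the target class (say, forest loss).
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]

print(iou(predicted, ground_truth))
```

Here two pixels overlap and four pixels are marked by at least one mask, so the score is 0.5. A quick baseline model would be judged by a score like this before deciding whether more data preparation or a more complex model is worth the investment.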
[00:15:25] HC: So first step, literature search. Second, baseline solution and then iterate. But the other key part that you mentioned is making sure all the stakeholders are involved in the discussions and are involved really throughout the process so that you converge to the most appropriate solution for the thing that you’re trying to solve.
[00:15:46] DB: Yes. That sounds dead on. We largely – you’re probably familiar with the Cross-Industry Standard Process for Data Mining [inaudible 00:15:53]. But they really did. They got it well with a lot of that, making sure you understand the business need, evaluating the data, getting to the baseline, and then iterating. We generally follow that process. We’ve augmented it some, but it’s a really sound process. It’s stood the test of time.
[00:16:15] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[00:17:20] DB: I think the biggest one is just stay focused on the business need. There’s so much cool tech. The space moves so fast. I mean, look at today, and ChatGPT is like already – there’s how many news articles coming out that ChatGPT is yesterday’s tech, and here’s the next 10 startups that’s going to be – I mean, things just move so fast. When you’re in there, not just doing the tech but building a business around it, it can be overwhelming at times.
So staying focused on the business need, what’s your mission, why did you do this to begin with, and what your customers’ needs are, is critical. I think, as we’ve already discussed, if you can stay focused and then continue to use the simplest approaches to deliver quickly, deliver results, as you’d like to say, and driving for impact, and iterate, and just keep iterating.
Then I think the third thing is an interesting one, and some people fall on different sides of it. But I really think it’s embrace the open ecosystem approach. We have contributed. We benefited from open-source software and techniques. We’ve contributed open-source software and techniques with Medusa. They’re very much a yes and. I think especially in our space in the geospatial AI intelligence space, it really is going to take a community to attack and to provide the capabilities we need to resolve some of these intractable problems we’re facing as a planet.
[00:17:51] HC: Yes. There’s a ton of open source, like you said, both for geospatial and AI and the intersection of the two. We’re all shooting for the same goal here. So if we can head there together, I think we’ll get there a lot faster and more efficiently.
[00:18:08] DB: Yes. That’s amazing. One of my DevOps engineers, they put it into ChatGPT. I said, “How much would it cost for me to build a ChatGPT?” His response was, “If you got about $100 million and a team of 200 leading AI engineers and seven years, you can build a ChatGPT.” That’s great. I’m glad that OpenAI exists, and they’re starting to make some of that available. They’re also looking to monetize it now.
But there’s a lot of work in what we do. This new dataset that’s coming along that’s from satellite imagery, we have petabytes and petabytes of data that doesn’t have ground truth. We use it for massive challenges. We’re supporting the people of Ukraine with the Ukraine Observer and then providing satellite data to over 20 humanitarian aid organizations and nonprofits to do [inaudible 00:18:58] before they run missions to get people out to deliver food. When disaster strikes, open data is available for satellite imagery, the platform.
So the challenge, that’s great. But the challenge is there’s tons of training data from ImageNet, of cats on the Internet, and other things. There’s very little labeled training data for earth observation. For some of these big global challenges, we really need more training data to build better models to meet the needs that we’re seeing globally.
[00:19:32] HC: You need more training data, and you need to think carefully about what types of models and how to train them based on the lack of labeled training data. So both of those come into play as challenges with anybody tackling satellite imagery. Finally, where do you see the impact of Astraea in three to five years?
[00:19:51] DB: Yes. I think in three to five years, I really believe our platform and the work we’re doing in a platform like ours will be very important as the world and the businesses within it focus more on sustainable development, making better decisions about the planet we live on. You can only do that by incorporating the impact of the things you are doing, not just within the four walls of your organization but in your local community and then, for manufacturing and some of these others, more globally, because what you’re doing sometimes impacts everybody across the world.
So we’re going to continue to provide valuable insights and solutions to help our customers and partners make better decisions about the planet that we live on. As a benefit corporation, that’s core to us. We make sure that we continue to make – we have our commercial federal business. Then we have the [inaudible 00:20:48] and work with all the nonprofits and others that are really working and trying to provide our platform and stay at low to no cost to them to work on these really important problems.
I hope that we can be a part. I know that we’re not going to solve it. It’s going to take everybody to work on climate change and the impacts of it. We’re really looking forward to helping be a part of that solution as we drive to the next 5, 10, 15 years. What it’s going to take to get climate back to where it needs to be.
[00:21:21] HC: This has been great, Daniel. Your team at Astraea is doing some really interesting work for earth observation. I expect the insights you shared will be valuable to other AI companies. Where can people find out more about you online?
[00:21:33] DB: Sure. You can go to astraea.earth, E-A-R-T-H. That’s our website there. Then you can also find us on Twitter at @AstraeaInc. Then, of course, we have a LinkedIn page as well at Astraea.
[00:21:48] HC: Perfect. Thanks for joining me today.
[00:21:50] DB: Thanks, Heather. It’s been a pleasure speaking with you about satellite imagery and all the cool things we can do with it.
[00:21:57] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you join me again next time for Impact AI.
[00:22:07] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.