Climate change is one of the most pressing issues of our time, and today’s guest, Ankur Garg, and his team at BlocPower are using machine learning technology to mitigate it. BlocPower is a climate technology company focused on making buildings in low- and middle-income areas more environmentally friendly. Its expertise lies in developing products and services that lower or eliminate the barriers preventing access to energy efficiency and electrification retrofits. And this all starts with gathering, checking, annotating, and understanding enormous amounts of data (BlocPower currently has over 40 terabytes of data in its data lake!).
In this episode, Ankur talks about the innovative ways in which BlocPower deals with its data, the challenges it faces given the size and scope of its datasets, why machine learning technology is central to the work it does, and how it measures the impact of its technology.
- Ankur’s career journey prior to joining BlocPower.
- Why Ankur decided to join BlocPower.
- The inspiring work that BlocPower is doing to contribute to solving the problem of climate change.
- The central role that machine learning plays in BlocPower’s approach.
- Ankur gives examples of some of the different types of machine learning models that BlocPower uses.
- The size of BlocPower’s data lake and the types of data stored within it.
- BlocPower’s innovative approach to annotating data.
- The importance of high-quality training data sets in the machine learning space.
- Challenges that Ankur and his team face when training machine learning models on their core dataset.
- Technological advancements that have allowed BlocPower to achieve what it has.
- How BlocPower measures the impact of its technology.
- What Ankur believes the future holds for BlocPower.
“Climate change is one of the primary problems of our generation, and BlocPower is making a huge dent in solving that.” — Ankur Garg
“Machine learning really excels at ingesting huge volumes of data and inferring key relationships between these data points to come up with an optimal output or a solution.” — Ankur Garg
“Labeling the data and annotating is extremely critical. If your training data set is not of a good quality, no matter what algorithm you use, it won't really perform well.” — Ankur Garg
“You need a lot of high-quality data for machine learning and artificial intelligence to be productive.” — Ankur Garg
Ankur Garg on LinkedIn
BlocPower Email Address
[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven machine learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people in planetary health. You can sign up at pixelscientia.com/newsletter.
[00:00:33] HC: Today, I’m joined by guest Ankur Garg, Director of Data Architecture and Analytics at BlocPower, and we’re going to talk about green home and building upgrades. Ankur, welcome to the show.
[00:00:44] AG: Thank you, Heather. It’s a pleasure to join your podcast.
[00:00:47] HC: Ankur, could you share a bit about your background and how that led you to BlocPower?
[00:00:51] AG: Sure, yes. So I have been working in the areas of data analytics and machine learning for about 15 years now, sort of equally split between the financial services industry and clean technology space. Prior to joining BlocPower, I was working for a big financial services company. I was leading the machine learning team there. When BlocPower reached out to me, I was just really inspired by the kind of work that BlocPower is doing.
Climate change is one of the primary problems of our generation and BlocPower is making a huge dent in solving that. So I felt really inspired by just looking at the vision and mission of the company and the kind of work that BlocPower is doing in that space, particularly in the low and moderate-income households. I just thought it’s a great opportunity for me to utilize my skills. Data is central to solving any of the problems, particularly a problem like climate change, which is so complicated, right?
So I just thought it makes a lot of sense for me to make that move. I’m glad that I did that because I think we are doing some incredible work over the past two years since I’ve joined BlocPower.
[00:01:58] HC: So what does BlocPower do, and why is it important for sustainability in the environment and fighting climate change?
[00:02:05] AG: Yes, yes. Sure. So BlocPower is a Brooklyn, New York-based climate technology company that is making buildings throughout America smarter, greener, and healthier. Since our inception in 2014, we have helped thousands of low and moderate-income building owners, tenants, and building managers in more than 24 cities across New York State, California, New Jersey, Wisconsin, Massachusetts, and Washington, DC to understand the unique possibilities of energy efficiency and renewable energy retrofits of their buildings.
We have developed unique expertise in serving underserved communities and diverse communities, understanding the barriers that exist to access energy efficiency and electrification retrofits, and developing products and services to lower or eliminate those barriers.
[00:02:52] HC: What role does machine learning play in this technology?
[00:02:55] AG: Machine learning is very important, right? The reason is the problem that we are trying to solve is extremely complicated, right? So we are basically trying to electrify the buildings. Most of these buildings are pretty old. They have been using natural gas or oil for their heating needs in particular. Usually, in most places, gas is cheaper than electricity. If you talk about electric vehicles, gasoline is usually pretty expensive, right? So gradually, in terms of the financial aspects, there is an incentive for you to switch from gasoline to electricity. But in the case of buildings, because natural gas is still cheaper, it’s more complicated from a financial standpoint, right? So the financial aspect is one of the aspects.
The second, in general, there are a bunch of things that could be done inside a building. But you really need to assess the need of a given building. What does the envelope of a building look like, right? What are the characteristics there? What kind of heating or cooling systems do they have? What kind of occupancy levels are there in a given building, right? So there’s just too much data that you have to look at in order for us to come up with an optimal solution, right? That’s the role of machine learning, right? Because machine learning really excels at ingesting huge volumes of data and inferring key relationships between these data points to come up with an optimal output or a solution, right?
Therefore, we are using machine learning to analyze the huge volume of data that we have on a lot of different aspects of this problem and to come up with predictive models to identify, for example, the buildings that would give the most bang for your buck, where investing money to retrofit the building would result in a significant reduction of greenhouse gas emissions, and to solve a whole lot of other problems.
[00:04:47] HC: So what exactly are these models trying to predict? Is it rank order of different buildings of which ones are best suited for retrofitting? Or is it something more specific?
[00:04:57] AG: Yes, yes. So we have a few different models that we use. One of the models that we use is called energy efficiency potential, and it’s a classification model which analyzes the building stock in a given geography. It looks at a bunch of building characteristics. Based on that, it identifies which are the buildings that are most highly suited for us to go in and do energy efficiency retrofits.
There is another model that we use. Actually, a whole basket of models that we use to do sort of feasibility analysis of different types of upgrade measures, right? So what’s the feasibility for us to go in and install an air source heat pump, for example, or for us to be able to do weatherization and air sealing kind of work, or to do a solar installation, right? So there are a whole bunch of models that we use just for that.
We use another model that looks at demographics of the leads, right? So it’s a lead scoring model. It’s a regression model that essentially scores leads, looking at various demographic attributes of the population in a given geography where we are managing a program.
[00:06:06] HC: What are the different types of data that these models work with? Does each of the models you mentioned there have a different type of input? Or are there some similarities, and what are some example data types?
[00:06:17] AG: Yes, yes. So that’s a good question, right? When it comes to data, at this point, we have over 40 terabytes of data in our data lake. We have essentially created a digital twin of every building in the US. There are more than 130 million buildings in the US, and we have information about a bunch of characteristics such as the square footage, the year that the building was built, what kind of heating and cooling systems are installed in a given building, how many rooms, how many number of floors, right? So a whole host of data around that.
We have a huge data set just looking at the permits, right? So when you do any of these upgrades to these buildings, you have to apply for a permit, right? This permit data is publicly available. So we have over half a trillion rows worth of data for the last 20 years of these permits that have been requested across the US. All of that data is in natural language. It’s essentially a description, and we are using a natural language model to essentially extract the most useful nuggets of information from that data.
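The extraction step Ankur describes might start from something like this rule-based baseline in Python. This is a sketch only: BlocPower's actual natural-language model is not described in the episode, and the category names and patterns here are illustrative assumptions.

```python
import re

# Illustrative baseline only: extract upgrade-measure signals from a
# free-text permit description. The measure names and regexes below are
# assumptions, not BlocPower's production taxonomy or model.
PATTERNS = {
    "heat_pump": re.compile(r"\bheat pumps?\b", re.I),
    "solar": re.compile(r"\bsolar\b|\bphotovoltaic\b", re.I),
    "insulation": re.compile(r"\binsulat\w*\b|\bweatheriz\w*\b", re.I),
}

def extract_measures(description: str) -> set:
    """Return the set of upgrade measures mentioned in a permit description."""
    return {name for name, pat in PATTERNS.items() if pat.search(description)}

desc = "Install ductless mini-split heat pump and add attic insulation"
print(extract_measures(desc))  # {'heat_pump', 'insulation'} (set order may vary)
```

In practice a learned model would replace the hand-written patterns, but a baseline like this is a common first pass for labeling permit rows at scale.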
We have a lot of IoT data, right? Because once we deploy a project, we usually deploy a smart device. Usually, a thermostat gives us interval data at the minute level, right? So minute-level information on the temperature within a building and the adjustments that are being made. What kind of energy is the equipment consuming? That data is helping us to predict if a system is in need of maintenance or if it’s about to break down so that we can serve our customers better, right? So those are some of the broad categories of data. Then we, of course, use utility bills quite a bit just to understand the energy profile of a given building.
[00:08:00] HC: How do you gather all this training data? Can you automate parts of the process, or do other parts of it require human interaction?
[00:08:08] AG: Yes. So gathering, for example, the utility data, right? We use an API. Once we get the approval from the end user, we connect to the utility. That data is automatically ingested. We stream the data into a data lake. The same goes with the IoT, right? Once we have set up the ingestion pipeline with the IoT devices, the data automatically flows into our data lake.
But there’s a lot of data that we have, over a period of time, looked for through a lot of open source data portals. For example, New York City has a big open data initiative, and we have to tap into some of these platforms, usually via APIs, to source all of that data and aggregate all of that data, which in itself is a huge task, right? Because before you could utilize the data for predictive analytics or machine learning, you have to curate all of this data, right? That in itself, I would say, is 80% of the effort: just to source this data, engineer this data, validate and do quality checks on it, and make it available for the data scientists to then infer from it, feature engineer it, ultimately fit machine learning models, and evaluate the performance.
[00:09:16] HC: Do you need to annotate the data? How do you assign the labels that you’re trying to predict? Is that part of the automated process? Or does it stop there?
[00:09:23] AG: Yes, yes. So for that, usually, the approach is we actually have to use more machine learning models to annotate our data. So a good amount of training data is available from the thousands of projects that we have delivered over these years across various cities, right? So there we have a very high-level granular information on the building systems, what was installed in a given building, what did we go in and retrofit, how is the system performing, right? So we have those data points from a period of time, and that serves as the training data.
But you need a lot more data for you to scale these models. For that, you have to be innovative, right? For example, one thing that we recently did was we flew a drone over a city. The City of Ithaca has a target to cut down their greenhouse gas emissions, and we are sort of managing that program for the City of Ithaca. So we literally flew a drone with an infrared camera, and the output was thermal images for about three blocks of buildings. We were doing it as a proof of concept.
Then we are using computer vision and machine learning to identify the buildings where you could literally see the heat escaping from those buildings through the envelope, right? So you have a heat map where you could see from the top of the roofs, from the seals, the heat is actually escaping, right? So we are using computer vision.
First of all, we used Segment Anything to identify the buildings, right? Because in a given snapshot, you could have multiple buildings within the same frame. You could have parking lots. You could have roads, right? So how do you just focus on the building? That was one part of the module. The second part was we created a statistical model because we did not really have labeled data in this case to be able to use a machine learning model to automatically identify the buildings with air leakage, right? So we had to come up with a statistical model to identify what are the buildings where you do see an air leakage, and from about 2,000 buildings, we were able to identify about 70-odd.
Now, these 70 buildings are serving as the training data set that we are utilizing machine learning on to identify more of such buildings where you could actually see heat escaping from the envelope, right? Then this would become sort of a training data set for when we utilize this approach to fly drones in other cities where we are managing these programs, to automatically identify buildings that are in need of retrofit.
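The statistical flagging step described above might look roughly like the following. This is a minimal sketch under stated assumptions: the inputs are mean rooftop temperatures per already-segmented building (the computer-vision segmentation is omitted), and the z-score criterion, threshold, and synthetic data are illustrative, not BlocPower's actual model.

```python
import numpy as np

# Hypothetical sketch: flag buildings whose mean rooftop temperature is a
# statistical outlier versus the surveyed block, as a proxy for heat
# escaping through the envelope. Threshold and data are assumptions.
def flag_leaky_buildings(roof_temps: np.ndarray, z_thresh: float = 4.0) -> np.ndarray:
    """Return indices of buildings with outlier rooftop temperatures."""
    z = (roof_temps - roof_temps.mean()) / roof_temps.std()
    return np.where(z > z_thresh)[0]

rng = np.random.default_rng(0)
temps = rng.normal(5.0, 1.0, size=2000)  # synthetic readings, 2,000 buildings
temps[[10, 500, 1500]] += 10.0           # inject a few obvious leaks
print(flag_leaky_buildings(temps))       # includes indices 10, 500, 1500
```

The flagged subset would then be reviewed (for instance by a buildings engineering team, as Ankur notes later) before serving as labels for a supervised model.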
This is one example because labeling the data and annotating is extremely critical. If your training data set is not of a good quality, no matter what algorithm you use, it won’t really perform well. Therefore, we spend a good amount of time on it.
[00:11:58] HC: It sounds like you really have to get creative in developing ways to get those labels too, needing their own models, their own modalities of data, just to obtain the labels.
[00:12:09] AG: Oh, yes. Absolutely. The second example, just because this is such an important problem, it’s just so important for you to get a high-quality training data set if you want to utilize machine learning, right? So a second example is we have a lot of permits data, as I said. More than half a trillion rows worth of data. Usually, the descriptions, they are all in natural language.
So we are utilizing machine learning to identify the buildings where there has been a recent retrofit, right? So probably, there’s been a recent upgrade from a traditional HVAC system to a more energy-efficient heat pump, right? Then we are utilizing that output because that gives us at the national level some of the buildings, a handful of them, who have had those [inaudible 00:12:53]. Then that could serve as a training data set for us to then utilize all of the other data points that we have on the buildings to be able to predict which are the buildings that are in need of retrofit, right? That’s a second example.
But what’s also very important is you have to get the domain experts to weigh in on it, particularly for the training data, right? We have a pretty solid buildings engineering team who reviews some of these outputs. Based on that input, we were able to identify a high-quality training data set that we are now utilizing for machine learning.
[00:13:27] HC: What kinds of challenges do you encounter in working with and training machine learning models on your core data sets?
[00:13:34] AG: Yes, several challenges. I mean, just the sheer volume of the data and the amount of computation that you need to be able to train these models, right? That’s to data engineer and train these models. That in itself is incredible. Honestly, I mean, 10 years ago, it would have been extremely difficult for us to do it.
But because today you have the cloud infrastructure available to you, it’s become possible for us to deploy that level of computational power to train on this data, right? So the computational power that you need to train these models, that’s one of the challenges that we have to deal with.
The second is just the diversity of the data that we have, right? For example, we source data from several different sources. I gave you a few examples. There are these open data portals. You have the tax assessment data. You have the permits data. How do you integrate all of that? Usually, you would do that based on the address, right?
An address, as you could imagine, is messy, right? Sometimes, an avenue could be spelled out fully. In some of the places, it’s just an AV with a dot. The address data is really, really messy, right? That’s one of the key challenges for us, and over a period of time we have developed a pretty solid algorithm to normalize those addresses and join all of this data together, right? So that’s the second challenge, just the integration of the data across the data sources.
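A toy version of the normalization idea can be sketched as follows. BlocPower's actual algorithm is not public, so the abbreviation table and cleanup rules here are illustrative assumptions; production address matching typically handles far more variants (units, directionals, ZIP validation, fuzzy matching).

```python
import re

# Hypothetical sketch of address normalization so the same building can be
# joined across tax, permit, and open-data sources. Table is illustrative.
ABBREVIATIONS = {
    "av": "avenue", "ave": "avenue", "st": "street",
    "blvd": "boulevard", "rd": "road", "dr": "drive",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(normalize_address("123 Fifth AV."))     # 123 fifth avenue
print(normalize_address("123 Fifth Avenue"))  # 123 fifth avenue
```

Once both sources normalize to the same string, a plain equality join (or a fuzzy-match fallback) can stitch the records together.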
The third is there are holes in this data a lot of times, right? Some of the most critical features that are very important for us to do the predictive modeling have a lot of missing data, especially for the rural communities. For a city like New York, where the city has spent a lot of time on gathering high-quality data, that's usually not a problem. But most of the rural communities have a huge dearth of data. There you have to come up with more models to impute the missing values as well as possible. So the third challenge is the data quality itself and how you deal with it.
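One simple imputation strategy consistent with what Ankur describes is filling gaps from similar buildings. This is a sketch under assumptions (the feature, grouping key, and median rule are illustrative; the episode does not specify BlocPower's imputation models):

```python
import numpy as np

# Hypothetical sketch: impute missing feature values (np.nan) using the
# median within each building-type group, falling back to the global median.
def impute_by_group(values: np.ndarray, groups: np.ndarray) -> np.ndarray:
    out = values.copy()
    global_median = np.nanmedian(values)
    for g in np.unique(groups):
        mask = groups == g
        fill = np.nanmedian(values[mask])
        if np.isnan(fill):  # group entirely missing
            fill = global_median
        out[mask & np.isnan(values)] = fill
    return out

sqft = np.array([1200.0, np.nan, 900.0, np.nan, 3000.0])
btype = np.array(["res", "res", "res", "com", "com"])
print(impute_by_group(sqft, btype))  # imputed: 1200, 1050, 900, 3000, 3000
```

More sophisticated approaches would train a model to predict the missing feature from the other features, but the group-median baseline is a common starting point and easy to audit.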
The fourth and a very important one is data privacy, right? We get a lot of data from our customers, and we want to ensure that all of the data is secure on our data infrastructure. So having a pretty solid data governance strategy to manage the data privacy, that’s very critical.
[00:15:49] HC: I think your cloud compute needs partially hit on my next question here. But why is this the right time to build this technology? Are there specific technological advancements that made it possible to do this now, wouldn’t have been possible a few years ago?
[00:16:03] AG: Oh, yes. Absolutely. I mean, as I mentioned, just the availability of public cloud has made it incredibly easy for you to deploy that amount of computational power, right? To give you an example, we sourced about 130 million flat files which essentially had a lot of building characteristics, right? These are flat files, and we had to simulate the energy profile, right? So utilize these building characteristics, integrate this data with the weather data, and then feed it into a software which is called EnergyPlus.
Developed by the Department of Energy, it is widely used by energy modelers to generate energy profiles for buildings. It provides a C++ SDK, right? So we couldn’t use Spark, for example, in this case. So what we literally did was we came up with a solution in Python, and we were able to spin up thousands and thousands of containers on the public cloud infrastructure. We were able to essentially use HPC, high-performance computing, to scale the data engineering work here, right?
AWS eventually published a case study on how we did that using a pretty novel approach, right? So something like this we couldn’t have imagined before the advent of the public cloud. You couldn’t really have these kinds of resources available on a permanent basis on-prem, right, just because it’s so incredibly difficult to manage the underlying infrastructure. But in the case of cloud, because these are made available as services to you, you could scale it up and scale it down based on the needs, right? That’s a cost-effective way to do it, right? So that’s one.
The second is the machine learning algorithms have evolved a lot in the past 10 years. I mean, recently, ChatGPT became publicly available for usage, right? These models are trained on enormous volumes of data with billions of parameters, so there's just been a sheer improvement in the algorithms and machine learning. If you have high-quality, large volumes of data, these models do a pretty solid job without you doing much in terms of hyperparameter tuning, right? Just standard, out-of-the-box algorithms are able to do a pretty good job to serve the needs that you have. I think these two are the big things.
In our case, because we are a climate and clean technology company, we see there is a big awareness when it comes to climate change. People are becoming more and more aware about it. The government is becoming more and more aware. Recently, the US government passed the IRA, the Inflation Reduction Act, where they’ve earmarked a good amount of money to tackle this problem. The public, in general, they are becoming more and more aware of it, and they want to do something about it, right? So that kind of awareness has also helped our cause.
[00:18:39] HC: How do you measure the impact of your technology?
[00:18:42] AG: Yes. As I said, we are using machine learning technology in different spheres of our business. For example, when it comes to how our systems are performing over a period of time, we have these IoT devices that very closely monitor how the equipment is performing over a period of time. We have the utility bills to be able to see if the actual usage has gone down or not. So that’s one way that we monitor.
As I said, we have a model to score the leads and convert them into opportunities. So for that, we keep a very close eye on the campaigns that we run. So we generate these leads, and we’ve created a feedback loop that automatically keeps track of the campaign performance, right? Based on that, we are able to find out how the model is performing overall, over a period of time, right? If out of the thousand leads that we generated, 50 of them are productive, which means they got converted into our customers from the leads, that’s sort of the success rate of the campaign, right?
Then we feed that data back into our models so that they are now trained on these successful leads. Gradually, we see that the performance of the model improves over a period of time, right? So we’ve kind of automated this entire feedback loop where we are continually assessing the performance of the model and retraining the model on a continuous basis so that the performance is up to the mark.
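In outline, the feedback loop described above pairs each campaign's leads with their conversion outcomes and folds them back into the training set. The class below is a stand-in sketch (the production lead-scoring model and data schema are not described in the episode):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the closed feedback loop: campaign outcomes are
# logged against each lead, the success rate is reported, and the growing
# training set would periodically be used to refit the scoring model.
@dataclass
class LeadFeedbackLoop:
    training_set: list = field(default_factory=list)  # (features, converted) pairs

    def record_campaign(self, leads, outcomes):
        """Log each lead's conversion outcome; return the campaign success rate."""
        self.training_set.extend(zip(leads, outcomes))
        return sum(outcomes) / len(outcomes)

loop = LeadFeedbackLoop()
# Mirrors the example in the episode: 50 of 1,000 leads convert.
rate = loop.record_campaign([{"income_band": "mid"}] * 1000,
                            [1] * 50 + [0] * 950)
print(rate)                    # 0.05
print(len(loop.training_set))  # 1000
```

A retraining job would then refit the model on `training_set` on a schedule, which is the "continuous basis" retraining Ankur mentions.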
[00:20:07] HC: Is there any advice you could offer to other leaders of AI-powered startups?
[00:20:13] AG: Yes. I mean, look, based on my experience, the technology is really, really powerful. But you have to really understand the problem space. So I think understanding the problem space, what is the problem that you’re trying to solve, and how could machine learning help you to solve that problem.
There are a number of areas where machine learning could be utilized. But there are also a lot of areas where it may be an overkill, right? Because you need a lot of data, and you need a lot of high-quality data for machine learning and artificial intelligence to be productive, right? You, first of all, have to assess the problem space very closely. See what kind of data is available. If it is not available, you really have to invest time, energy, money to obtain high-quality data or to create high-quality data.
Then you, of course, have to invest in the data infrastructure, right? Because as I said, 80% of the effort is to just source this data, clean this data, curate this data. Without the foundation of a very, very strong data stack, it’s simply not possible. So you really have to invest into your data lake, into your data warehouse, into your data governance if you want to utilize machine learning to solve the problem at hand, right? So that’s pretty important.
The fourth is ensuring that there’s a very solid data governance in place, and we are utilizing AI and machine learning in an ethical fashion. I think that’s incredibly important to realize and understand because increasingly, that’s becoming more of a problem. As we see the large language models hallucinating, there are a lot of concerns around data privacy in general. So having very strong checks and balances on the way we care for the data of our customers and how we put that into use through machine learning. I think that’s very important.
[00:22:03] HC: Finally, where do you see the impact of BlocPower in three to five years?
[00:22:08] AG: Yes. So I think we are on a pretty good track. We are already the program managers for a lot of cities across the US. I mentioned a few within New York State, California, Wisconsin, Massachusetts. We are expanding our footprint, right? So in the next three to four years, we definitely see ourselves as the leaders, and we would have made a huge dent in solving this problem of electrifying the buildings.
The recent money that we’ve got from the IRA will help, we are hoping, in a big way to make these projects more financially viable. With the power of machine learning and artificial intelligence, we are continually improving the fidelity of our models based on the data that we are getting from the ground, right? So it’s sort of a circle, right? As we do more projects, we collect more data, and then that high-quality data could feed into our models, and our models become more accurate, right?
We see that flywheel turning at a faster pace as we do more and more projects on the ground. That would really help us to scale operations and our ability to identify high-quality projects over a period of time, right? So in the next three to four years, I think we should be able to scale our operations truly at the national level and electrify a bunch of buildings, ensuring a better future for the coming generations and improving the lives of disadvantaged communities through some of the work that we are doing, right? Making it healthier for them, improving the overall environment, and helping to mitigate climate change.
[00:23:40] HC: This has been great, Ankur. Your team at BlocPower is doing some really important work for climate change mitigation. I expect that the insights you’ve shared will be valuable to other AI companies. Where can people find out more about you online?
[00:23:52] AG: Yes. Feel free to reach out on our website, blocpower.io. You would have all the details there. We are all available on LinkedIn, so happy to connect with any of you if you have any questions. Feel free to email us at [email protected].
[00:24:08] HC: Perfect. Thanks for joining me today.
[00:24:10] AG: Thank you for having me.
[00:24:12] HC: All right, everyone. Thanks for listening. I’m Heather Couture, and I hope you join me again next time for Impact AI.
[00:24:23] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.