What are the unique challenges of operating mission-critical facilities, and how can reinforcement learning be applied to optimize data center operations? In this episode, I sit down with Vedavyas Panneershelvam, CTO and co-founder of Phaidra, to discuss how their cutting-edge AI technology is transforming the efficiency and reliability of data centers. Phaidra is an AI company that specializes in providing intelligent control systems for industrial facilities to optimize performance and efficiency. Vedavyas is a technology entrepreneur with a strong background in artificial intelligence and its applications in industrial and operational settings. In our conversation, we discuss how Phaidra’s closed-loop, self-learning autonomous control system optimizes cooling for data centers and why reinforcement learning is the key to creating intelligent systems that learn and adapt over time. Vedavyas also explains the intricacies of working with operational data, the importance of understanding the physics behind machine learning models, and the long-term impact of Phaidra’s technology on energy efficiency and sustainability. Join us as we explore how AI can solve complex problems in industry and learn how Phaidra is paving the way for the future of autonomous control with Vedavyas Panneershelvam.


Key Points:
  • How collaborating on data center optimization at Google led to the founding of Phaidra.
  • How Phaidra’s AI-based autonomous control system optimizes data centers in real time.
  • How reinforcement learning is leveraged to improve data center operations.
  • The range of data needed to continuously optimize the performance of data centers.
  • The challenges of using real-world data and the advantages of redundant data sources.
  • How Phaidra ensures its models remain accurate even as conditions change.
  • Phaidra’s approach to validation and scalability across facilities.
  • Why Vedavyas thinks this type of technology is valuable and needed.
  • Recommendations for leaders of AI-powered startups and the future impact of Phaidra.

Quotes:

“Phaidra is like a closed-loop self-learning autonomous control system that learns from its own experience.” — Vedavyas Panneershelvam

“Data centers basically generate so much heat, and they need to be cooled, and that takes a lot of energy, and also, the constraints in that use case are very, very narrow and tight.” — Vedavyas Panneershelvam

“The trick [to validation] is finding the right balance between relying on the physics and then how much do you trust the data.” — Vedavyas Panneershelvam

“[Large Language Models] have done a favor for us in helping the common public understand the potential of these, of machine learning in general.” — Vedavyas Panneershelvam


Links:

Vedavyas Panneershelvam on LinkedIn
Phaidra


Resources for Computer Vision Teams:

LinkedIn – Connect with Heather.
Computer Vision Insights Newsletter – A biweekly newsletter to help bring the latest machine learning and computer vision research to applications in people and planetary health.
Computer Vision Strategy Session – Not sure how to advance your computer vision project? Get unstuck with a clear set of next steps. Schedule a 1 hour strategy session now to advance your project.


Transcript:

[INTRODUCTION]

[00:00:03] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes. Plus, follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.

[INTERVIEW]

[0:00:33.8] HC: Today, I’m joined by guest Vedavyas Panneershelvam, CTO and co-founder of Phaidra, to talk about operating mission-critical facilities. Vedavyas, welcome to the show.

[0:00:44.7] VP: Thank you, Heather. Great to be here on this podcast, looking forward to it.

[0:00:48.7] HC: Could you share a bit about your background and how that led you to create Phaidra?

[0:00:52.2] VP: Sure. Yeah, my background was in computer science, especially distributed computing. I became interested in ML after the 2012 ImageNet competition, when AlexNet just destroyed every other pattern recognition team in ImageNet by better than a 10 percent error margin, and that’s where I go, “Hey, this is a field that I should start paying attention to.”

And, just like everyone else at the time, I started with Andrew Ng’s Coursera course. Then, later in 2013, I joined DeepMind. At the time, all these algorithms weren’t taking advantage of distributed computing. So, that was a good entry point for me, to scale up these algorithms.

A lot of this training was done on a single machine. Within Google, there were a few people trying to experiment at that time, in 2013, with training at Google scale: Jeff Dean and the team, Google Brain. They published the “cat” paper around then. So, that was the only team doing it at the time.

So, it was good for me to be within Google and DeepMind at the time. I had this distributed computing background, and there was this new field of machine learning. So, it gave me a chance to get in and get my sense of the field. I was also still learning machine learning, picking it up as I went. So, initially, I was working a lot on scaling these algorithms.

I was also working on games, applying these algorithms, especially deep neural networks and deep RL algorithms, in a distributed fashion to games, and even a narrow scope of autonomous driving at the time. Then, in 2015, I started working on AlphaGo with the team, and we had the watershed moment in 2016. After a year of working on AlphaGo, I had the realization that search on top of learning with reinforcement learning is really powerful, and that this is a framework that is really generalizable and scalable as well.

So, in 2016, I was looking for a real-world application that I could take this and apply it to, and that’s when I met Jim Gao, one of the other co-founders of Phaidra. Jim came from the data center team, and he had a similar vision. He had done an initial prototype of applying ML to data center optimization, and he had some good results.

He was of the opinion that actually applying reinforcement learning was going to be even more powerful. He later moved over to DeepMind to start the DeepMind Energy team, and I joined him as its first member. At the time, he had applied the first reinforcement learning algorithm to the data center cooling use case within Google, and then he wanted to take this and commercialize it outside of Google.

As part of that exercise, we met our other co-founder, Katie, in 2018. She was responsible for commercializing new technologies within Trane, and Trane and Google had partnered at the time to take this technology outside, to commercial building use cases. As part of working together, we realized that the potential for this technology is much beyond what the teams were thinking at the time, and that led us to thinking, “Hey, we should start Phaidra and start commercializing this.”

[0:04:51.8] HC: So, what does Phaidra do and why is it important?

[0:04:54.6] VP: Yeah, Heather. So, Phaidra is a closed-loop, self-learning, autonomous control system that learns from its own experience. I know, this is a very overloaded sentence with a lot of terms. Closed-loop means we do read and write without human intervention, and we do it in near real-time. We write our near real-time decisions back to each control system.

When I say near real-time, we work on the scale of a one-minute to five-minute frequency, right? These are self-learning autonomous control systems: autonomous in the sense that we don’t have humans in the loop, and the self-learning aspect comes from the fact that we learn from our own experience and improve over time. Today, if you look at the market, most control systems are static and rule-based, right?

From the time you commission them, they start performing poorly over time, and in spite of these mission-critical facilities being heavily sensorized and digitized, this data is not taken advantage of. This is what we wanted to address. From day one, our autonomous control system is as performant as the existing way of operating, but it improves over time by learning from its own experience, right?

One use case that we heavily focus on today is optimizing the cooling systems for data centers. In this particular use case, data centers basically generate so much heat, and they need to be cooled, and that takes a lot of energy. Also, the constraints in that use case are very, very narrow and tight. Sometimes, a thermal excursion can affect the performance of these servers and accelerators, and in the worst cases, it can even cause catastrophic failures.

[0:06:55.9] HC: So, reinforcement learning is a type of machine learning. How do you use this to solve the problem, and how does that improve operations for a data center, for example?

[0:07:04.7] VP: Sure, yeah. So, generally with ML, and especially reinforcement learning, we are creating systems that can learn, adapt, reason, and plan. One of the key challenges for agents that take actions is to reason about and understand how those actions will affect the environment, which is closed-loop in nature. And when you have a system that is being operated by a set of rules, right?

The amount of state space that you’ve explored is just a finite subset of it. You can start with supervised learning to bootstrap the system, but then you need a safe way of trial-and-erroring over time, to explore certain actions that can be much more optimal, and you want to be learning from them. Without that exploration, you wouldn’t be able to improve your behavior, right?

Improve your control behavior, improve your policy. This is where reinforcement learning is really good: you have a safe framework by which you can explore, and then, once you learn, in an epsilon-greedy type of fashion, you can keep exploring and take advantage of what you’ve learned.
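The epsilon-greedy idea mentioned here can be sketched in a few lines. This is a generic textbook illustration, not Phaidra’s actual controller; the action values and epsilon below are invented for the example.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for three candidate setpoint adjustments.
q = [0.2, 0.9, 0.5]
action = epsilon_greedy(q, epsilon=0.1)  # usually picks action 1, occasionally explores
```

In a safety-critical plant, the exploration step would additionally be filtered through constraint checks before any setpoint is written back to the control system.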

[0:08:26.0] HC: What type of data do you work with? What’s the input and what’s the output of some of these models?

[0:08:30.2] VP: Yeah, sure. The data that we work with is mostly the operations data from the field, Heather, from the sensors that define the state of the system. For cooling systems within data centers, it will be your usual suspects: temperatures, pressures, flows, power consumption, pump and fan speeds, run status of the equipment, et cetera. That’s generally the data we work with.

[0:08:55.8] HC: What kinds of challenges do you encounter in working with operational data like that?

[0:09:00.8] VP: All kinds of challenges, right? These are sensors in the physical world, and you can expect all kinds of failures. Your sensors can be miscalibrated, or your sensors can be damaged, which is the worst case. They’re connected in a local network that can fail. We establish a read connection to these facilities, and that connectivity between our cloud and these facilities can fail as well. There can also be encoding issues along the way, as the data is collected in the local network from the sensors. So, there are a lot of these failures, ranging from hardware failures to software issues like encoding glitches and protocol failures.

But one good thing, Heather, is that you have a redundancy of information. Generally, in these infrastructures, you don’t have just one temperature sensor. For example, if you take a data center chilled water plant, you’re going to have temperatures in each of the chillers, and also on the common distribution loop, and also where the water is entering and exiting the data halls.

So, when there’s a downstream and an upstream sensor, and one of the temperature sensors fails, the other temperature sensors close nearby will be heavily correlated with it. So, you have a redundancy of information that you can take advantage of, and you can impute the missing value, right? So.
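As a toy illustration of that redundancy, a failed sensor can be imputed from its correlated neighbors with a simple linear fit. The sensor layout and readings below are invented, and the episode does not specify Phaidra’s actual imputation method; this is just one minimal way to exploit correlated sensors.

```python
import numpy as np

# Hypothetical history: rows are timestamps, columns are nearby temperature
# sensors (deg C) on the same chilled-water loop.
history = np.array([
    [6.1, 6.3, 6.0],
    [7.0, 7.2, 6.9],
    [5.5, 5.7, 5.4],
    [6.6, 6.8, 6.5],
])

# Fit a linear model predicting sensor 2 from sensors 0 and 1 (plus a bias),
# exploiting the strong correlation between nearby sensors.
X = np.column_stack([history[:, :2], np.ones(len(history))])
coef, *_ = np.linalg.lstsq(X, history[:, 2], rcond=None)

def impute(s0, s1):
    """Estimate the failed sensor's reading from its healthy neighbors."""
    return coef[0] * s0 + coef[1] * s1 + coef[2]

print(round(impute(6.4, 6.6), 2))  # prints 6.3
```

A production system would also need to decide *which* sensor failed, typically by checking each reading against what its neighbors imply it should be.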

[0:10:35.8] HC: So, reinforcement learning enables your system to adapt as time goes on, does this naturally make your models robust to distribution shifts or is this something that you’d have to think about in addition?

[0:10:48.4] VP: Great question, Heather. There is a lot that we do to counter that. We spend a lot of time and effort on it, and I’m really glad you asked. There are two things that we do in this space to accommodate for distribution shift. One thing we need our models to be doing is working in the regime of causation, not correlation.

So, how do we do that? One of the things we do is use the domain knowledge we have of this cooling industry. We need to understand the physics of the underlying domain, and also understand which are the causal variables, which are the confounding variables, and which are the mediator variables. It goes into causal inference theory: trying to understand which features we should be using.

We heavily follow causal inference theory over here to pick our features, so that we keep our models in the causation regime. That is one, and two is, there is a set of first principles from physics that we know about, and we enforce those in our training process. It could be inductive biases in our network. It could be how we sample certain times, or how we add regularization losses to our loss functions while we train.

So, in the training procedure, we try to incorporate all these physical priors that we understand, and that’s also generally applicable.
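One common way to encode such a physical prior as a regularization loss, offered here only as a sketch since the episode doesn’t give Phaidra’s actual loss terms, is a hinge-style penalty that fires wherever the model’s prediction moves in a physically impossible direction, for example power rising as the chilled-water setpoint rises:

```python
import numpy as np

def monotonicity_penalty(model, setpoints, other_inputs, delta=0.1):
    """Hinge penalty on any regime where predicted power *rises* as the
    setpoint rises; physics says it should fall. Added to the usual
    prediction loss during training (an illustrative prior, not
    Phaidra's documented loss)."""
    low = model(setpoints, other_inputs)
    high = model(setpoints + delta, other_inputs)
    # A positive (high - low) violates the expected direction; hinge it.
    return np.mean(np.maximum(0.0, high - low))

# Toy "model": power falls linearly with setpoint, so no violations.
good = lambda sp, x: 100.0 - 5.0 * sp + x
print(monotonicity_penalty(good, np.array([6.0, 7.0]), np.array([0.0, 1.0])))  # prints 0.0
```

With a differentiable model, the same penalty can be written against the model’s analytic gradient instead of finite differences.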

[0:12:22.9] HC: Are these physical variables things that you can anticipate if you are developing a system for a new facility or do you sometimes learn about new variables and new challenges when you need to plan in a new facility?

[0:12:36.1] VP: Yeah, great question again. Generally, you would say all these facilities are snowflakes, which is true, but for a given use case, say cooling systems, where you’re optimizing the cooling of a data center, there are only the usual suspects in terms of the equipment that is used, and there’s only a known, finite set of ways that you put them together to form the cooling system.

So, what we had to do initially, to make it scalable, was define a taxonomy, an abstraction for us, and we want to be working at this abstract level rather than at the level of the actual physical system and whatever sensors happen to be available. We have an abstraction and a taxonomy, we map the facility over to that, and beyond that, that is the level at which we work.

In that case, while a new facility that is coming in might not be the same in terms of the physical reality, in the abstraction that we have, they are pretty similar.

[0:13:44.9] HC: How do you validate your models?

[0:13:47.1] VP: So, again, as I said, to counter the distribution shift, we also spend a lot of time on this, especially when you have an agent that is taking actions in a real-world environment. You don’t have access to counterfactuals, and the dataset has never explored the whole set of possible state spaces. There are definitely going to be distribution shifts, so you have to be careful.

We understood right from the start that your typical MAE, mean absolute error, and your mean squared error, and your R², of course, are not the sole metrics that we should go by. So, we use the domain knowledge to derive metrics and unit tests that the model has to satisfy, but the trick over here is finding the right balance between relying on the physics and how much you trust the data.

One example that I can give: let’s take these physical systems, let’s take these chillers, right? A chiller, you can think of it as having two actions that you can set. One is whether you want to run the chiller or not, and the second one is the temperature of the water that should be leaving the chiller. It’s like your air conditioning, right?

So, the power consumed by the chiller, or your air conditioner in this case as well, is going to be directionally dependent upon how hard you make that chiller or air conditioner work. The lower you set the temperature, the more power it’s going to consume. That directionality is definitely a given, so there is a monotonic relationship between that input variable, the setting, one of the actions or set points, and the output that you’re concerned about, right?

The output of the model that you’re concerned about has a monotonic relationship with that input, but you don’t know the magnitude, or the slope of that particular line, at every operating condition. You can have an idea of it, but you only really know the monotonic relationship. So, what we can do, as we generate unit tests and so on, is verify whether the model actually is monotonic in nature at every operating regime, right?

You can check that, but the trick is, if you want to go, “Okay, for this particular operating regime, I expect the slope to be this particular value,” that is going to be iffy. As I said, it depends on the operating regime and a lot of factors. What is the chiller’s maintenance cycle at this point in time? How much fouling does it have? What is the outside condition?

All these things come into play, and they affect the nature of the slope. So, that’s where the trick comes in. We want to be using general first-principles physics, which is more about these directionality relationships, and then for anything that comes down to magnitude, we want to trust the data as much as we can.
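The directional check he describes translates naturally into a unit test: assert only the *sign* of the finite-difference slope across sampled operating conditions, never its magnitude. A minimal sketch, where the toy surrogate model and the regime grid are made up for illustration:

```python
import numpy as np

def check_monotonic_decreasing(predict, setpoints, conditions, tol=1e-6):
    """Unit-test-style check: at every sampled operating condition, predicted
    chiller power must not increase as the leaving-water setpoint increases.
    Only the direction is asserted, never the slope's magnitude."""
    for cond in conditions:
        preds = np.array([predict(sp, cond) for sp in setpoints])
        if np.any(np.diff(preds) > tol):  # any positive slope is a violation
            return False
    return True

# Toy surrogate model: power grows with load and shrinks with setpoint.
predict = lambda sp, load: 50.0 + 2.0 * load - 4.0 * sp

ok = check_monotonic_decreasing(predict,
                                setpoints=np.linspace(5.0, 9.0, 9),
                                conditions=[10.0, 20.0, 30.0])
print(ok)  # prints True
```

This is exactly the "trust the physics for direction, trust the data for magnitude" split: the test fails a model that bends the wrong way, but never pins down how steep the curve should be.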

[0:16:49.2] HC: Why is now the right time to build this technology?

[0:16:52.5] VP: We started thinking about this four, five years back, in 2019, 2020, and the bullish case for it has only gotten a lot, lot stronger. We definitely see the world in a way where these control systems need to become intelligent and adaptive, and they need to learn from their own experience.

Today, a lot of these mission-critical facilities operate the same way they were programmed some 10 or 15 years back. Just to think about it is insane. These systems generate so much data, and you can learn from these experiences and improve over time, right? It’s just insane for us to think that the control systems today don’t learn from their own experience and improve over time.

And I think large language models have definitely done a favor for us in helping the common public understand the potential of machine learning in general. Before, it was hard for people to see this: can you actually control these mission-critical facilities with this kind of technology? Now, it is not that far-fetched for people.

So, the vision that we had five years back is a lot, lot closer now than it was back then.

[0:18:23.9] HC: Is there any advice you could offer to other leaders of AI-powered startups?

[0:18:28.2] VP: I think it’s just one, which is a common thing: focus, right? You need to be really, really focused on the particular product or use case that you’re working on. It’s easy to get distracted. It’s easy, as a machine learning researcher, to invest your time and effort in the algorithms, or in the complexity of the algorithms you want to be developing.

But at the end of the day, you should just focus on the impact that it’s going to create for your users. It’s about finding the right balance, but be really focused on the impact and on making the users happy, and I think that will take care of it, in my opinion.

[0:19:09.3] HC: And finally, where do you see the impact of Phaidra in three to five years?

[0:19:12.8] VP: Yeah, so today, we are starting to work in brownfield data centers. When we started, our vision was, and it still is, the future of industrial control systems. We are focusing at this point in time on data centers. We have also ventured into pharmaceutical manufacturing and district cooling. While we still have these three use cases, we are heavily focused on data centers at this point in time.

I think we would be restricting our use cases to these three applications at this point in time, especially in the brownfield areas. But over time, in two to three years, we are seeing that the build-out of data centers, the new and updated data centers coming up in the next five years, is going to double the data center capacity of what we have now.

Given that, I would say we want to be penetrating the data center market, both brownfield and greenfield, by 40 to 50%. As for the impact of that, there are a lot of articles saying that data center power consumption is going to be 10% of US electricity consumption by 2030, right? So, in that timeframe, if we can save 10% of the cooling power consumption, which is generally one-third of that total, that would still be a decent 1 to 1.5%.

Given the 40 to 50% penetration that we’d have, right? And this is just the start for us. In the four-to-five-year timeframe, we would like to at least start researching and branching out to one more use case, but I strongly believe we’ll be super focused on data centers for the years to come.

[0:21:05.3] HC: This has been great, Vedavyas, I appreciate your insights today. I think this will be valuable to many listeners. Where can people find out more about you online?

[0:21:14.1] VP: I’m not available on the Internet that much, but LinkedIn would be a good place, and I’m going to spend some time to write up some of my thoughts and ramblings at some point. For now, I think the Phaidra blog and LinkedIn would be the good places, Heather.

[0:21:33.1] HC: Perfect. I’ll link to those. Thanks for joining me today.

[0:21:36.4] VP: Yeah, thank you, Heather, it was great.

[0:21:38.5] HC: All right everyone, thanks for listening. I’m Heather Couture and I hope you join me again next time for Impact AI.

[END OF INTERVIEW]

[0:21:48.6] HC: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. And if you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter. [END]