In this episode, we discuss what it means for AI to be trustworthy and Yiannis explains the process by which Code4Thought evaluates the trustworthiness of AI models. We discover how biases manifest and how best to mitigate them, as well as the role of explainability in evaluating the trustworthiness of a model. Tune in to hear Yiannis’ advice on what to consider when developing a model and why the trustworthiness of your business solution should never be an afterthought.
- Yiannis Kanellopoulos’ background; how it led him to create Code4Thought.
- What Code4Thought does and why it’s important for the future of AI.
- What it means for AI to be trustworthy.
- How Code4Thought evaluates the trustworthiness of AI models.
- Yiannis shares a use case evaluation in the healthcare sphere.
- Why Code4Thought’s independent perspective is so important.
- Yiannis explains how biases manifest in AI technology and shares mitigation strategies.
- The role explainability plays in evaluating the trustworthiness of a model.
- Why explainability is particularly important for financial services.
- Simultaneously optimizing accuracy and explainability.
- What to consider when developing a model.
- The increasing demand for trustworthy AI in various sectors.
- Yiannis’ advice for other leaders of AI startups.
- His vision for Code4Thought in the next three to five years.
[0:00:02] HC: Welcome to Impact AI, brought to you by Pixel Scientia Labs. I’m your host, Heather Couture. On this podcast, I interview innovators and entrepreneurs about building a mission-driven, machine-learning-powered company. If you like what you hear, please subscribe to my newsletter to be notified about new episodes, plus follow the latest research in computer vision for people and planetary health. You can sign up at pixelscientia.com/newsletter.
[0:00:32] HC: Today, I’m joined by Yiannis Kanellopoulos, founder and CEO of Code4Thought, to talk about trustworthy AI. Yiannis, welcome to the show.
[0:00:41] YK: Thank you very much, Heather. Thank you for the invitation.
[0:00:44] HC: Could you start by sharing a bit about your background and how that led you to create Code4Thought?
[0:00:48] YK: Oh, yes. I’m Yiannis. I’m 46 years old. I have lots of experience, let’s say more than 15 years, actually, in assessing the technical quality of large-scale software systems, traditional software systems. I’ve been doing that since 2007 or something like that. Prior to that, I was doing my Ph.D. in AI and data mining, creating algorithms for analyzing large volumes of data. Then, somewhere in 2008, 2009, let’s say, there was an incident that made me think, “Okay, how can we control algorithms?”
I wanted a credit card issued, so I made an application to a bank. The bank rejected my application and didn’t give any explanation. Then I submitted the same application to another bank, and they actually gave me two credit cards with a really high credit limit. When I asked both banks why the first one rejected me and why the second gave me so much credit, both of them, let’s say, declined to give me any explanations.
Around that time, I started thinking, “Okay, we are at the mercy of algorithms already. We need to protect ourselves.” Long story short, by 2017, I was already working with really large organizations in the telecoms, banking, and healthcare sectors, assessing their traditional systems, let’s say, the ones that have millions of lines of code behind them. I started asking myself, okay, now it’s time to think about how we can evaluate and assess not only those systems, but also systems that are more algorithmic. They have way fewer lines of code behind them, but they process millions of transactions, let’s say really large quantities of data. They are probabilistic, and the decisions, the outcomes of those systems, may be way more critical for our lives.
If you have, let’s say, a core banking system and the system works in a wrong way, the worst-case scenario is that the amount in your account won’t be correctly presented. Eventually, you’re going to get the right amount, and your fortune is not in danger. But if a bank denies credit to you, then this will create problems in your personal life, your business life, and so on. That’s why we created Code4Thought.
[0:03:02] HC: What all does Code4Thought do? Why is this important for the future of AI?
[0:03:07] YK: What we do is build technology for testing and auditing AI systems. Around this technology, we provide a series of services. We call them AI audits, or, when there is a transaction in which a company acquires another company that has an AI component, we do the so-called AI due diligence. What we actually do is help organizations not only improve, let’s say, the quality of the decisions of the system, but in general, improve the trustworthiness of their AI systems, or, as our mission statement says, we want to make technology trusted and thoughtful.
[0:03:47] HC: What does it mean for AI to be trustworthy?
[0:03:49] YK: Well, there is lots of discussion about it, right? There is legislation being voted on in the United States and also in Europe and in Canada. Everybody has their own interpretation of trustworthiness. In practice, because our work focuses on the AI system itself, it means the following. It means that the organization around the system will follow some best practices when it comes to the governance of the system and some risk management processes around it.
At the more technical level, it means that the system needs to be tested for bias, so we know that it works in a fair way, and that it’s transparent. There is an explanation one can get; the organization that builds the system performs explainability analysis to understand the workings of the model it built. Third, and very important, is the safety and security of the system, or, as we call it, the robustness of the system, which we can test with adversarial attacks to see whether the system is susceptible to manipulation by malicious users, this kind of attack.
[0:04:57] HC: How would you approach evaluating the trustworthiness of a model? Does it vary depending on the application you’re in, or is there a standard approach?
[0:05:05] YK: Well, the answer is yes and no. I mean, we do follow a structured approach to the evaluation. We do follow a series of uniform analyses or measurements to, let’s say, test bias, perform some explainability analysis on a system, or test its robustness. But the way we define certain aspects of the testing and the auditing is very much dependent on the context.
Let’s say there is a system in the healthcare sector. We’re not going to treat it the same way; although we’re going to use the same tests, the interpretation of the results, or the thresholds for the results, will be different compared to a system, let’s say, in the logistics sector, which may not be that high risk, where the existence of bias might not be that important compared to a system in healthcare or a system in the financial sector.
[0:05:58] HC: For one of those examples, whichever one is easier to talk through, could you maybe walk through the types of things that you’re evaluating for a specific use case?
[0:06:08] YK: Well, I have a very nice use case. It’s from the healthcare sector, where we were evaluating a deep learning model which identified whether a patient at the hospital has fallen off the bed or not. This is a problem, especially in nursing homes in Europe, where there are lots of elderly people, and especially at night, there are not enough nurses to supervise all those patients. Several solutions have been tried in this field, with all kinds of things, sensors, various devices, but none of them seems to be working as expected.
There was this company in the Netherlands that was building an algorithm that can run on the edge, let’s say edge computing, to monitor the patient in their room. Then, if the patient falls off the bed, it gives an alert to the night shift, and somebody goes there to check what’s happening with the patient. It was a very interesting project: deep learning on data sets mainly made out of streaming videos. All those aspects, the existence of bias, the explainability analysis, and robustness, were very important, because, A, you want to make sure that the algorithm and the model are fair, treating male and female patients the same way.
What we found out, actually, was that, because it wasn’t properly trained, the algorithm could predict the wrong result. The likelihood of predicting the wrong result for a woman was twice as much compared to a man, but it was a matter of training, so the team fixed it. The second part was running what we call explainability analysis. We tested it because we wanted to make sure that the algorithm makes the right decision for the right reasons: if it sees that the patient is lying on the floor, the right pixels, or the right parts of the picture, should play an important role in the decision, and not the background so much.
We wanted, let’s say, to avoid the effect where the model is very good at judging by the background and not by, let’s say, the center of the image. The third was the safety of the system. If, let’s say, somebody violated the perimeter of the whole system and was able to manipulate the pictures or input served to the model, trying to change the results, we wanted to see how easy that would be to do with this model.
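The robustness testing described here, checking whether manipulated input can change a model’s results, can be illustrated in miniature. This is an editorial sketch, not Code4Thought’s actual tooling: the mean-based model, function names, and random-noise probe are all hypothetical stand-ins, and real adversarial testing typically uses gradient-based attacks such as FGSM or PGD rather than random noise.

```python
import random

# A crude robustness probe: perturb the input slightly many times and
# measure how often the model's decision flips. The "model" here is a
# hypothetical stand-in for a real network's scoring function.

def decision(model, x, threshold=0.5):
    """Binary decision from a scoring model."""
    return model(x) >= threshold

def robustness_probe(model, x, eps=0.05, trials=200, seed=0):
    """Fraction of small random perturbations that flip the decision."""
    rng = random.Random(seed)
    base = decision(model, x)
    flips = 0
    for _ in range(trials):
        noisy = [v + rng.uniform(-eps, eps) for v in x]
        if decision(model, noisy) != base:
            flips += 1
    return flips / trials

# Toy scorer: the mean of the input features.
mean_model = lambda x: sum(x) / len(x)

# An input far from the decision boundary should be stable; one sitting
# right on the boundary should flip often.
stable = robustness_probe(mean_model, [0.9] * 10)   # expect 0.0
fragile = robustness_probe(mean_model, [0.5] * 10)  # expect roughly half
```

A model whose decisions flip under tiny random perturbations is an obvious red flag, but a model that survives random noise can still fail against targeted, gradient-guided perturbations, which is why dedicated attack libraries are used in practice.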
[0:08:41] HC: Basically, you’re identifying what some of the important characteristics for that use case should be, things like: it should be fair and unbiased, and it should work in the scenarios it needs to work in. Then you’re evaluating those things to make sure that there isn’t a known problem with the algorithm that the developers need to fix.
[0:09:00] YK: Yes. We found a term for it; I mean, we had discussions with some other, let’s say, peers about what we do, and what we are doing we call last mile analytics. Let’s just say that there are lots of tools out there, open-source tools, with which you can do bias testing, generate explanations for your model, or run your adversarial attacks. But what we do at Code4Thought, let’s say, is this comprehensive testing at the end of, let’s say, an AI model’s development pipeline. A team may be using their own tooling to test whatever they want, but in the end, it’s us who take, let’s say, the model and a known data set for that model and test it in a comprehensive way with our own thinking, which is, let’s say, independent from the team that developed the model.
This thing we call the last mile analytics: just before your model is going to be deployed to production, we do the last check by combining the results from all of these aspects we discussed. The goal is to identify the weaknesses, the risky points of the model, before these manifest in production, of course, and for the team to be able to fix them.
[0:10:14] HC: I imagine that that independent perspective is very important. When you’re developing a machine learning solution, your head is very much in the game, and it’s very easy to have blind spots because you’ve been working on the same thing for months or years at a time. That independent perspective that you bring, I suspect, is very important to this analysis.
[0:10:36] YK: Yes, for two reasons. The first one is what you just said: the team is, like, inside the box all the time. But also, the team that produces an AI system is tasked with solving a business problem. They’re optimizing for solving this problem. They’re not optimizing to test the model adequately, or to identify corner cases, or to ensure that the model is working properly and can be trusted. These guys try to solve, essentially and primarily, the business problem.
We, our mission is different. We are tasked with identifying the weaknesses and the weak points of the system, and then helping the team to improve them. We come with two different perspectives, but essentially everyone wants the same thing. That is the nice thing about working with our clients: we’re not there just to test and audit. We’re there to help them improve their model. Eventually, both sides are winning in the end. We are happy because we found things the team can improve, and the team has implemented them. Also, the team learned something new out of this process, of course. They are in a better position to present their work and to be, themselves, A, more proud, and B, more confident about the solution they’re offering to the world or their clients.
[0:11:54] HC: You mentioned bias as one of the things that you’re looking for. There’s been much greater awareness of bias in AI over the last couple of years. In medical applications, it can come up when a particular subgroup, perhaps a racial group or gender or age, is underrepresented in the training data. In other applications, maybe in earth observation, it could be because your training data overrepresents a certain part of the world, but you expect your model to work everywhere in the world, and perhaps it’s less accurate in some places. What are some common ways that bias can manifest in the types of projects you’ve been involved in? What strategies do you recommend for mitigating it?
[0:12:33] YK: Yeah. First of all, especially for this type of bias, let’s say the social biases or the ethical biases, there are certain metrics one can use, like the disparate impact ratio or the equal opportunity ratio. For those, there are some thresholds defined that you can utilize to identify the existence of potential biases in your data set. Now, that’s one thing. Now, how to mitigate it? It depends also on how much access your client can give you to their data. Let’s just say that the nice thing with these two metrics is that you don’t actually need to know the ground truth of the model. You just need to know the distribution of the decisions, and then you can judge whether there is bias or not.
Let’s say, in our experience, if the teams that we work with, our clients, share more information with us, like the ground truth for certain decisions in the data, then we can do much more thorough and deeper work, identify the root causes, and then be able to suggest mitigations. Now, also depending on the team’s involvement in constructing or developing an AI model, of course, the mitigation mechanisms differ. If a team builds the whole system themselves, it’s easier to mitigate the bias, because they have a say in the data collection, the training data, and the weights within the model, so they can calibrate and fine-tune things.
The less control, let’s say, the team has over the model, the more difficult it becomes to fix any bias problems. Of course, even if you didn’t build the model yourself and you’re just utilizing it, then while it’s harder to identify problems, there are still ways to mitigate them. For instance, try to create better training data sets, or, let’s say, more balanced data sets. Nevertheless, depending also on the stage you’re at with your model, whether you’re still building it or it’s already in production, you can mitigate accordingly.
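As a concrete illustration of the disparate impact ratio mentioned above: it compares the rate of favorable decisions across groups and, as Yiannis notes, requires only the model’s decisions and group membership, not ground truth labels. The function name, the data, and the four-fifths (0.8) flag threshold below are a hedged sketch of the standard definition, not Code4Thought’s implementation.

```python
def disparate_impact_ratio(decisions, groups, protected, favorable=1):
    """Favorable-outcome rate of the protected group divided by the
    favorable-outcome rate of everyone else."""
    prot = [d for d, g in zip(decisions, groups) if g == protected]
    rest = [d for d, g in zip(decisions, groups) if g != protected]
    rate = lambda xs: sum(1 for d in xs if d == favorable) / len(xs)
    return rate(prot) / rate(rest)

# Hypothetical credit decisions: group "A" gets 6 of 10 approvals,
# group "B" gets 8 of 10.
decisions = [1] * 6 + [0] * 4 + [1] * 8 + [0] * 2
groups = ["A"] * 10 + ["B"] * 10

ratio = disparate_impact_ratio(decisions, groups, protected="A")
print(round(ratio, 2))  # prints 0.75
```

A common convention (the US “four-fifths rule”) flags ratios below 0.8 as potential disparate impact, so the 0.75 here would warrant a closer look; as discussed earlier in the episode, the appropriate threshold depends on the sector and its risk level.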
[0:14:37] HC: I want to go back to another topic that you mentioned earlier, which is explainability. What role does explainability play in evaluating the trustworthiness of the model?
[0:14:47] YK: Well, I think it plays a very important role. For instance, we have cases where the clients want to understand how the model was thinking. Especially, the most common requirement we see is that our clients want us to test their model and see the degree to which the background of an image plays an important role in the decision, because if, let’s say, the background plays a big role, it means that your model is actually not working properly. It’s the typical case with the wolves in the snow, where the model was predicting, with high accuracy, the snow and not the wolf, in the end.
I don’t know if you know the story; it’s a very interesting story for computer vision models, but this is a common thing that we see. One can say that by performing an explainability analysis, you can essentially debug the way your model works. In financial services, explainability is also rather important, because if, let’s say, you are denied credit, at least you have the right to know the reasons, and also to understand how you can improve your credit rating, let’s say, and be able to secure some credit in the future. I think it’s essential.
[0:16:00] HC: Creating a model that is both explainable and accurate, those can be competing objectives sometimes, especially with deep learning, where models are mostly black box. There certainly are ways to explain parts of them, but they are largely black box, and you’re mostly optimizing some accuracy-type objective. How do you deal with those competing needs for accuracy and explainability?
[0:16:26] YK: Well, if you ask me, my point of view is a little bit theoretical in a sense, because I don’t think these should be considered as competing. I don’t think there should be a dichotomy per se. Of course, if you choose to, let’s say, build a solution using a deep learning model instead of a linear regression one, you may lose transparency, let’s say, and gain accuracy. But in the end, if, let’s say, you want to build something that your clients, or your regulators, or your internal users will trust, you need to set up the proper mechanisms in place.
Even if I have a deep learning model, there are ways that I can explain the decisions of that model and show them also to non-technical people. Thank God, we have several libraries out there. There are tools one can use for post-hoc explanations to assess the decisions the system makes. Nowadays, I think if we talk about the dilemma or the dichotomy from a technical point of view, there is none. From a practical point of view, of course, if you have a deep learning model and then you need to set up an additional mechanism to explain its decisions, that’s an overhead. But at the end of the day, you will need it in order to be able to mitigate any problems your model, let’s say, might create. Technically, we do have solutions nowadays.
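One of the simplest post-hoc explanation techniques behind such libraries, occlusion sensitivity, can be sketched as follows. The `toy_model` scorer here is a hypothetical stand-in for a trained network; in practice you would call your own model, or use an established library such as LIME, SHAP, or Captum.

```python
# Occlusion sensitivity: zero out one patch of the image at a time and
# record how much the model's score drops. Large drops mark the regions
# the model actually relies on (e.g. the patient vs. the background).

def toy_model(image):
    """Hypothetical scorer that responds to bright pixels in the centre."""
    h, w = len(image), len(image[0])
    centre = [image[r][c]
              for r in range(h // 4, 3 * h // 4)
              for c in range(w // 4, 3 * w // 4)]
    return sum(centre) / len(centre)

def occlusion_map(image, model, patch=2):
    """Heatmap of score drops when each patch is zeroed out."""
    base = model(image)
    h, w = len(image), len(image[0])
    heat = [[0.0] * w for _ in range(h)]
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            occluded = [row[:] for row in image]
            for rr in range(r, min(r + patch, h)):
                for cc in range(c, min(c + patch, w)):
                    occluded[rr][cc] = 0.0
            drop = base - model(occluded)
            for rr in range(r, min(r + patch, h)):
                for cc in range(c, min(c + patch, w)):
                    heat[rr][cc] = drop
    return heat

# An 8x8 "image" with a bright centre. The heatmap should highlight the
# centre, i.e. the model judges by the subject, not the background.
img = [[1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.1 for c in range(8)]
       for r in range(8)]
heat = occlusion_map(img, toy_model)
```

If the largest drops landed in background patches instead, that would be exactly the wolves-in-the-snow failure mode described earlier.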
[0:17:46] HC: Would it be fair to say that the important thing to consider is what result you need to produce, who needs to be able to look at it, and what type of output they need? It’s not just a binary decision in a lot of cases; it depends on who’s going to be looking at it and how they’re going to assess those results. The explainability part, or identifying the important part of the input data, or whatever aspect of it is important for that result, is the thing to consider when you’re first developing your model.
[0:18:14] YK: I think you need to define your stakeholders, your end users, the people who are most concerned about the explanations, and how you’re going to present those explanations to those people. The less technical the audience, let’s say, the simpler the explanations should be. At the end of the day, I think organizations need to realize that the more trustworthy their systems are, the more business value those systems have, because those systems are easier for users to adopt and to trust. In the end, you can, let’s say, sell more or achieve more from a business point of view with a system where you have set up all those mechanisms.
I agree with you: trying to identify the audience for the explanations, and the right type of explanations for the right persona or role, is time-consuming. Let’s say, it’s an important step towards making your system trusted by your clients and peers.
[0:19:13] HC: Yeah. The aspect of working backwards from the end goal to figure out how you should develop it and what its important characteristics are, I think that’s important.
[0:19:22] YK: No, I agree.
[0:19:23] HC: Trustworthiness is important for all AI and probably even beyond AI, but do you think there’s a greater demand or a greater need for trustworthy AI in certain verticals?
[0:19:34] YK: Yeah. That’s what we see, actually. You hear more and more about sectors like healthcare or automotive, sectors where the models make more life-critical or life-changing decisions. These are the sectors where people are asking for more transparency, more trustworthiness, more, let’s say, fairness. But eventually, because, as I said, we have this service, which is called AI due diligence, we also see that investors, when they buy an AI asset, or want to acquire an AI asset, are concerned about those things.
Recently, we did a project in the [inaudible 0:20:13], where a company acquired an AI startup from the logistics sector. You could say that they had a model which is not high risk. It was classifying alerts coming from certain types of vehicles. But in the end, even the investors were interested to know: is this model working properly? Is there any type of statistical bias in the model that we should be concerned with? Is it transparent enough? How easy is it for somebody to compromise the model? We start seeing this trend, or rather this need, because I don’t like talking about trends; I like talking about needs, actually. We see these needs manifesting themselves more and more, either in really highly regulated sectors or even in mergers and acquisitions of digital assets which are based on AI technology.
[0:21:03] HC: That’s interesting that even investors are looking for this now. There’s definitely a greater awareness in the AI community overall, in academia and in industry, about trustworthiness and bias and all the different aspects of it. It’s good to hear that other areas of the industry, other stakeholders, are thinking about it too, not just the technical people.
[0:21:26] YK: Yeah. I agree with you, Heather. I think that within the next two or three years, we will see more and more of these examples. We will see more investors asking for this type of due diligence. We will see more companies interested in knowing what regulators or what the legislations are dictating, so they will try to comply with them.
[0:21:46] HC: Is there any advice you could offer to other leaders of AI startups?
[0:21:50] YK: Well, yeah. I don’t know who I am to give advice, but nevertheless, I will try. I think that everyone who is in the AI domain and building their systems shouldn’t try to optimize only for the business problem. The quality of your solution, the trustworthiness of the solution, should not be an afterthought. I think teams and organizations should integrate these aspects from the start, from the beginning of the development of their system. I’m very confident that, in the end, this will pay off from a business perspective as well.
[0:22:27] HC: Finally, where do you see the impact of Code4Thought in three to five years?
[0:22:30] YK: That’s a really nice question. Personally, I’d like to see us become the state-of-the-art solution for what we like to call last mile analytics in AI audits, on both sides of the Atlantic: Europe, the UK, the US, and Canada.
[0:22:46] HC: This has been great. Yiannis, you and your team at Code4Thought are doing some really powerful work for responsible and trustworthy AI. I expect that the insights you’ve shared will be valuable to other AI companies. Where can people find out more about you online?
[0:22:59] YK: I guess they can visit our website, which is code4thought.eu, or our LinkedIn page, Code4Thought. We also have a YouTube channel, but you can find everything on our website, which is code4thought.eu.
[0:23:11] HC: Perfect. I’ll link to all of that in the show notes. Thanks for joining me today.
[0:23:14] YK: Thank you very much, Heather, for the very nice, interesting discussion and for the invitation.
[0:23:19] HC: All right, everyone. Thanks for listening. I’m Heather Couture. I hope you join me again next time for Impact AI.
[0:23:28] ANNOUNCER: Thank you for listening to Impact AI. If you enjoyed this episode, please subscribe and share with a friend. If you’d like to learn more about computer vision applications for people and planetary health, you can sign up for my newsletter at pixelscientia.com/newsletter.