Data leaders powering data-driven innovation
What happens when the data being used to improve customer experiences has unknown or inherent biases? JoAnn Stonier discusses the ethical use of data when implementing machine learning and AI use cases. She also shares her perspective on the steps organizations can take to keep bias from showing up in the data.
JoAnn previously served as the company’s Chief Information Governance and Privacy Officer, responsible for global privacy and information governance, and leading regulatory engagement for data compliance.
JoAnn is a recognized and highly sought-after thought leader in emergent data and privacy issues. She has advised industry executives, governments, intergovernmental organizations and NGOs. Currently, she serves on the United Nations Expert Group on Governance and Artificial Intelligence and is Co-Chair of the World Economic Forum’s Global Future Council on Data Policy. JoAnn also serves as a Board Advisor for Truata, a data trust co-founded by Mastercard and IBM, on the Board of Directors and Governance Committee for Hope For The Warriors, a nonprofit veteran service organization, and on the Board of Trustees and Executive & Finance Committees for Academy of Mount St. Ursula, where she attended high school in the Bronx, New York.
JoAnn received her Juris Doctorate from St. John’s University and her Bachelor of Science degree from St. Francis College. She holds memberships in the Bar of the State of New York and the Bar of the State of New Jersey. She is based in Purchase, N.Y.
Speaker 1:
Welcome to Champions of Data + AI, brought to you by Databricks. In each episode, we salute Champions of Data + AI, the change agents who are shaking up the status quo. These mavericks are rethinking how data and AI can enhance the human experience. We’ll dive into their challenges and celebrate their successes, all while getting to know these leaders a little more personally.
Chris D’Agostino:
Hi, I’m Chris D’Agostino, and welcome to the Champions of Data and AI podcast. It goes without saying that data is playing an active role in how we experience the world. From the entertainment we consume to our health and financial well-being and countless other examples, the impact of data is growing. What happens when the data being used to improve our experiences has unknown or inherent biases? In today’s episode, I’m joined by JoAnn Stonier, the Chief Data Officer of Mastercard, to discuss the ethical use of data when implementing machine learning and AI use cases. She also shares her perspective on the steps organizations can take to keep bias from showing up in the data. JoAnn, thanks for being here.
JoAnn Stonier:
Thanks, Chris. It’s a pleasure to join you and to be part of Champions of Data and AI.
Chris D’Agostino:
All right. So before we get started today, let’s level set our listeners a bit. In February of 2018, you were appointed as the first Chief Data Officer for Mastercard, and before that you were serving as the Chief Privacy Officer. Flash forward three years: how has that role evolved in that time?
JoAnn Stonier:
Well, if you go back in time, certainly the past year has been an interesting one for all of us, so we’ll talk about COVID in a bit. But looking back, my role really evolved because of all of the requirements that came out of GDPR, which was the first real global privacy regulation with a lot of operational data requirements embedded in it. We were focused on really improving our data practices at that time, trying to figure out how we were going to navigate a world of compliance while still enabling innovation, and navigating the risks that were coming out of the regulatory frame.
JoAnn Stonier:
When I look at where we are now, we still have all of that in our line of sight and we navigate it really well, but the world of innovation has sped up. Data is much better understood and desired as a raw material of innovation, and other types of data risk are better understood beyond just regulatory risk, privacy and security. Issues of data quality are better understood; issues related to AI and bias are understood. So we’re seeing the world go faster and faster toward data innovation and the related risks and opportunities that come with it. My position has really evolved more into a data strategy role. We were always at the strategy table trying to figure out what we could do with data, but navigating that so that we build capabilities, platforms and products that are a little bit more future-proof requires a very strategic lens.
Chris D’Agostino:
Yeah. And as a global company, you face this too. Having come from Capital One, where we had a presence in the United States but also in Canada and Great Britain, GDPR and CCPA were front and center for me as well in my role leading data engineering and building out the data compliance and data governance tooling. So certainly, with CCPA coming online in the last couple of years, that’s been a really big challenge for any organization that has to protect customer data.
JoAnn Stonier:
For us, it was just a design constraint. I think we were always leaders in privacy and data protection. When I think back to GDPR, we always had a privacy-by-design approach. GDPR took a couple of things more global than they were before, and certainly since then we’ve seen things like CCPA, laws in Brazil, and other countries that have copied some of the elements of GDPR. But the approach we took for GDPR was a global approach. We really took a look and said, why are we going to protect data for one set of constituents, one set of customers, and not others? And so in our responsible data approach, we’ve really taken a global view of what’s the right thing to do for individuals and for our customers with data as we innovate.
Chris D’Agostino:
So you’re a master of data governance by profession, right? But your education is in accounting and law, which I find fascinating. And the last time we spoke, you talked about your creative outlet being in design. So, is there an inflection point across these disciplines where your background in each of these areas helps you be more effective at your role?
JoAnn Stonier:
Well, I like to think there is. First of all, I like to learn; I’m a lifelong learner. On the accounting side, when you look back at my career, I started out as an accountant and an auditor. If you understand the auditing skillset, you have to understand how things connect and how controls work. One of the tools in an auditor’s skillset is flowcharting, and that’s one of the skills you learn in order to figure out how processes work and how controls are applied. Well, it’s not a far leap to see that that’s how computer systems are created and how data flows are created. So flowcharting is obviously a skill I use in my job today, but I used it back when I was an auditor, because computer-assisted auditing was just taking off when I was a baby auditor. So that skill comes full circle.
JoAnn Stonier:
My legal skills obviously come into play with the advent of different privacy and data protection laws, but also with all sorts of other data restriction laws we have to comply with in order to use data for innovation: contractual laws, on-soil laws, Fair Credit Reporting Act-type laws, banking laws. So understanding the restrictions is important when you’re designing systems and solutions. The design skills, though, are probably some of the most important skills that I have, because they really do help me think through how you create a solution that works with all the restrictions. So that’s how I would think of it.
Chris D’Agostino:
So JoAnne, thinking about AI more broadly and how artificial intelligence is applied to a whole range of use cases, whether it’s within Mastercard or outside, can you think of an example of how AI has positively impacted your life specifically in society more broadly?
JoAnn Stonier:
Well, I certainly can, because it’s something we’ve been working on in partnership with lots of organizations ever since COVID hit. Think about the environment we’re all living in: we are doing this on Zoom because we need to be safe and sound while the pandemic is still raging. AI has actually been used to help sort through all of the data to develop the vaccines. One of the powers of artificial intelligence and machine learning is its ability to process piles and piles of information to sort out what’s important and what isn’t. I’m not a health scientist, but from everything I’ve been reading about the vaccines, we would never have gotten to the solutions we’ve gotten to so quickly without artificial intelligence and machine learning.
JoAnn Stonier:
So I think we have to look at that and understand that the sequencing of the DNA and RNA has actually enabled us to get to solutions much faster. When I’m asked that question, I’m like, oh my gosh, we are living in this moment right now! That being said, at Mastercard we are not necessarily involved in that moment, but what we are involved in has to do with vaccine passports, which are going to be the next thing we’re all going to need in order to move around in this world. So connected data, not necessarily for AI, but connected data is also super important in this time. How do we navigate a world where we need to exchange information in a way that honors the individual? There are lots of different elements to where artificial intelligence begins, and then how it’s put to good use for multiple purposes.
JoAnn Stonier:
So we have the vaccines, but then how do we share information in a way that is privacy-sensitive but also impactful, so that all of us can navigate a world where we now need to know people’s status in order to travel and reconnect, see our families again, and conduct business in a way that’s safe for all of us? I think that’s the first use case.
JoAnn Stonier:
At Mastercard, we’ve been using AI and machine learning for fraud and for cyber. Those are also highly impactful ways to improve our lives and provide safety and security for our customers as well as our cardholders. All of that is incredibly impactful. It’s quieter perhaps than COVID, but I do think all of those use cases actually improve our lives. But I know we’ll also talk a little bit about some of the downsides of AI and machine learning. We’re beginning to understand those over the past year, as we have some of the conversations around the social impacts of data and how we need to make sure it’s more inclusive as well.
Chris D’Agostino:
Yeah. At Databricks, we’re really proud of the fact that our platform has been a data science engine driving a lot of the research around some of the drug therapies being developed, and the vaccines as well. Plus, the NHS in the UK is using Databricks to optimize patient care within the hospitals and to work with the track and trace program. So we’re applying machine learning algorithms to help ensure that beds are available in the ICU and things like that for patients.
Chris D’Agostino:
And so, you touched a bit on that global passport and the ethical use of data. I know that in 2019, Mastercard introduced a document on global data responsibility, and it basically talks about the imperative of making sure that you can hold data safely and securely, use it in ways that are ethical and compliant, and benefit others. And then of course you want to be able to innovate on top of the data sets you’re holding. Can you share with us a little bit about why you produced the document? I thought it was fantastic to see an organization like Mastercard, as big as you are, with as much data as you have on people, take that stance of being ethical with the data and how it might pertain to models that are built for algorithm development and things like that.
JoAnn Stonier:
We developed our data responsibility imperative in part because we thought it was really important that we begin to model for our customers, for our consumers, and for all of our business partners what our data practices were, because Mastercard really operates as a data ecosystem, processing billions of transactions every year. In 2019, we began to really realize that all of us are connected by our data. So we needed to not only model our behavior but also begin to say to everybody else that it’s really important to have responsible data practices. And we began with what’s most important, which is individuals: that data is very individually impactful, that we believe individuals have the right to own their data when it’s personal data, that they should understand how it’s being used, that they should have the opportunity to control it, and of course that they should have expectations regarding privacy and security.
JoAnn Stonier:
And we said, well, that’s going to be very nice, but the response is going to be, well, Mastercard, what are you doing about it? And that’s where we came up with our own principles, really about privacy and security, which we had dedicated practices to for a very long time. But in addition to that, things like accountability, that we would be accountable for how we were going to use your data; that we would be transparent; that we would provide controls; and that we would double down on things like integrity in our practices so that we could innovate. And in our innovation, we would be mindful of things like AI. We were also transparent that we were going to use data, not necessarily at an individualized level, but at an aggregated level to improve society, which I think was novel in 2019.
JoAnn Stonier:
When you go through a year like 2020, I think people can get behind it a little bit more now, after COVID. But it really was our attempt to begin a dialogue with other like-minded organizations, to say, okay, if we’re going to have these kinds of principles, then what are the practices that go along with them? How do you provide transparency? How are you accountable? How can you provide explanations? How can you allow individuals to have control over their data and access to that data? Certainly there are some laws that require it, but how can we go a step further? Those are the types of things that our data responsibility principles represent to us. There’s a whole set of practices at Mastercard that go along with them, and we work with other like-minded organizations to try to take that further as data and data use continue to evolve in our very interconnected, data-driven world.
Chris D’Agostino:
So let’s dive into that a little bit. In talking with other Databricks customers, ones that are transforming their organizations to be more AI- and data-driven, they talk about the way they do data science now and the way they train their models. From my own experience working at Capital One, you often saw single-node data science work being done, where you’d have a developer or a data scientist working on his or her laptop, pulling data into that environment. And the thing that most organizations are learning about AI in particular is that the most important factor is having a large volume of quality data to work off of for your feature engineering and your training. The smaller your data set, the more risk there is that there might be some inherent bias inside it. So can we talk a bit about bias? The paper notes that algorithms have, in certain instances, reinforced bias or spread misinformation. We’d love your thoughts on how you make sure bias doesn’t inadvertently get introduced through the data sets.
JoAnn Stonier:
Okay. This is a big topic, so I welcome your thoughts as well. When we started looking at our AI and machine learning processes, we broke it down into three key areas that we work on at Mastercard, starting with the data sets. While we agree that larger data sets are easier for our data scientists to work with and smaller data sets can be problematic, we really look at what data sets we are using: are they fit for purpose, what is the source of the information, and is there inherent bias in the data set itself? My favorite example of the moment, because we just came through an election cycle (I’ll have to update it soon), is the voter rolls from 1910. They may or may not be accurate, because they were probably prepared very manually. But those data sets are predominantly male. If you understand that, that’s fine. If you don’t, and you use that data set to try to solve a problem for society today, it’s going to be highly inaccurate, because there’s an inherent bias in the dataset.
JoAnn Stonier:
This is the challenge of AI and machine learning that I think we have to raise awareness of. The data set may be robust, it may have a good population, but it may not be fit for the purpose for which a data scientist is using it today. So what is in the data? How has it been compiled? Does it have any blind spots for the inquiry you’re trying to solve? Those questions are really super important in this day and age. We did not create all of our datasets for AI and machine learning; we created data sets for all sorts of reasons. And the machine, or the inquiry, is asking, well, what conditions exist in this data set and what can I learn from it?
JoAnn Stonier:
Well, the data set may have some incorrect or just inherent conditions that will tell the machine something, and then it will draw a conclusion that may be wrong about the population. It may conclude that women are less likely to vote than men. That is true from that dataset, but it’s not actually true if you apply it to today’s conditions. These are the things that we as data scientists and data designers have to understand as we look at different data sources, and as we look at the quality, the consistency, the accuracy, and the completeness of data sets as we put them into our AI and machine learning process. So that side I think is important.
Chris D’Agostino:
This is a really good example of the potential for model drift. You build a model, you train it on a given data set with all the right intentions, and then you realize there’s some bias baked into it. And then decision engines are executing that model and making decisions, right?
JoAnn Stonier:
Mm-hmm (affirmative). And it happens in the algorithm process itself, too. The machine can also draw these conclusions just from the baseline information. Another example: you feed in an English dictionary and a Spanish dictionary because you want to do a translation. The machine is going to read all of that in, and then it’s going to impute, from what it has been given, that doctors are men. Well, how did that happen? The pronouns in Spanish: ‘El Doctor’, right? Again, it has nothing to do with intentional misinformation. The machine is going to learn from whatever information it has been given. And those pronouns in Spanish will have implications in English, and in the information that gets derived, if we don’t understand what we fed into the system.
Chris D’Agostino:
Yeah, absolutely. So then there’s the process internal to a company: how do you recognize that the bias exists, how do you account for it, how do you fix it? We would do some very upstream analysis, looking at the data sets and how they evolve, how their schemas might evolve. We would actually use classification algorithms to determine if the actual data feeds are changing and if they’re consistent with the prior history, if you will. So, do you have mechanisms in place, or techniques you’d recommend to our listeners, for establishing that governance life cycle, making sure that the data continuing to come into the organization, and maybe new data sets that get derived, don’t inadvertently introduce this kind of bias?
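The feed-consistency check Chris describes can be sketched many ways; classifier-based checks are one, and a related, simpler technique is the Population Stability Index. The sketch below is illustrative only: bin count, thresholds, and sample feeds are arbitrary choices, not anyone’s production monitoring.

```python
# Sketch: monitor a numeric feed for drift against its prior history
# using the Population Stability Index (PSI). All parameters illustrative.

import math

def psi(expected, actual, bins=10):
    """PSI between a baseline sample and a new sample of one feature.
    Rough rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 large shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        n = len(sample)
        # Small floor avoids log(0) for empty bins.
        return [max(c / n, 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]        # last month's feed
shifted = [0.5 + i / 200 for i in range(100)]   # new feed, shifted upward
print(psi(baseline, baseline) < 0.1)   # True: identical samples look stable
print(psi(baseline, shifted) > 0.25)   # True: shifted feed gets flagged
```

In the classifier variant Chris mentions, you would instead train a model to distinguish “old feed” rows from “new feed” rows; if it succeeds better than chance, the feed has drifted.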
JoAnn Stonier:
Well, some of what you’re talking about. You have to consistently look at the data set that’s being consumed; you have to have the quality assessment ongoing, even if it’s a data set you’ve been consuming over time; you have to understand its lineage; you have to understand how it’s been created. That assessment has to be part and parcel of your AI and analytics process. You also have to look for proxy variables. Even when you think you understand a dataset, there are proxy variables you have to scrutinize heavily, and location is always one of them. Location is a variable that can derive a ton of information when it’s linked up with other things, so it’s going to be a proxy, or stand-in, for other information that the machine will begin to learn from if it’s connected enough times to a pattern. These are other types of scans that we put into our governance practice before anything else happens.
Chris D’Agostino:
So JoAnn, we’ve talked a lot today about ethics in data and AI. One of the things I’d love to hear from you: what advice would you give to your peers in similar roles who are trying to ensure that data ethics and AI are part of the solution for how they drive business value moving forward?
JoAnn Stonier:
Well, we’re doing some of this at Mastercard now. We’re beginning to really look at all of our products and solutions, recognizing that every business is becoming a data-driven business, and we’re trying to understand how we help our customers with their data needs. So look at and evaluate all of your products and solutions from a data perspective: what data goes into them, what your data sources are, and which vendors you’re using. We touched a little bit on this when we talked about responsible AI: do you understand the AI that you’re purchasing as much as the AI that you’re creating? I think everybody needs to understand that. But in addition, what products and solutions are you creating based on data, and do you have the responsible data practices to understand the quality of the information going into them? And do your product designers and all of your business folks understand some of the inadvertent biases you may be creating by using the wrong data for the wrong purpose? We’re starting to retrain our corporate brain to raise awareness of that.
JoAnn Stonier:
I think too that we need to really look at our people along the lines of training and education. We also have to make sure we have a diverse and inclusive workforce, so that you have lots of different perspectives to solve problems from, a multiverse of talent. If you’re missing that, you’re going to be missing the types of solutions that are needed, as well as the inadvertent impacts you could be creating in the solutions of the future. All of my peers struggle with talent, just because data talent is hard to find, but I do think creating the right kind of educational pipelines is really incumbent upon all of us as leaders, to make sure we have the right people sitting around the table designing into the future.
JoAnn Stonier:
And then really it is a question of governance and process. Are you going to put the ethical questions and challenges into your processes to make sure you ask those tough questions? And do you have somebody in your organization who’s authorized to say no? Are they truly empowered to say, we’re not going to go there, this doesn’t align with our principles? I literally was on a conversation this morning where we were talking about our principles and saying, you know what, this doesn’t line up, we’re not going to do it. And I was really happy, because it wasn’t me who said it, it was somebody else in the organization. So I know our principles are sticking, and that we’re sticking to our principles, which is really a nice place to be. I advocate that approach to a lot of my colleagues out there, because it’s a way to get other people in the organization, beyond the chief data officer, to be responsible.
Chris D’Agostino:
Yeah. Talking to you about all these different topics, the thing that comes to mind for me now is that we’ve come out of this election cycle, we’ve come at least partway through the COVID pandemic and the response, and now the rollout of the vaccines, with its global impact. And what I’m thinking about is misinformation and disinformation, and how much each of us has gotten rooted in our beliefs, when sometimes our beliefs aren’t fully based on facts. While in some ways people have been more divided, I think the awareness is much greater than ever before about what information we can trust, and the veracity of the data we’re consuming to form our opinions. So I’m curious, over the next five years, what do you see in terms of information accuracy and the ethical use of data? As crazy a year as 2020 was, how do you see it shaping the five years ahead?
JoAnn Stonier:
I’m so glad you brought this up. I was having this discussion at the risk committee of our board. We were doing a round robin of the risks ahead, what risks we need to navigate, and we were talking specifically about data risk; a whole bunch of folks put a whole host of risks up on the board. When they came to me, I spoke about misinformation as one of the key risks, and everyone was like, oh. I think data lineage is going to become increasingly important. Where did the data originate, how trustworthy is that source, is it a primary source or a secondary source, and what is the opportunity for that information to be manipulated?
JoAnn Stonier:
We have seen the impact of misinformation and what it can really bring about. How can we create the right kind of tagging mechanisms and exchanges, including blockchain and other mechanisms, so that we can begin to validate that information is accurate and complete before we use it in the type of data science we’ve been discussing in AI? Because the amplified effects of misinformation are really quite a scary future. So I think it’s incumbent upon all sorts of organizations in data-sharing ecosystems to figure out the lineage of information, which will then improve accuracy and decrease misinformation. But it’s also incumbent on us to come up with the right techniques to tell the stories around information, to make big data a little easier to access for the average person, so they understand how information about them is being processed and how it’s being used, so that they have greater comfort in the world that’s being created about them and around them, and, truthfully, so that it creates true benefit for them as well.
Chris D’Agostino:
Okay, JoAnn. So thanks, this was a really interesting discussion. One of the things we ask all of our data and AI leaders on this podcast: what advice would you give aspiring CDOs and CDAOs, people who really want to forge a career in data, data science, and data analytics?
JoAnn Stonier:
I think the first thing is to be curious. If you want a career in this space, know that the space will change; it’s the one constant. The use of data and the amount of data are only going to grow, and the techniques and tools are also going to increase. So you’ve got to stay current, and being curious about it and enjoying that change is a key skill. If you fight the change, you’re going to be miserable. And if you don’t like change, this is not a good space.
JoAnn Stonier:
I think the other tip is to be generous: be generous with your time, be generous with your knowledge. Because of that change, you kind of have to be willing to give of your knowledge in order to gain additional knowledge. And the last piece is to always push at the margins of your knowledge. I had a boss who taught me that a long, long time ago, because it’s the best way to learn. So just keep learning. Data has been really, really good to me as a career. It has kept me interested, it has kept me excited, and I also think it has incredible possibilities to solve problems. So I think it’s a great career choice. If people are interested, they should absolutely jump in. It’s a lot of fun.
Speaker 1:
Thank you for joining this episode of Champions of Data and AI, brought to you by Databricks. Thousands of data leaders rely on Databricks to simplify data and AI, so data teams can innovate faster and solve the world’s toughest problems. Visit databricks.com to learn how data leaders are unlocking the true potential of all their data.