Own work when the subject of algorithmic fairness or privacy is not front-page news. Today we are going to speak to two of the leading lights in this area; they will help us understand what the state of the art is and what the state of the art will be going forward. With that, I think we will welcome Professor Michael Kearns first to the stage. Is that right? Great. Michael and Aaron, welcome to the stage. [applause]

OK, good morning. Thanks to everyone for coming. My name is Michael Kearns, and with my close friend and colleague Aaron Roth I have coauthored a book, a general-audience book called The Ethical Algorithm, whose subtitle is The Science of Socially Aware Algorithm Design. What we want to do for roughly half an hour or so is just take you at a high level through some of the major themes of the book, and then we'll open it up, as Jeff said, to Q&A.

I think many, many people, and certainly this audience, are well aware that in the past decade or so machine learning has gone from a relatively obscure corner of AI to mainstream news, and I would characterize the first half of this decade as the glory period, when all the news reports were positive and we were hearing about all these amazing advances in areas like deep learning, which has applications in speech recognition, image processing, image categorization, and many, many other areas. We all enjoyed the great benefits of this technology and the advances that were made, but the last few years or so have been more of a buzzkill, and there have been many, many articles written, and now even some popular books, on essentially the collateral damage that can be caused by algorithmic decision-making, especially when powered by AI and machine learning. Here are a few of those books. Weapons of Math Destruction did a very good job of making very real and visceral and personal the ways in which algorithmic decision-making can result in discriminatory predictions, like gender discrimination, racial discrimination, or the like. Data and Goliath is a well-known book about the fact that we've essentially become something akin to a commercial surveillance state, and about the breaches of privacy and trust and security that accompany that.

Aaron and I read these books, and we liked these books very much, and many others like them, but one of the things we found lacking, which was much of the motivation for writing our own, was that when you get to the solutions section of these books, what should we do about these problems, the solutions suggested are what I would consider traditional ones. They basically say we need better laws, better regulations, watchdog groups. We really need to keep an eye on this stuff. We agree with all of that, but as computer scientists and machine learning researchers working directly in the field, we also know there's been a movement in the past five to ten years to design algorithms that are better in the first place. Rather than waiting, after the fact, for some predictive model to exhibit racial discrimination in criminal sentencing, you could think about making the algorithm better in the first place. There's now a fairly large scientific community in the machine learning research area and in adjacent areas that is trying to do exactly that. Our book, you can think of it as a popular science book that tries to explain to the reader how you would go about trying to encode and embed social norms that we care about directly into algorithms themselves. A couple of preparatory remarks.
We got a review on an early draft of the book that basically said, I think your title is a contradiction, or possibly even an oxymoron. What do you mean, an ethical algorithm? How can an algorithm be any more ethical than a hammer? This reviewer pointed out that an algorithm, like a hammer, is a tool; it is a human artifact designed for particular purposes. While it's possible to make an unethical use of a hammer, for instance, I might decide to hit you, nobody would make the mistake of ascribing unethical behavior or immoral agency to the hammer itself. If I hit you on the hand with a hammer, you would blame me and not the hammer, and you and I would both know that real harm had come to you because of my hitting you on the hand with a hammer. This reviewer said, I don't see why the same arguments don't apply to algorithms.

We thought about this and decided we disagreed. We think algorithms are different, even though they are indeed just tools, human artifacts designed for particular purposes. With algorithms it's very difficult to predict outcomes, and also difficult to ascribe blame. Part of the reason is that algorithmic decision-making, when powered by AI and machine learning, has a pipeline. Let me quickly review what that pipeline is. You usually start off with very, very complicated data, complicated in the sense of being high-dimensional; it has many variables, and it might have many, many rows. Think of something like a medical database of individual citizens' medical records. We may not understand this data in detail, and we may not understand where it came from in the first place. It may have been gathered from many different sources. The usual pipeline or methodology of machine learning is to take that data and turn it into some sort of optimization problem. We have some objective landscape over a space of models, and we want to find the model that does well on the data in front of us. Usually that objective is primarily, or often even exclusively, concerned with predictive accuracy or some notion of utility or profit. There's nothing more natural in the world, if you are a machine learning researcher or practitioner, than to take a data set and say, let's find the neural network that on this data makes the fewest mistakes in deciding who to give a loan, for example. You do that, and what results is, perhaps again, a complicated, high-dimensional model. This is classic clip art from the internet for deep learning: a neural network with many, many layers between the inputs and outputs, and lots of transformations of the data and variables happening.

So the point is, a couple of things about this pipeline. It's diffuse. If something goes wrong in this pipeline, it might not be entirely easy to pin down the blame. Was it the data, the objective function, the optimization procedure that produced the neural network, or was it the neural network itself? Even worse than that, if the predictive model we use at the end causes real harm to somebody, if you are wrongly denied a loan because the neural network said you should be denied the loan, we may not, when this is happening at scale behind the scenes, even be aware of it in the way I would be aware of having hit your hand with a hammer. Also, we give algorithms so much autonomy. To hit your hand with a hammer I had to pick the thing up and hit you. These days algorithms are often running autonomously, without any human intervention, so we may not realize the harms they cause unless we know to explicitly go looking for them.
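To make that pipeline concrete, here is a minimal sketch of the workflow just described: take historical lending data and search for the model that makes the fewest mistakes on it, with predictive accuracy as the only objective. The data, feature names, and model choice below are invented for illustration and are not from the book; the point is only that nothing about fairness or privacy appears anywhere in the objective.

    # Minimal sketch of the standard machine learning pipeline described above.
    # All data and column meanings are synthetic and hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    n = 5000

    # Hypothetical historical loan applications: income, debt, credit history.
    X = np.column_stack([
        rng.normal(50_000, 15_000, n),   # income
        rng.normal(10_000, 5_000, n),    # outstanding debt
        rng.integers(0, 30, n),          # years of credit history
    ])
    # Hypothetical past outcomes: 1 = repaid, 0 = defaulted.
    y = (X[:, 0] - 2 * X[:, 1] + 1_000 * X[:, 2]
         + rng.normal(0, 20_000, n) > 30_000).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # "Find the model that makes the fewest mistakes": accuracy is the entire
    # objective; no social norm is encoded anywhere in this code.
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))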
Our book is about how to make things better, not through regulation and laws and the like, but by revisiting this pipeline and modifying it in ways that give us the social norms we care about, like privacy, fairness, accountability, et cetera. One of the interesting and important things about this endeavor is that even though many, many scholarly communities and others have thought about these social norms before us, for instance, philosophers have been thinking about fairness since time immemorial, and lots of people have thought about things like privacy and the like, they've never had to think about these things in such a precise way that you could actually write them into a computer program or an algorithm. Sometimes just the act of forcing yourself to be that precise can reveal flaws in your intuitions about these concepts that you were not going to discover any other way, and we will give concrete examples of that during the presentation.

The whirlwind tour of the book is a series of discussions about different social norms, which I've written down here, and what the science looks like of going in and giving a precise definition to these things, a mathematical definition, then encoding that definition in an algorithm, and importantly, what the consequences of doing that are, in particular the trade-offs. In general, if I want an algorithm that is fair or private, that might come at the cost of less accuracy, for example, and we can talk about this as we go. You'll notice I've written the different social norms in increasing shades of gray, and what that roughly represents is our subjective view of how mature the science in each one of these areas is. In particular, we think that when it comes to privacy, this is the field that is, in relative terms, the most mature: there is what we think is the right definition of data privacy, and quite a bit is known about how to embed that definition in powerful algorithms, including machine learning. Fairness, which is a little bit lighter, is a more recent, more nascent field, but it's off to a very good start. And things like accountability, interpretability, or even morality are in grayer shades because in these cases we feel there are not yet good technical definitions, so it's hard to even get started encoding these things. I promise you there is a bottom bullet there which is written entirely in white, so you can't even see it. What we're going to do with the rest of our time is talk about privacy and fairness, which cover roughly the first half of the book, and then we will spend a few minutes telling you about the sort of game-theoretic twist the book takes about midway through. So I'm going to turn it over to Aaron for a bit now.

Thanks. So as Michael mentioned, privacy is by far the most well-developed of these fields we talk about, and so I want to spend a few minutes just giving you a brief history of the study of data privacy, which is about 20 years old, and in that process try to drive home a key lesson about thinking precisely about definitions. It used to be, maybe 20 or 25 years ago, that when people talked about releasing data sets in a way that was privacy-preserving, what they had in mind was an attempt at anonymization. I would have some data sets of individuals' records; they might have people's names, and if I wanted to release them, I would try to anonymize the records by removing the names and, if I was careful, other unique identifiers like Social Security numbers. I would keep things like age or zip code, features about people that on their own were not enough to uniquely identify anyone.
So in 1997 the state of Massachusetts decided to release a data set that would be useful for medical researchers. Medical data sets are hard for researchers to get their hands on because of privacy concerns, and the state of Massachusetts had an enormous data set of medical records. They released the data set in a way that was anonymized: there were no names, no Social Security numbers, but there were ages, zip codes, genders. It turns out that although age is not enough to uniquely identify you, and zip code is not enough to uniquely identify you, in combination they can be. There was a student who was at MIT at the time, now a professor at Harvard, who figured this out. In particular, she figured out you could cross-reference the supposedly anonymized data set with voter registration records, which also had demographic information like zip code, age, and gender, but together with names. She cross-referenced this anonymized medical data set and was able, with this triple of identifiers, to identify the medical records of Bill Weld, who was governor of Massachusetts at the time. She sent his records to his desk to make a point.

This was a big deal in the study of data privacy, and people tried to fix this problem by basically just applying little band-aids, trying to most directly fix whatever the most recent attack was. So, for example, people thought, all right, if it turns out combinations of zip code, gender, and age can uniquely identify someone in the records, why don't we try coarsening that information? Instead of reporting age exactly, maybe we report it only up to a ten-year range; maybe we report zip code only up to three digits, and we will do this so we can make sure that no combination of attributes in the table that we release corresponds to just one person. So, for example, if I know that my 56-year-old neighbor, who is a woman, attended some hospital, maybe the Hospital of the University of Pennsylvania, and they released an anonymized data set in this way, then they have the guarantee that I cannot connect the attributes I know about my neighbor to just one record; I can only connect them to two records, not one. For a little while people tried doing this. If you think about it, though, if you look at the data set you might already begin to realize this isn't getting quite what we mean by privacy, because although, if I know my 56-year-old female neighbor attended the Hospital of the University of Pennsylvania, I can't figure out what her diagnosis is, since her attributes correspond to two records, I can figure out that she has either HIV or colitis, which might already be something she didn't want me to know. But if both of these data sets have been released, I can just cross-reference them, and there's a unique record, there's only one record that could possibly correspond to my neighbor, and all of a sudden I've got her diagnosis. So the overall problem here is the same as it was when we just tried removing names: attempts at privacy like this might work if the data set I was releasing were the only thing out there, but that's never the case. The problem is that small amounts of idiosyncratic information are enough to identify you, in ways I can uncover if I cross-reference the data sets that have been released with other data sets that are out there. So people tried patching this up as well, but for a long time the history of data privacy was a cat-and-mouse game where researchers would try to do heuristic things, patching up whatever vulnerability led to the most recent attack.
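As a concrete illustration of the kind of cross-referencing attack just described, here is a small sketch in which an "anonymized" medical release is joined against a public voter roll on the remaining quasi-identifiers. All records, names, and values below are invented; the real attack worked the same way, just at scale.

    # Sketch of a linkage (re-identification) attack: the released table has
    # no names, but zip code, age, and sex can be joined against a public record.
    import pandas as pd

    medical = pd.DataFrame({          # released "anonymized" data
        "zip": ["02138", "02138", "19104"],
        "age": [56, 56, 71],
        "sex": ["F", "M", "M"],
        "diagnosis": ["colitis", "flu", "heart disease"],
    })

    voters = pd.DataFrame({           # public voter roll, with names
        "name": ["A. Neighbor", "B. Voter"],
        "zip": ["02138", "19104"],
        "age": [56, 71],
        "sex": ["F", "M"],
    })

    # Join on the quasi-identifiers: any voter who matches exactly one
    # medical record has been re-identified, diagnosis and all.
    linked = voters.merge(medical, on=["zip", "age", "sex"])
    print(linked[["name", "diagnosis"]])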
And attackers would try to do new, clever things, and this was a losing game for privacy researchers. Part of the problem is that we were trying to do things we hoped were private without ever defining what we meant by privacy. This was an approach that was too weak. Let me now, in an attempt to think about what privacy might mean, talk about an approach that is too strong, and then we will find the right answer.

So you might say, okay, let's think about what privacy should mean. Maybe, if I'm going to use data sets to conduct, for example, medical studies, what I want is that nobody should be able to learn anything about you as a particular individual that they couldn't have learned about you had the study not been conducted. That would be a strong notion of privacy if we could promise it. And maybe to make it more concrete, let's look at what has come to be known as the British Doctors Study, a study carried out by Doll and Hill in the 1950s; it was the first piece of evidence that smoking and lung cancer had a strong association. It's called the British Doctors Study because every doctor in the UK was invited to participate in the study, and two-thirds of them actually did. Two-thirds of doctors in the UK agreed to have their records included as part of the study, and very quickly it became apparent there was a strong association between smoking and lung cancer.

So imagine that you're one of the doctors who participated in the study. Say you're a smoker, and this is the 1950s, so you definitely made no attempt to hide the fact that you're a smoker; you'd probably be smoking during this presentation, and everyone knows you're a smoker. But when this study was published, all of a sudden everyone knows something else about you that they didn't know before. In particular, they know you are at an increased risk for lung cancer, because all of a sudden we have learned this new fact about the world, that smoking and lung cancer are correlated. In fact, if you were in the US, this might have caused you concrete harm at the time, in the sense that your health insurance rates might have gone up; so this could have caused you concrete, quantifiable harm. So if we were going to say that what privacy means is that nothing new should be learned about you as a result of conducting a study, we would have to call the British Doctors Study a violation of your privacy.

But there are a couple of things that are wrong with that. First of all, observe that the story could have played out in exactly the same way even if you were one of the doctors who decided not to have your data included in the study. The supposed violation of your privacy in this case, the fact that I learned you are at higher risk of lung cancer, wasn't something I learned from your data in particular. I already knew you were a smoker before the study was carried out. The supposed violation of privacy would have to be attributed to the fact about the world that I learned, that smoking and lung cancer are correlated, and that wasn't your secret to keep. And the way we know it wasn't your secret to keep is that I could have discovered it without your data; I could have discovered it from any sufficiently large sample. If we were going to call things like that a violation of privacy, we couldn't do any data analysis at all, because there are always going to be correlations between things that are publicly observable about you and things you didn't want people to know, and I couldn't uncover any correlation in the data at all without having a privacy violation of this type.
So this was an attempt at thinking about what privacy should mean, at giving it a semantics, but it was one that was too strong. A real breakthrough came in 2006, when a team of mathematical computer scientists had the idea for what is now called differential privacy. The goal of differential privacy is to find something that's similar to what we wanted to promise in the British Doctors Study, but with a slight twist. So again, think about two possible worlds, but now don't think about the world in which the study is carried out and the world in which the study is not carried out; instead, think about the world in which the study is carried out and an alternative world where the study is still carried out, but without your data. Everything is the same except your data was removed from the data set. And the
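The comparison between these two neighboring worlds is exactly what the formal definition of differential privacy, introduced by Dwork, McSherry, Nissim, and Smith in 2006, makes precise. As a reference point, one standard way to state it, where M is the randomized analysis, D and D' are data sets differing only in your record, and S is any set of possible outcomes:

    \[
      \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,]
      \quad \text{for all neighboring } D, D' \text{ and all outcome sets } S,
    \]

so the probability of any outcome changes by at most a factor of e^epsilon whether or not your data is included.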