Transcripts for CSPAN2: Michael Kearns and Aaron Roth, The Ethical Algorithm (July 13, 2024)

...talk about their book, The Ethical Algorithm. I think a day does not go by, in the news or otherwise or in our own work, when the subject of algorithmic fairness or privacy is not front-page news. Today we're going to speak to the two leading lights in that area, and they're going to help us understand what the state of the art is now and what it will be going forward. With that, I think we will welcome Michael Kearns first to the stage. Michael Kearns and Aaron Roth, welcome to the stage. [applause]

Good morning, and thanks to everyone for coming. My name is Michael Kearns, and with my close friend and colleague Aaron Roth I have coauthored a general-audience book called The Ethical Algorithm; the subtitle is The Science of Socially Aware Algorithm Design. What we want to do for roughly half an hour or so is take you at a high level through some of the major themes of the book, and then we will open it up, as Jeff said, to Q&A.

I think many, many people, and certainly this audience, are well aware that in the past decade or so machine learning has gone from a relatively obscure corner of AI to mainstream news. I would characterize the first half of this decade as the glory period, when all the news reports were positive and we were hearing about all these amazing advances in areas like deep learning, which has applications in speech recognition, image processing, image categorization and many other areas, so we all enjoyed the great benefits of this technology and the advances that were made.

But the last few years or so have been more of a buzzkill. There have been many articles written, and now even some popular books, on essentially the collateral damage caused by algorithmic decision-making, especially algorithmic decision-making powered by AI and machine learning. Here are a few of those books. Weapons of Math Destruction was a big bestseller a few years ago, and it did a good job of making very real and visceral and personal the ways in which algorithmic decision-making can result in discriminatory predictions, like gender discrimination, racial discrimination or the like. Data and Goliath is a well-known book about the fact that we've essentially become a commercial surveillance state, and about the breaches of privacy and the loss of trust and security that accompany that.

Aaron and I have read these books, and we like these books very much, and many others like them. But one of the things we found lacking in them, which was much of the motivation for writing our own, is that when you get to the solutions section of these books, to what we should do about these problems, the solutions suggested are what we consider traditional ones. They basically say we need better laws, better regulations; we need watchdog groups; we need to keep an eye on this stuff. And we agree with all that. But as computer scientists and machine learning researchers working in the field, we also know there has been a movement in the past 5 to 10 years to design algorithms that are better in the first place. Rather than waiting, after the fact, for some predictive model to, let's say, exhibit racial discrimination in criminal sentencing, you could think about making the algorithm better in the first place, and there is now a fairly large scientific community in machine learning research and adjacent areas who do exactly that. So our book is a science book.
We're trying to explain to the reader how you would go about encoding and embedding the social norms that we care about into algorithms themselves.

Now, a couple of preparatory remarks. We got a review on an early draft of the book that basically said: I think your title is a conundrum, or possibly even an oxymoron. What do you mean, an ethical algorithm? How can an algorithm be more ethical than a hammer? This reviewer pointed out that an algorithm, like a hammer, is a tool. Humans design artifacts for particular purposes, and while it's possible to make unethical use of a hammer, for instance I might decide to hit you on the hand, nobody would make the mistake of ascribing any unethical behavior or immoral activity to the hammer itself. If I hit you on the hand with a hammer, you would blame me for it, and you and I would both know that real harm had come to you because of my hitting you on the hand with the hammer. So this reviewer basically said: I don't see why these same arguments don't apply to algorithms.

We thought about this for a while, and we decided we disagree. We think that algorithms are different, even though they are indeed tools, human artifacts designed for particular purposes. We think they're different for a couple of reasons. One of them is that it's difficult to predict outcomes and also difficult to ascribe blame, and part of the reason for this is that algorithmic decision-making, when powered by AI and machine learning, is a pipeline. So let me review what that pipeline is.

You usually start off with some perhaps complicated data, complicated in the sense that it's high-dimensional, has many variables, and might have many rows. Think of a medical database, for instance, of individual citizens' medical records. We may not understand this data in any detail, and may not even understand where it came from in the first place; it may have been gathered from many disparate sources. Then the usual pipeline or methodology of machine learning is to take that data and turn it into some sort of optimization problem. We have some objective landscape over the space of models, and we want to find a model that does well on the data in front of us, and usually that objective is primarily, or often exclusively, concerned with predictive accuracy or some notion of utility or profit. There's nothing more natural in the world, if you're a machine learning practitioner, than to take a data set and say: let's find the neural network that, on this data, makes the fewest mistakes in deciding who to give a loan. So you do that, and what results is some perhaps complicated, high-dimensional model. This is classic clip art of deep learning from the internet: a neural network with many layers between the input and output, and lots of transformations of the data variables.

The point is a couple of things about this pipeline. It's very diffuse: if something goes wrong in this pipeline, it might not be easy to pin down the blame. Was it the data, was it the optimization procedure that produced the neural network, or was it the neural network itself? And worse than that, if the algorithm or the predictive model that we use at the end causes real harm to somebody, if you are falsely denied a loan, for instance, because the neural network decided you should be denied a loan, we may not even be aware that it's happening, unlike when I hit you on the hand with a hammer. That's also because we give algorithms so much autonomy these days. To hit you on the hand with a hammer, I have to pick the thing up and hit you.
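To make this pipeline concrete, here is a minimal sketch on synthetic data; the loan setting, the feature names, and the choice of a small neural network are illustrative assumptions rather than anything from the book, and the only thing being optimized is predictive accuracy.

```python
# A minimal sketch of the standard pipeline: data -> optimization problem ->
# fitted model, where the only objective is predictive accuracy. The loan
# setting, feature names, and model size are hypothetical; the data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 5000

# Hypothetical applicant features: income, existing debt, years of credit history.
X = np.column_stack([
    rng.normal(50, 15, n),    # income (thousands)
    rng.normal(20, 10, n),    # existing debt (thousands)
    rng.integers(0, 30, n),   # years of credit history
])
# Synthetic "repaid the loan" label driven by those features plus noise.
y = ((0.04 * X[:, 0] - 0.05 * X[:, 1] + 0.03 * X[:, 2]
      + rng.normal(0, 1, n)) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Find the neural network that makes the fewest mistakes on this data":
# nothing in this objective says anything about fairness or privacy.
model = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```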
These days, algorithms run autonomously without human intervention, so we may not even realize the harms being caused unless we know to explicitly look for them. So our book is about how to make things better, not through regulation and laws but by actually revisiting this pipeline and modifying it in ways that give us the various social norms that we care about, like privacy, fairness, accountability and so on.

One of the interesting and important things about this endeavor is that even though many, many scholarly communities and others have thought about these social norms before us, philosophers, for instance, have been thinking about them since time immemorial, and lots of people have thought about privacy and the like, they never had to think about these things in such a precise way that you could actually write them into a computer program or into an algorithm. Sometimes just the act of forcing yourself to be that precise can reveal flaws in your intuitions about these concepts that you weren't going to discover any other way, and we will give concrete examples of that during our presentation.

So the whirlwind high-level tour of the book is a series of chapters about different social norms, some of which I've written down here, and what the science looks like when you actually go in and give a precise, mathematical definition to these things, then encode that mathematical definition in an algorithm, and, importantly, what the consequences of doing that are, in particular the trade-offs. In general, if I want an algorithm that's more fair or more private, it comes at the cost of less accuracy, for example, and we will talk about this as we go.

You will notice I've written these different social norms in increasing shades of gray, and what that roughly represents is our subjective view of how mature the science in each of these areas is. In particular, we think that privacy is, in relative terms, the most mature: there is what we think is the right definition of data privacy, and quite a bit is known about how to embed that definition in powerful algorithms, including machine learning. Fairness is a little bit lighter; it's a more recent, more nascent field, but it's off to a very good start. Things like accountability and interpretability, or even morality, are in lighter shades still, because in these cases we feel there aren't good technical definitions yet, so it's hard to even get started encoding these things in algorithms. And I promise you there's a bottom tier which says "the singularity," but it's entirely in white, so you can't even see it.

So what we're going to do with the rest of our time is talk about privacy and fairness, which cover roughly the first half of the book, and then we will say a few words about the game-theoretic twist the book takes about midway through. With that, I'm going to turn it over to Aaron for a bit.

So as Michael mentioned, privacy is by far the most well-developed field that we talk about, so I want to spend a few minutes just giving you a brief history of the study of data privacy, which is about 20 years old now, and in the process go through a case study of how we might think precisely about definitions. It used to be, maybe 20 or 25 years ago, that when people talked about releasing data sets in a way that was privacy-preserving, what they had in mind was anonymization.
I would have some data set of individuals, one record per person, and the records might have people's names in them. If I wanted to release this, I would just try to anonymize the records by removing the names and, maybe if I was careful, other unique identifiers like Social Security numbers, but keep things like age or zip code, features about people that aren't, on their own, enough to uniquely identify anyone.

So in 1997, the state of Massachusetts decided to release a data set that would be useful for medical researchers. Medical data sets are hard for researchers to get their hands on because of privacy concerns, and the state of Massachusetts had an enormous data set of medical records, the records of every state employee in Massachusetts, and they released this data set in a way that was anonymized: there were no names or Social Security numbers, but there were ages, there were genders, there were zip codes. It turns out that although age is not enough to uniquely identify you, zip code is not enough to uniquely identify you, and gender is not enough, in combination they can be. There was a PhD student named Latanya Sweeney, who was at MIT at the time, who figured this out. In particular, she figured out you could cross-reference the supposedly anonymized data set with voter registration records, which also had demographic information like zip code, age and gender, together with names. She cross-referenced the anonymized medical data set with the voter registration records of Cambridge, Massachusetts, and was able, with that triple of identifiers, to identify the medical record of William Weld, who was the governor of Massachusetts at the time. She put those records on the net to make a point.

So this was a big deal in the study of data privacy, and for a long time people tried to attack the problem basically just by using little band-aids, trying to most directly fix whatever the most recent attack was. So, for example, people noted: all right, it turns out that combinations of zip code and gender and age can uniquely identify someone in a record, so why don't we try coarsening that information? Instead of reporting age exactly, maybe we report it up to an interval of 10 years; maybe we only report zip code up to three digits. And we will do this so that we can make sure that any combination of attributes in the table that we release doesn't correspond to just one person. So, for example, if I know that my 56-year-old neighbor is a woman who attended some hospital, maybe the hospital at the University of Pennsylvania, and they release an anonymized data set in this way, they guarantee that I cannot use the attributes that I know about my neighbor to narrow things down to just one record. I can only connect them to, say, two records, which is less revealing. For a little while people tried doing this.

But if you think about it, if you look at the data set, you might already begin to realize this isn't getting quite what we need from privacy. Although, if I know that my 56-year-old female neighbor attended the hospital at the University of Pennsylvania, I can't figure out exactly what her diagnosis is, because she corresponds to multiple records, I can still narrow it down to a couple of possibilities, and learning that HIV is one of them might already be something she didn't want me to know. But the problem actually goes much deeper. Suppose that I know she's been a patient not just at one hospital but at two hospitals, and the other hospital has also released records anonymized in the same way, maybe even a little better, so that there, too, my 56-year-old female neighbor matches more than one record.
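Here is a minimal sketch, on invented tables, of the two ideas in this part of the talk: coarsening quasi-identifiers so that no combination of attributes matches just one record, and how two such releases can still be cross-referenced to pin down a single person's diagnosis, which is exactly the failure described next. All names, ages, zip codes and diagnoses below are hypothetical, and pandas is assumed just to make the table operations explicit.

```python
# Invented, hypothetical records; nothing here reproduces any real data set.
import pandas as pd

def coarsen(df):
    """Generalize quasi-identifiers: drop names, age to a 10-year band, zip to 3 digits."""
    out = df.drop(columns=["name"])
    out["age_band"] = (out.pop("age") // 10) * 10          # e.g. 56 -> 50
    out["zip3"] = out.pop("zip").astype(str).str[:3]       # e.g. 19104 -> "191"
    return out

hospital_a = pd.DataFrame({
    "name":      ["Alice", "Beth", "Carol", "Dan"],
    "age":       [56, 52, 58, 34],
    "zip":       [19104, 19104, 19103, 19104],
    "sex":       ["F", "F", "F", "M"],
    "diagnosis": ["colitis", "HIV", "migraine", "flu"],
})
hospital_b = pd.DataFrame({
    "name":      ["Alice", "Erin", "Frank"],
    "age":       [56, 51, 60],
    "zip":       [19104, 19103, 19104],
    "sex":       ["F", "F", "M"],
    "diagnosis": ["colitis", "asthma", "flu"],
})

release_a, release_b = coarsen(hospital_a), coarsen(hospital_b)

# What I know about my neighbor: she is 56, female, lives in zip 19104,
# and has been a patient at both hospitals.
def candidates(release):
    return release[(release.age_band == 50) &
                   (release.zip3 == "191") &
                   (release.sex == "F")]

match_a, match_b = candidates(release_a), candidates(release_b)
print(len(match_a), "candidate records in release A")    # 3: neither release alone
print(len(match_b), "candidate records in release B")    # 2: pins her down
# But her diagnosis must appear among the candidates in BOTH releases:
print(set(match_a.diagnosis) & set(match_b.diagnosis))   # {'colitis'}: re-identified
```

Each release on its own satisfies the "more than one matching record" guarantee; it is only the intersection of the two candidate sets that exposes the diagnosis.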
If both of these data sets have been released, I can cross-reference them, and there is a unique record, only one record, that can possibly correspond to my neighbor, and all of a sudden I've got her diagnosis.

So the overall problem here is the same as it was when we just tried removing names. Maybe attempts at privacy like this would work if the data set that I was releasing were the only thing out there. But that's never the case, and the problem is that small amounts of idiosyncratic information are enough to identify you, in ways I can uncover if I cross-reference the data set that has just been released with all the other stuff that's out there. People tried patching this up as well, but for a long time the history of data privacy was a cat-and-mouse game in which data privacy researchers would do heuristic things, patching up whatever vulnerability led to the most recent attack, and then attackers would try new clever things, and this was a losing game for privacy researchers. Part of the problem is that we were trying to do things we hoped were private without ever really defining what we meant by privacy. This was an approach that was too weak.

Now let me, in an attempt to think about what privacy might mean, talk about an approach that is too strong, and then we will find the right answer. You might say: okay, let's think about what privacy should mean. Maybe, if I'm going to use data sets to conduct, for example, medical studies, what I want is that nobody should be able to learn anything about you as a particular individual that they couldn't have learned about you had the study not been conducted. That would be a very strong notion of privacy if we could promise it.

Let me make this more concrete with what's come to be known as the British Doctors Study, a study carried out by Doll and Hill in the 1950s. It was the first piece of evidence that smoking and lung cancer have a strong association. It's called the British Doctors Study because every doctor in the UK was invited to participate, and two-thirds of them did; two-thirds of the doctors in the UK agreed to have their medical records included as part of the study. Very quickly, it became apparent that there was a strong association between smoking and lung cancer. Now imagine you are one of the doctors who participated in the study, and say you're a smoker. This is the '50s, so you have definitely made no attempt to hide the fact that you're a smoker; you would probably be smoking during this presentation, and everyone knows that you're a smoker. But when the study is published, all of a sudden everyone knows something about you that they didn't know before. In particular, they know that you are at an increased risk for lung cancer, because all of a sudden they have learned this new fact about the world, that smoking and lung cancer correlate. If you're in the US, this might have caused you concrete harm at the time, in the sense that your health insurance rates might have gone up; it could have caused you concrete, quantifiable harm. So if we were going to say that what privacy means is that nothing new should be learned about you as a result of conducting a study, we would have to call the British Doctors Study a violation of your privacy.

But there are a couple of things wrong with this. First of all, observe that the story would have played out in exactly the same way even if you were one of the doctors who decided not to have their data included in the study.
The supposed violation of your privacy in this case, the fact that I learned you are at higher risk of lung cancer, wasn't something that I learned from your data in particular. I knew you were a smoker before the study was carried out. For this to be a violation of your privacy, the fact about the world that was learned, that smoking and lung cancer are correlated, would have to have been your secret to keep. And the way we know it wasn't your secret to keep is that I could have discovered it without your data; I could have discovered it from any sufficiently large sample of the population. If we were going to call things like that violations of privacy, we couldn't do any data analysis at all, because there are always going to be correlations between things that are observable about you and things you didn't want people to know, and I couldn't uncover any correlation in the data at all without committing a privacy violation of this kind. So this was an attempt at thinking about what privacy should mean, at getting a semantics for it, but it was one that was too strong. And the real breakthrough...
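The "not your secret to keep" point can be illustrated with a small simulation on invented numbers: running a hypothetical study with and without one doctor's record yields essentially the same conclusion about smokers, so the inference about any one smoker does not depend on that person's data.

```python
# Invented numbers, purely for illustration: the smoking/lung-cancer association
# is a fact about the population, learned with or without any one record.
import numpy as np

rng = np.random.default_rng(1)
n = 20000

smoker = rng.random(n) < 0.5
# Hypothetical risk rates: higher cancer incidence among smokers.
cancer = rng.random(n) < np.where(smoker, 0.15, 0.03)

def risk_ratio(smoker, cancer):
    return cancer[smoker].mean() / cancer[~smoker].mean()

keep = np.ones(n, dtype=bool)
keep[0] = False   # doctor #0 opts out of the study

print(f"risk ratio, everyone included: {risk_ratio(smoker, cancer):.2f}")
print(f"risk ratio, without doctor #0: {risk_ratio(smoker[keep], cancer[keep]):.2f}")
# Either way, anyone who already knows that doctor #0 smokes now infers an
# elevated cancer risk for them; the correlation was never their record's secret.
```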
