While the subject of algorithmic fairness and privacy is now front-page news, today we will speak with two leading lights in that area; they will help us understand where the state of the art is now and where it is going. With that, we welcome Professor Kearns first to the stage. [applause]

Good morning, and thank you for coming. My name is Michael Kearns, and with my close friend and colleague I have written a general-audience book called The Ethical Algorithm. For roughly half an hour we want to take you, at a high level, through the major themes of the book, and then we will open it up to Q&A.

Many people, and certainly this audience, are aware that in the past decade machine learning has gone from relatively obscure to mainstream news. You could characterize the first half of the decade as the glory years, when all the reports were positive about amazing advances like deep learning applied to speech recognition, image categorization, and many other areas, and we all enjoy the great benefits of this technology and the advances made. But the last few years have been a buzzkill. Many articles, and now some popular books, have been written on the collateral damage caused by algorithmic decision-making, especially when powered by AI and machine learning. Weapons of Math Destruction was a big seller, and it did a very good job of making real and visceral the ways algorithmic decision-making can result in discriminatory predictions and racial discrimination. Data and Goliath is a well-known book arguing that we are close to a commercial surveillance state driven by what companies do with our data. We read these books and liked them very much, along with many others like them, but one thing we found lacking was that when you get to the solutions section, what is proposed is fairly traditional: we need better laws, we need regulations, we need to keep an eye on this. We agree with all of that, but as researchers working in the field we also know there is a movement to design algorithms that are better in the first place. Rather than waiting for a predictive model to cause harm, you can think about making the algorithm better at the outset, and there is now a fairly large scientific community trying to do exactly that. So you can think of our book as a popular-science attempt to explain to the reader how you can encode and embed social norms directly into algorithms themselves.

Now, a couple of remarks. We got a review on an early draft of the book that said: I think your title is a conundrum, or an oxymoron. How could an algorithm be any more ethical than a hammer? A hammer is a tool, a human-designed artifact built for particular purposes, and while it is possible to make unethical use of it, say by hitting you on the head, nobody makes the mistake of ascribing unethical behavior to the hammer itself. If I hit you on the hand, you would blame me and not the hammer; we know real harm came to you because of me hitting you. So, the reviewer said, I don't see why the same arguments don't apply to algorithms.
We thought about this and decided we disagreed: algorithms are different, even though they are also tools, for a couple of reasons, because it is difficult to predict their outcomes and to ascribe blame. Especially when they are powered by AI and machine learning, there is a pipeline. You usually start with complicated data, complicated in the sense of being very high dimensional, with a lot of variables, like a database of medical records, and we may not even understand where it came from in the first place. The usual methodology is to take that data and turn it into an optimization problem over an objective landscape: we want a model that does well on the data in front of us, and that objective is primarily or exclusively concerned with predictive accuracy. There is nothing more natural than to take a data set, say, find the neural network that best fits it, and use it to decide who gets a loan. What results is a complicated, high-dimensional model, like the clip art of a deep neural network with many layers that you see all over the internet. The point is that the pipeline is very diffuse, and it may not be easy to pin down blame: was it the data or the optimization procedure? Even worse, if this predictive model causes real harm, if you are falsely denied a loan because the network said you should be, we may not even be aware of it, because we give these systems so much autonomy. To hit you on the hand with a hammer, I have to pick it up and hit you; these models run autonomously, without human intervention, so we may not realize the harms being caused unless we explicitly know to look for them.

So our book is about how to make things better by revisiting that pipeline and modifying it in ways that give us social norms like privacy, fairness, accountability, et cetera. One of the interesting things about this endeavor is that even though many scholarly communities have thought about these social norms long before us, philosophers have been thinking about fairness, and people have thought hard about privacy, they never had to think about them so precisely that you could actually write them into a computer program or algorithm. Sometimes just the act of forcing yourself to be that precise teaches you things about these concepts that you would not discover any other way.

So the whirlwind tour of the book is a series of discussions about different social norms: what the science looks like when you give a norm a precise, mathematical definition, embed it in an algorithm, and face the consequences of doing that, including the trade-offs. In general, if I want an algorithm that is more fair, it comes at the cost of less accuracy. We have organized the chapters according to how mature the science is in each area. When it comes to privacy, the field is the most mature: there is what we think is the right definition, and quite a bit is known about how to embed that definition in algorithms. Fairness is a little lighter, but off to a very good start. Then come things like accountability and morality, where we feel there are not yet even good technical definitions, and, I promise you, the singularity. For the rest of our time we will talk about privacy and fairness, and the twist the story takes halfway through.
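Before turning to privacy, it may help to see the standard pipeline just described in code. This is a minimal sketch in Python: the data is synthetic, the "repaid loan" label is invented, and logistic regression stands in for whatever model, neural network or otherwise, you might actually fit. The only point it illustrates is that the objective being optimized mentions predictive accuracy and nothing else.

```python
# A minimal sketch of the standard ML pipeline: data in, accuracy out.
# Everything here is synthetic/hypothetical -- the point is that the
# objective mentions only predictive accuracy, never fairness or privacy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 20))  # 20 applicant features (far higher dimensional in practice)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)  # "repaid loan"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)  # stand-in for any model, including a deep net
model.fit(X_train, y_train)                # the optimization cares only about fitting the data

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
# Nothing in this pipeline asks whether the resulting decisions are fair or private.
```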
As Michael mentioned, data privacy is by far the most well-developed of these fields, so I want to spend a few minutes giving a brief history and, in the process, go through a case study of how we can learn to think precisely about a social norm. If you had asked someone 20 or 25 years ago what it meant to release a data set privately, what they really had in mind was anonymization: take the records, remove the names and unique identifiers like Social Security numbers, and keep only attributes like age or zip code that would not uniquely identify me.

In the mid-1990s the state of Massachusetts decided to release a data set of hospital records, useful for medical researchers, corresponding to every state employee. In this data set there were no names or Social Security numbers, but there were ages, zip codes, and genders. It turns out that although age, zip code, and gender are each not enough to identify you on their own, in combination they can be. A researcher, now a professor at Harvard, figured out that you could cross-reference the anonymized data set with Cambridge voter registration records, which also contained demographic information, and with those identifiers pick out the medical record of the governor of Massachusetts at the time. She sent it to his desk to make a point.

This was a big deal in the study of data privacy, and people tried to fix the problem by applying a band-aid to whatever had enabled the most recent attack. For example, if combinations of zip code and age are the problem, then instead of reporting age exactly maybe we report it in ranges, and we report only the first three digits of the zip code, and we keep coarsening until any combination of attributes we release corresponds to more than one person. Then, even if I know my 56-year-old neighbor, who is a woman, was at the Hospital of the University of Pennsylvania, the guarantee is that I cannot connect those attributes to a single record. For a little while people tried to do this. But if you look at such a data set, you may already begin to realize it is not what we mean by privacy: if I know my 56-year-old neighbor attended the Hospital of the University of Pennsylvania, maybe I cannot figure out her exact diagnosis, but I can figure out that she might have colitis, say, because she matches only a couple of records. And it does not stop there. Suppose I also know she is a patient at a second hospital, and that hospital has released its records anonymized the same way. That release may even look a little better on its own, because my female neighbor matches three of its records. But once both data sets have been released, I can cross-reference them, and there may be only one record that could possibly correspond to my neighbor, and now I have her diagnosis.

This is the overall problem, and it is the same one we had when we merely removed names: these attempts at privacy might work if the data set I was releasing were the only thing out there, but small amounts of idiosyncratic information are enough to identify you once other released data sets can be cross-referenced against mine. People kept trying to patch this up, and for a long time data privacy was a cat-and-mouse game: heuristic fixes for whatever vulnerability led to the most recent attack, attackers trying new things. It was a losing game, because we were trying to do things we thought were private without ever saying what privacy meant. So the approach we will take instead is to think from first principles about what privacy should mean, start with something that turns out to be too strong, and work our way to the right answer. Let's think about what privacy should mean.
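Before getting to that, here is roughly what the cross-referencing attack just described looks like in code. This is a sketch on entirely made-up records with hypothetical column names, not any real data release; it only shows how joining on quasi-identifiers re-attaches names to "anonymous" records.

```python
# Sketch of a linkage ("re-identification") attack on naively anonymized data.
# All records here are made up; the column names are hypothetical.
import pandas as pd

# "Anonymized" hospital release: names removed, quasi-identifiers kept.
hospital = pd.DataFrame([
    {"zip": "02138", "age": 56, "gender": "F", "diagnosis": "colitis"},
    {"zip": "02139", "age": 34, "gender": "M", "diagnosis": "flu"},
    {"zip": "02138", "age": 71, "gender": "F", "diagnosis": "diabetes"},
])

# Public voter registration records with the same demographic attributes.
voters = pd.DataFrame([
    {"name": "Jane Doe",   "zip": "02138", "age": 56, "gender": "F"},
    {"name": "John Smith", "zip": "02139", "age": 34, "gender": "M"},
])

# Joining on (zip, age, gender) re-attaches names to "anonymous" diagnoses
# whenever the combination of attributes is unique.
reidentified = voters.merge(hospital, on=["zip", "age", "gender"])
print(reidentified[["name", "diagnosis"]])
```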
Here is a first attempt: if your data is used in a medical study, nobody should be able to learn anything about you as a particular individual that they could not have learned had the study never been conducted. To make this concrete, consider the British Doctors Study, one of the first pieces of evidence that smoking and lung cancer are correlated. Every British doctor was invited to participate, and about two thirds did, agreeing to have their medical records included as part of the study. Very quickly the association emerged. So imagine you are one of those doctors. If you are a smoker, you probably made no attempt to hide the fact; in that era you might even have been smoking during a presentation like this one. But once the study is published, anyone who knows you smoke now knows something else: that you are at increased risk for lung cancer, because the world has learned the new fact that the two are correlated. In the United States that could cause you concrete harm; your health insurance rates might go up.

So if we say privacy means nothing new should be learnable about you as a result of a study, we have to call the British Doctors Study a violation of your privacy. But there are a couple of things wrong with that. First of all, the story plays out in exactly the same way even if you had decided not to have your data included. The supposed violation, the fact that I learned you are at higher risk of lung cancer, is not something I learned from your data in particular. I already knew you were a smoker; the harm comes from a fact about the world, that smoking and lung cancer are correlated, and that was never your secret to keep, because I could have discovered it without your data, from any large sample of the population. And if we call that a violation of privacy, then we cannot do data analysis at all, because there will always be correlations between what is publicly observable about you and what you would rather people not know, and I could not uncover any correlation without committing a violation of this type. So this first attempt at saying what privacy should mean is too strong.

The real breakthrough came in 2006, when computer scientists introduced the idea of differential privacy. The goal is to promise something very similar, but with a slight twist. Think about two possible worlds: not a world where the study is carried out versus one where it is not, but instead an alternative world in which the study is still carried out, only without your data. Everything is the same, except your data has been removed from the data set. The idea is that in this idealized world there can be no privacy violation, because we never even looked at your data. In the real world your data was used; but if there is no way for anyone, looking at the study's outputs, to tell substantially better than random guessing whether they are in the real world or the idealized one, then we should say your privacy was at most minimally violated. That is what differential privacy requires, and "substantially" is governed by a numerical parameter, so we can quantify the trade-off between accuracy and privacy.

When you think about it, this sounds like a satisfying definition, but you might worry it is too strong to allow anything useful to be done. I won't go through an example in detail here, but fifteen years of research show that essentially any statistical task, which includes all of machine learning, can be carried out under differential privacy, although at a cost that typically manifests itself as diminished accuracy. And beyond the academic work, this has become something that is widely deployed: if you have an iPhone, for example, it may be actively reporting usage statistics under differential privacy, and the biggest deployment is coming in just about a year, when the US 2020 Census releases its data products under differential privacy.
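As a small illustration of how a differentially private computation can work, here is a sketch of the standard Laplace mechanism applied to a counting query. The data and the query are made up; the noise scale of 1/epsilon is what makes the output distribution nearly the same in the two worlds, with and without any one person's record, since the true count changes by at most one.

```python
# Minimal sketch of the Laplace mechanism, a textbook way to answer a
# counting query under epsilon-differential privacy. The data and query
# here are made up for illustration.
import numpy as np

rng = np.random.default_rng()

def private_count(data, predicate, epsilon):
    """Return a noisy count of records satisfying `predicate`.

    Adding or removing one person's record changes the true count by at
    most 1 (the "sensitivity"), so Laplace noise with scale 1/epsilon
    makes the output distribution nearly indistinguishable between the
    two worlds -- with and without your data.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many people in the data set are smokers?
data = [{"smoker": True}, {"smoker": False}, {"smoker": True}]
print(private_count(data, lambda r: r["smoker"], epsilon=0.5))
# Smaller epsilon = stronger privacy but noisier (less accurate) answers.
```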
So that is the sense in which we say privacy is the most well-developed of these areas: not that we know everything there is to know, but we have a strong definition, we understand the algorithms needed to satisfy it while still doing useful things with data, and it is now real, deployed technology.

I will now talk in a similar way about algorithmic fairness. The study of fairness is less mature than privacy, and in particular we already know it will be messier. We argue that anyone who thinks long and hard enough about data privacy will arrive at differential privacy as the right definition; for fairness, we know there is no such monolithic definition. In the past few years there have been a couple of publications of the broad form: can we all agree that any good definition of fairness should satisfy the following few mathematical properties? And when you read them you say, yes, of course, these are weak, minimal properties; I want these and stronger ones besides. The punchline is, guess what: there are theorems proving that no definition of fairness can achieve all of them simultaneously. A little more concretely, what this can mean in real applications is that if you try to reduce the discriminatory behavior of your algorithm by gender, it may come at the cost of increased discrimination along some other dimension, so you face not just statistical trade-offs but moral and conceptual ones. But this is the reality, the way things are, so we still propose proceeding: pick a definition, study it, and understand its consequences.

So I want to show you how things can go wrong in the form of racial or gender discrimination, and how that leads to a proposal for addressing the collateral damage. Why machine learning? Even in the past few weeks, I'm sure you've heard of notable instances: a health assessment model that is widely used in healthcare systems was shown to exhibit systematic racial discrimination, and, less scientifically, there was a Twitter storm recently over the Apple credit card, with a number of reports from married couples where the husband said, my wife and I file taxes jointly, she has a higher credit rating, but I got ten times the credit limit she did. We actually met with the regulator's office that is investigating this particular issue. Unlike the health assessment model, we don't yet know whether this reflects systematic underlying gender discrimination, but it is exactly the kind of concern we are talking about.

So, just as with the medical databases earlier, let me take you through a stylized example of how things can go wrong when we build predictive models. Imagine we were asked to help a university develop a predictive model of collegiate success based on only two variables, high school GPA and SAT score. What I'm showing is a sample of data points: the x value of each point is an applicant's high school GPA and the y value is their SAT score. This is a sample of individuals who were actually admitted, so we know whether they succeeded or not, and by "succeed" pick any quantifiable, objective definition we can measure, say that you graduated within five years of matriculating with at least a 3.0 GPA, or that you donate at least 10 million dollars within 20 years of leaving. That is what the plus and minus signs mean: a plus marks the GPA and SAT of someone who succeeded, and a minus means otherwise. Looking at this cloud of points, first of all, if you counted carefully, slightly less than half succeeded; there are slightly more minuses than pluses. More importantly, if you want to build a good predictive model of whether applicants will succeed, there is a line you can draw such that if we predict everyone above it will be successful and everyone below it will not, we do a pretty good job.
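Here is a minimal sketch of that "draw a line" step: a linear classifier trained to predict success from GPA and SAT on synthetic data. All the numbers and the data-generating assumptions are invented for illustration.

```python
# Sketch of the "draw a line" step: a linear classifier over two features
# (high school GPA and SAT score), trained on synthetic, made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
gpa = rng.uniform(2.0, 4.0, size=n)
sat = rng.normal(1100 + 150 * (gpa - 3.0), 100, size=n)
# Hypothetical ground truth: success depends on a combination of both scores.
succeeded = (0.8 * (gpa - 3.0) + 0.004 * (sat - 1100)
             + rng.normal(0, 0.3, n) > 0).astype(int)

X = np.column_stack([gpa, sat])
model = LogisticRegression(max_iter=5000).fit(X, succeeded)

# The learned decision boundary is the line w1*gpa + w2*sat + b = 0:
w1, w2 = model.coef_[0]
b = model.intercept_[0]
print(f"predict 'success' when {w1:.3f}*gpa + {w2:.5f}*sat + {b:.3f} > 0")
print("training accuracy:", model.score(X, succeeded))
```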
It's not perfect, but for the most part we're doing a good job, and in this simplified form that is what machine learning is doing, even with deep neural networks: finding a boundary that separates the positives from the negatives.

Now suppose there is a second population in the same applicant pool; call them the orange population. First of all, they are a minority in the literal, mathematical sense: there are fewer orange points than green points. Their data also looks different: their SAT scores are systematically lower, but they are no less qualified for college. There are exactly as many orange pluses as orange minuses, so it's not that this population is less successful in college, even though they have lower SAT scores. One reason you might imagine this being the case: the green population is wealthier and can afford SAT preparation courses and multiple retakes of the exam, while the orange population, with fewer resources, self-studies, takes the exam once, and takes what they can get.

If we had to build a predictive model just for the orange population, there is a good one: a line that perfectly separates its positives from its negatives. The problem arises when we look at the combined data set and ask for the single best model on the whole population, and you can see it visually: if I tried to lower the line to accept the qualified orange applicants, I would pick up so many green minuses that the overall error would increase. So the optimal single model on the aggregated data is intuitively unfair: it rejects essentially all of the qualified orange applicants. If we call this the false rejection rate, it is close to 100 percent on the orange population and close to 0 percent on the green population. Of course, what we should do instead is notice that the orange population has systematically lower SAT scores and build a two-part model: if you are green we apply this line, and if you are orange we apply that line. Compared to the single model on the aggregated data, this would not only be more fair, it would also be more accurate.
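Here is a sketch of that whole stylized example in code: two synthetic populations, a single model trained on the aggregate, and per-group models, compared by false rejection rate. The data-generating assumptions are mine and purely illustrative, chosen to reproduce the qualitative story above, where the single model rejects almost all qualified members of the minority group while the two-part model does not.

```python
# Sketch of the stylized two-population example: a single aggregate model
# versus per-group models, compared by false rejection rate.
# All data is synthetic and the numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_group(n, sat_shift):
    gpa = rng.uniform(2.0, 4.0, size=n)
    qualified = (rng.random(n) < 0.5).astype(int)        # success rate independent of the shift
    sat = 1000 + sat_shift + 200 * qualified + rng.normal(0, 30, n)
    return np.column_stack([gpa, sat]), qualified

X_green, y_green = make_group(900, sat_shift=300)        # majority, higher SAT scores
X_orange, y_orange = make_group(100, sat_shift=0)        # minority, systematically lower SATs

X_all = np.vstack([X_green, X_orange])
y_all = np.concatenate([y_green, y_orange])

def false_rejection_rate(model, X, y):
    preds = model.predict(X)
    qualified = (y == 1)
    return np.mean(preds[qualified] == 0)                # fraction of qualified people rejected

single = LogisticRegression(max_iter=5000).fit(X_all, y_all)
print("single model FRR, green: ", false_rejection_rate(single, X_green, y_green))
print("single model FRR, orange:", false_rejection_rate(single, X_orange, y_orange))

# A two-part model: one line per group, as described above.
per_green = LogisticRegression(max_iter=5000).fit(X_green, y_green)
per_orange = LogisticRegression(max_iter=5000).fit(X_orange, y_orange)
print("per-group FRR, green: ", false_rejection_rate(per_green, X_green, y_green))
print("per-group FRR, orange:", false_rejection_rate(per_orange, X_orange, y_orange))
```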