What is a data audit?

A data audit examines every stage of the data science process. Problems can be introduced at any stage, so a full audit requires close examination of each one. In another post I'll talk about tests given to uncover problems and possible remediation for problems found at each stage. For now I'll assume we have full access to the model in question, although many of these questions can be addressed even when there are limits to access.

At the highest level a data audit has four phases:

  1. DATA
  2. DEFINE
  3. BUILD
  4. MONITOR

In order to audit a given algorithm we delve into phase-specific questions.

DATA-related questions:

  1. What data have you collected? Is it relevant and do you have enough and the right kind?
  2. What is the integrity of this data? Does it have bias? Is some of the data more or less accurate? How do you test this?
  3. Is your data systematically missing important types of data? Is it under- or over-representing certain types of events, behaviors, or people?
  4. How are you cleaning the data, dealing with missing data, outlying data, or unreasonable data? What is your ground truth for dealing with this kind of question?
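As a starting point for questions 3 and 4, here is a minimal sketch of two checks: how far each group's share of the data is from a reference share, and how often a field is missing within each group. The function names, the record layout, and the toy records are all assumptions made for illustration; a real audit would need a defensible source for the reference shares.

```python
from collections import Counter

def representation_gaps(records, group_key, reference_shares):
    """Sample share minus reference share for each group (positive = over-represented)."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - ref_share
        for group, ref_share in reference_shares.items()
    }

def missing_rate_by_group(records, group_key, field):
    """Fraction of records in each group where `field` is None."""
    seen, missing = Counter(), Counter()
    for r in records:
        seen[r[group_key]] += 1
        if r.get(field) is None:
            missing[r[group_key]] += 1
    return {g: missing[g] / seen[g] for g in seen}

# Toy, made-up records: group labels and incomes are hypothetical.
records = [
    {"group": "A", "income": 40000},
    {"group": "A", "income": None},
    {"group": "A", "income": 52000},
    {"group": "B", "income": None},
]
gaps = representation_gaps(records, "group", {"A": 0.5, "B": 0.5})
rates = missing_rate_by_group(records, "group", "income")
print(gaps)
print(rates)
```

In this toy data group A is over-represented relative to the assumed 50/50 reference, and group B is missing income far more often, which is the kind of systematic gap the questions above are asking about.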

DEFINE-related questions:

  1. How do you define "success" for your algorithm? Are there other related definitions of success, and what do you think would happen if you tweaked that definition? 
  2. What attributes do you choose to search through to potentially associate with success or failure? To what extent are your attributes proxies instead of directly relevant to the definition of success, and what could go wrong?

BUILD-related questions:

  1. What kind of algorithm should you use?
  2. How do you calibrate the model?
  3. How do you decide when the algorithm has been optimized?

MONITOR-related questions:

  1. To what extent is the model working in production?
  2. Does it need to be updated over time?
  3. How are the errors distributed?
  4. Is the model creating unintended consequences?
  5. Is the model playing a part in a larger feedback loop?

On measuring disparate impact

For the past few days I've been contemplating how the Consumer Financial Protection Bureau (CFPB), or anyone for that matter, might attempt to measure disparate impact. This is timely because the CFPB is trying to nail auto dealers for racist practices, and an important part of those cases is measuring who should receive restitution and how much.

As I wrote last week, the CFPB has been under fire recently for using an imperfect methodology to guess at a consumer's race with proxy information such as zip code and surname. Here's their white paper on it. I believe the argument between the CFPB and the bankers they're charging with disparate impact hinges on the probability threshold they use: too high, and you get a lot of false negatives (skipping payments to minority borrowers), too low and a lot of false positives (offering money to white borrowers).
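To make the threshold trade-off concrete, here's a minimal sketch. It assumes we somehow know each borrower's true minority status alongside the proxy model's estimated probability, which is exactly what isn't available in practice; the function name and the toy numbers are made up for illustration and have nothing to do with the CFPB's actual methodology.

```python
def proxy_errors(borrowers, threshold):
    """Count (false negatives, false positives) at a given probability threshold.

    borrowers: list of (estimated_probability_minority, is_minority) pairs.
    A borrower is treated as minority when the estimate is >= threshold.
    """
    false_neg = sum(1 for p, is_min in borrowers if is_min and p < threshold)
    false_pos = sum(1 for p, is_min in borrowers if not is_min and p >= threshold)
    return false_neg, false_pos

# Hypothetical borrowers: (proxy probability, true status).
borrowers = [
    (0.95, True), (0.80, True), (0.60, True), (0.55, False),
    (0.30, True), (0.20, False), (0.10, False),
]

for threshold in (0.25, 0.5, 0.9):
    fn, fp = proxy_errors(borrowers, threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

Raising the threshold trades false positives for false negatives, so whichever side pays for one kind of error has a reason to argue for moving it.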

Actually, though, the issue of who is what race is only one source of uncertainty among many. Said another way, even if we had a requirement that the borrowers specify their race on their loan application forms, like they do for mortgages because of a history of redlining (so why don't we do it for other loans too?), we'd still have plenty of other questions to deal with statistically.

Here's a short list of those concerns, again assuming we already know the minority status of borrowers:

  1. First, it has to be said that it's difficult if not impossible to prove an individual case of racism. A given loan application might have terms that are specific to that borrower and their situation. So it is by nature a statistical thing - what terms and interest rates do the pool of minority applicants get on their loans compared to a similar pool of white applicants?
  2. Now assume the car dealerships have two locations. The different locations could have different processes. Maybe one of them, location A, is fairer than the other, location B. But if the statistics are pooled, the overall disparate impact will be estimated as smaller than it should be for location B but bigger for location A.
  3. Of course, it could also be that different car dealers in the same location treat their customers differently, so the same thing could be happening in one location.
  4. Also, over time you could see different treatment of customers. Maybe some terrible dude retires. So there's a temporal issue to consider as well.
  5. The problem is, if you try to "account" for all these things, at least in the obvious way where you cut down your data, you end up looking at a restricted location, for a restricted time window, maybe for a single car dealer, and your data becomes too thin and your error bars become too large.
  6. The good thing about pooling is that you have more data and thus smaller error bars; it's easier to make the case that disparate impact has taken place beyond a reasonable statistical doubt.
  7. Then again, the way you end up doing it exactly will obviously depend on choices you make - you might end up deciding that you really need to account for location, and it gives you enough data to have reasonably small error bars, but another person making the same model decides to account for time instead. Both might be reasonable choices.
  8. And so we come to the current biggest problem the CFPB is having, namely gaming between models. Because there are various models that could be used, such as I've described, there's always one model that ends up costing the bank the least. They will always argue for that one, and claim the CFPB is using the wrong model which "overestimates" the disparate impact.
  9. They even have an expert consultant who works both for the CFPB and the banks and helps them game the models in this way.
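The tension between pooling and slicing in the list above can be sketched with a little arithmetic. This is a simplified illustration, not anyone's actual model: it assumes the quantity of interest is a difference in mean interest rates, that the spread of rates is the same in every slice, and (for the pooled gap) that each location has the same mix of minority and white borrowers. All numbers are made up.

```python
import math

def diff_standard_error(sd_a, n_a, sd_b, n_b):
    """Standard error of the difference between two group means."""
    return math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

def pooled_gap(gaps_and_sizes):
    """Size-weighted average of per-location gaps.

    Only equals the pooled estimate under the assumption that every
    location has the same share of minority borrowers.
    """
    total = sum(n for _, n in gaps_and_sizes)
    return sum(gap * n for gap, n in gaps_and_sizes) / total

# Pooled data: 4000 borrowers per group. One slice (one location, one
# time window): 40 per group. The standard deviation of 2.0 is hypothetical.
pooled_se = diff_standard_error(2.0, 4000, 2.0, 4000)
slice_se = diff_standard_error(2.0, 40, 2.0, 40)
print(f"pooled SE={pooled_se:.3f}, single-slice SE={slice_se:.3f}")

# Location A has no gap, location B has a gap of 2 points, equal sizes.
print(pooled_gap([(0.0, 500), (2.0, 500)]))
```

Cutting the data by a factor of 100 makes the error bars 10 times wider, while pooling reports a gap of 1 point that is too high for location A and too low for location B.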

For this reason, I'd suggest we have some standards for measuring disparate impact, so that the "gaming between models" comes to an end. Sure, the model you end up choosing won't be perfect, and it might be itself gameable, but I'm guessing the extent of gaming will be smaller overall. And, going back to the model which guesses at someone's minority status, I think the CFPB needs to come up with a standard threshold for that, and for the same reason: not because it's perfect, but because it will prevent banks from complaining that other banks get treated better.

Sentencing more biased by race than by class

Yesterday I was pleased to be passed along this blogpost from lawyerist.com called Uncovering Big Bias with Big Data and written by David Colarusso, a lawyer who became a data scientist.

For the article, David mines a recently opened criminal justice data set from Virginia, and asks the question, what affects the length of sentence more: income or race? His explanation of each step is readable by non-technical people.

The answer he came up with is race, by a long margin, although he also found that class matters too.

In particular he fit his data with the outcome variable set to length of sentence in days - or rather, log(1 + that term), which he explains nicely - and he chose the attributes to be the gender of the defendant, a bunch of indicator variables to determine the race of the defendant (one for each race except white, which was the "default race," which I thought was a nice touch), the income of the defendant, and finally the "seriousness of the charge," a system which he built himself and explains. He gives a reasonable explanation of all of these choices except for the gender.

His conclusion:

For a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year.

This sentence follows directly from staring at this table for a couple of minutes, if you imagine two defendants with the same characteristics except one is white and the other is black:

It's simplistic, and he could have made other choices, but it's a convincing start. Don't trust me though, take a look at his blogpost, and also his github code which includes his iPython notebook.
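Here's a sketch of the arithmetic behind a statement like the $90,000 one. In a linear model of log(1 + sentence length), the race indicator shifts the prediction by its coefficient, and income shifts it by its coefficient times the income. Setting the two shifts to cancel gives the income needed to offset the race term. The coefficients below are placeholders chosen for illustration, not the values from David's table, and the function name is my own.

```python
def income_equivalent(beta_race, beta_income):
    """Extra income at which beta_race + beta_income * income == 0.

    beta_race:   coefficient on the race indicator (log-days scale)
    beta_income: coefficient per dollar of income (expected to be negative)
    """
    return -beta_race / beta_income

# Placeholder coefficients, NOT the fitted values from the blogpost.
beta_race = 0.2
beta_income = -0.00001

extra_income = income_equivalent(beta_race, beta_income)
print(f"Offsetting income: ${extra_income:,.0f} per year")
```

The comparison only makes sense when the two defendants are identical in every other attribute, since the remaining terms of the model then cancel out.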

I am so glad people are doing this. Compared to shitty ways of using data, which end up doubling down on poor and black folks, this kind of analysis shines a light on how the system works against them, and gives me hope that one day we'll fix it.

ProPublica report: recidivism risk models are racially biased

Last May an exciting ProPublica article entitled Machine Bias came out. Written by Julia Angwin, author of Dragnet Nation, and Jeff Larson, data journalist, the piece explains in human terms what it looks like when algorithms are biased.

Specifically, they looked into a class of models I featured in my upcoming book, Weapons of Math Destruction, called "recidivism risk" scoring models. These models score defendants and give those scores to judges to help them decide how long to sentence them to prison, for example. Higher scores of recidivism are supposed to correlate to a higher likelihood of returning to prison, and people who have been assigned high scores also tend to get sentenced to longer prison terms.

What They Found

Angwin and Larson studied the recidivism risk model called COMPAS. Starting with COMPAS scores for 10,000 criminal defendants in Broward County, Florida, they looked at the difference between who was predicted to get rearrested by COMPAS versus who actually did. This was a direct test of the accuracy of the risk model. The highlights of their results:

  • Black defendants were often predicted to be at a higher risk of recidivism than they actually were. Our analysis found that black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent).
  • White defendants were often predicted to be less risky than they were. Our analysis found that white defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).
  • The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.
  • Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.
  • The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.
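The misclassification figures above are false positive and false negative rates computed separately for each group. A minimal sketch of that computation follows. The record layout, the group labels "x" and "y", and the counts are all invented for illustration; this is not ProPublica's code or data.

```python
from collections import defaultdict

def _rate(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def error_rates_by_group(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, predicted_high, reoffended in records:
        if reoffended:
            key = "tp" if predicted_high else "fn"
        else:
            key = "fp" if predicted_high else "tn"
        counts[group][key] += 1
    return {
        group: {
            # Share of people who did NOT reoffend but were labeled high risk.
            "false_positive_rate": _rate(c["fp"], c["fp"] + c["tn"]),
            # Share of people who DID reoffend but were labeled low risk.
            "false_negative_rate": _rate(c["fn"], c["fn"] + c["tp"]),
        }
        for group, c in counts.items()
    }

# Made-up counts for two hypothetical groups.
records = (
    [("x", True, False)] * 4 + [("x", False, False)] * 6
    + [("x", False, True)] * 3 + [("x", True, True)] * 7
    + [("y", True, False)] * 2 + [("y", False, False)] * 8
    + [("y", False, True)] * 5 + [("y", True, True)] * 5
)
rates = error_rates_by_group(records)
print(rates)
```

In this toy data group x is wrongly flagged twice as often as group y, while group y's reoffenders are more often missed, which is the same shape of asymmetry the report describes.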

Here's one of their charts (lower scores mean low-risk):

How They Found It

ProPublica has the highest standards in data journalism. Which is to say, they published their methodology, including a description of the (paltry) history of other studies that looked into racial differences for recidivism risk scoring methods. They even have the data and the ipython notebook they used for their analysis on github.

They made heavy use of the open records law in Florida to do their research, including the original scores, the subsequent arrest records, and the classification of each person's race. That data allowed them to build their analysis. They tracked both "recidivism" and "violent recidivism" and tracked both the original scores and the error rates. Take a look.

How Important Is This?

This is a triumph for the community of people (like me!) who have been worrying about exactly this kind of thing but who haven't had hard proof until now. In my book I made multiple arguments for why we should expect this exact result for recidivism risk models, but I didn't have a report to point to. So, in that sense, it's extremely useful.

More broadly, it sets a standard for how to do this analysis. The transparency involved is hugely important, because nobody will be able to say they don't know how these statistics were computed. They are basic questions by which every recidivism risk model should be measured.

Having said that, the question of false negatives and false positives, which Angwin's team focused on, is not the standard for optimizing recidivism risk algorithms. Instead, the creators of these models tend to optimize for a "well-calibrated" model, which is to say to make sure rates of high-risk defendants of a given race reflect actual recidivism rates for that race. There are plenty of reasons this might not be ideal, but it is the norm.
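To see why calibration and equal error rates are different standards, here's a toy sketch. It assumes one common reading of calibration: a "high" score corresponds to the same reoffense rate in every group, and likewise for a "low" score. The two groups differ only in how many people get scored high. Every number and name here is hypothetical.

```python
def error_rates_from_calibrated_scores(n_high, n_low, p_high, p_low):
    """Return (false positive rate, false negative rate) for one group.

    n_high, n_low: number of people scored high / low risk
    p_high, p_low: share of each score bucket that actually reoffends
    """
    tp = n_high * p_high          # scored high, reoffended
    fp = n_high * (1 - p_high)    # scored high, did not reoffend
    fn = n_low * p_low            # scored low, reoffended
    tn = n_low * (1 - p_low)      # scored low, did not reoffend
    return fp / (fp + tn), fn / (fn + tp)

# Same meaning of the score in both groups: 60% of "high" and 20% of "low"
# reoffend. Group x has half its members scored high, group y one fifth.
fpr_x, fnr_x = error_rates_from_calibrated_scores(50, 50, 0.6, 0.2)
fpr_y, fnr_y = error_rates_from_calibrated_scores(20, 80, 0.6, 0.2)
print(f"group x: FPR={fpr_x:.2f}, FNR={fnr_x:.2f}")
print(f"group y: FPR={fpr_y:.2f}, FNR={fnr_y:.2f}")
```

Both groups are perfectly calibrated in this sense, yet group x has about three times the false positive rate of group y, so a model can pass one test and fail the other.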

What's Next?

Until now, recidivism risk models have been deployed naively, in judicial systems all across the country, and judges in those systems have been presented with such scores as if they are inherently "fair."

But now, people deploying these models - and by people I mostly mean Department of Corrections decision-makers - will have pressure to make sure the models are audited for (various forms of) racism before using them. And they can do this kind of analysis in-house with much less work. I hope they do.

Algorithms are as biased as human curators

The recent Facebook trending news kerfuffle has made one thing crystal clear: people trust algorithms too much, more than they trust people. Everyone's focused on how the curators "routinely suppressed conservative news," and they're obviously assuming that an algorithm wouldn't be like that.

That's too bad. If I had my way, people would have paid much more attention to the following lines in what I think was the breaking piece by Gizmodo, written by Michael Nunez (emphasis mine):

In interviews with Gizmodo, these former curators described grueling work conditions, humiliating treatment, and a secretive, imperious culture in which they were treated as disposable outsiders. After doing a tour in Facebook’s news trenches, almost all of them came to believe that they were there not to work, but to serve as training modules for Facebook’s algorithm.

Let's think about what that means. The curators were doing their human thing for a time, and they were fully expecting to be replaced by an algorithm. So any anti-conservative bias that they were introducing at this preliminary training phase would soon be taken over by the machine learning algorithm, to be perpetuated for eternity.

I know most of my readers already know this, but apparently it's a basic fact that hasn't reached many educated ears: algorithms are just as biased as human curators. Said another way, we should not be offended when humans are involved in a curation process, because it doesn't make that process inherently more or less biased. Like it or not, we won't understand the depth of bias of a process unless we scrutinize it explicitly with that intention in mind, and even then it would be hard to make such a thing well defined.

Todd Schneider's "medium data"

Last night I had the pleasure of going to a Meetup given by Todd Schneider, who wrote this informative and fun blogpost about analyzing taxi and Uber data.

You should read his post; among other things it will tell you how long it takes to get to the airport from any NYC neighborhood by the time of day (on weekdays). This corroborates my fear of the dreaded post-3pm flight.

His Meetup was also cool, and in particular he posted a bunch of his code on github, and explained what he'd done as well.

For example, the raw data was more than half the size of his personal computer's storage, so he used an external hard drive to hold the raw data and convert it to a SQL database on his personal computer for later use (he used PostgreSQL).

Also, in order to load various types of data into R (which he uses instead of python, but I forgive him because he's so smart about it), he reduced the granularity of the geocoded events, and worked with them via the database as weights on square blocks of NYC (I think about 10 meters by 10 meters) before turning them into graphics. So if he wanted to map "taxicab pickups", he first split the geographic area into little boxes, then counted how many pickups were in each box, then graphed that result instead. It reduced the number of rows of data by a factor larger than 10.
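The box-counting idea can be sketched in a few lines. This is not Todd's code (he did this step in the database); it's a minimal illustration that assumes the pickup coordinates have already been projected to meters, and the function name and sample points are made up.

```python
from collections import Counter

def bin_pickups(points, cell_size):
    """Count points per square grid cell.

    points: iterable of (x, y) coordinates in meters
    cell_size: side length of each square cell, in meters
    Returns a Counter mapping (column, row) -> number of points.
    """
    counts = Counter()
    for x, y in points:
        # Floor division maps every point in a cell to the same index pair.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

# Five hypothetical pickups collapse into two 10m x 10m cells.
pickups = [(1.0, 2.0), (3.5, 9.9), (12.0, 4.0), (15.0, 8.0), (15.5, 8.2)]
cells = bin_pickups(pickups, 10)
print(cells)
```

The graphics then only need one row per non-empty cell, with the count as its weight, instead of one row per pickup.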

Todd calls this "medium data" because, after some amount of work, you can do it on a personal computer. I dig it.

Todd also gave a bunch of advice for people to follow if they want to do neat data analysis that gets lots of attention (his taxicab/Uber post got a million hits from Reddit I believe). It was really useful and good advice, the most important of which was, if you're not interested in this topic, nobody else will be either.

One interesting piece of analysis Todd showed us, which I can't seem to find on his blog, was a picture of overall rides in taxis and Ubers, which seemed to indicate that Uber is taking over market share from taxis. That's not so surprising, but it actually seemed to imply that the overall number of rides hasn't changed much; it's been a zero-sum game.

The reason this is interesting is that de Blasio's contention has been that Uber is increasing traffic. But the above seems to imply that Uber doesn't increase traffic (if "the number of rides" is a good proxy for traffic); rather, it's taking business away from medallion cabs. Not a final analysis by any stretch but intriguing.

Finally, Todd more recently analyzed Citibike rides, take a look!

The SHSAT matching algorithm

My 13-year-old took the SHSAT in November, but we haven't heard the results yet. In fact we're expecting to wait two more months before we do.

What gives? Is it really that complicated to match kids to test schools?

A bit of background. In New York City, kids write down a list of their preferred public high schools that are not "SHSAT" schools. Separately, if they decide to take the SHSAT, they rank their preferences for those, which fall into a separate category and which include Stuyvesant and Bronx Science. They are promised that they will get into the first school on the list that their SHSAT score allows them to.

I often hear people say that the algorithm to figure out what SHSAT school a given kid gets into is super complicated and that's why it takes 4 months to find out the results. But yesterday at lunch, my husband and I proved that theory incorrect by coming up with a really dumb way of doing it.

  1. First, score all the tests. This is the time-consuming part of the process, but I assume it's automatically done by a machine somewhere in a huge DOE building in Brooklyn that I've heard about.
  2. Next, rank the kids according to score, highest first. Think of it as kids waiting in line at a supermarket check-out line, but in this scenario they just get their school assignment.
  3. Next, repeat the following step until all the schools are filled: take the first kid in line and give them their highest pick. Before moving on to the next kid, check to see if you just gave away the last possible slot to that particular school. If so, label that school with the score of that kid (it will be the cutoff score) and make everyone still in line erase that school from their list because it's full and no longer available.
  4. By construction, every kid gets the top school that their score warranted, so you're done.

A few notes and one caveat to this:

  1. Any kid with no schools in their list, either because they didn't score high enough for the cutoffs or because the schools all filled up before they got to the head of the line, won't get into an SHSAT school.
  2. The above algorithm would take very little time to actually run. As in, 5 minutes of computer time once the tests are scored.
  3. One caveat: I'm pretty sure they need to make sure that two kids with the same exact score and the same preference would both either get in or get out (because think of the lawsuit if not). So the actual way you'd implement the algorithm is when you ask for the next kid in line, you'd also ask for any other kid with the same score and the same top choice to step forward. Then you'd decide whether there's room for the whole group or not.
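Here's a minimal sketch of the line-at-the-checkout procedure, including the tie caveat. It is my reading of the steps above, not the DOE's actual process. In particular, what happens when a tied group doesn't fit is an assumption: here nobody in the group is admitted, the school closes, and those kids move on to their next choice. The student names, scores, and seat counts are made up.

```python
from itertools import groupby

def assign_seats(students, capacity):
    """Assign each student the first school on their list that still has room.

    students: list of (name, score, ranked_schools)
    capacity: dict mapping school -> number of seats
    Returns (assignment, lowest_admitted_score); unplaced students map to None.
    """
    remaining = dict(capacity)
    open_schools = {s for s, seats in remaining.items() if seats > 0}
    assignment = {}
    lowest_admitted = {}

    # Highest score first; students with equal scores are handled together.
    ordered = sorted(students, key=lambda s: s[1], reverse=True)
    for score, tier in groupby(ordered, key=lambda s: s[1]):
        pending = list(tier)
        while pending:
            # Group tied students by their top choice that is still open.
            groups = {}
            for name, _, prefs in pending:
                top = next((s for s in prefs if s in open_schools), None)
                if top is None:
                    assignment[name] = None  # no SHSAT school for this kid
                else:
                    groups.setdefault(top, []).append((name, score, prefs))
            pending = []
            for school, group in groups.items():
                if len(group) <= remaining[school]:
                    for name, _, _ in group:
                        assignment[name] = school
                    remaining[school] -= len(group)
                    lowest_admitted[school] = score  # becomes the cutoff
                    if remaining[school] == 0:
                        open_schools.discard(school)
                else:
                    # Assumption: a tied group that does not fit is turned
                    # away as a whole, and the school closes to everyone below.
                    open_schools.discard(school)
                    pending.extend(group)
    return assignment, lowest_admitted

students = [
    ("Ana", 600, ["Stuyvesant", "Bronx Science"]),
    ("Ben", 590, ["Stuyvesant", "Bronx Science"]),
    ("Cy", 590, ["Bronx Science"]),
    ("Dee", 580, ["Bronx Science", "Stuyvesant"]),
]
assignment, cutoffs = assign_seats(students, {"Stuyvesant": 1, "Bronx Science": 2})
print(assignment)
print(cutoffs)
```

The whole thing is a sort followed by a single pass down the line, which is why the running time is negligible compared to scoring the tests.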

So, why the long wait? I'm pretty sure it's because the other public schools, the ones where there's no SHSAT exam to get in (but there are myriad other requirements and processes involved, see e.g. page 4 of this document) don't want people to be notified of their SHSAT placement 4 months before they get their say. It would foster too much unfair competition between the systems.

Finally, I'm guessing the algorithm for matching non-SHSAT schools is actually pretty complicated, which is I think why people keep talking about a "super complex algorithm." It's just not associated with the SHSAT.

Do Charter Schools Cherrypick Students?

Yesterday I looked into quantitatively measuring the rumor I've been hearing for years, namely that charter schools cherrypick students - get rid of troublesome ones, keep well-behaved ones, and so on.

Here are two pieces of anecdotal evidence. There was a "Got To Go" list of students at one charter school in the Success Academy network. These were troublesome kids that the school was pushing out.

Also, I recently learned that Success Academy doesn't accept new kids after the fourth grade. Their reasoning is that older kids wouldn't be able to catch up with the rest of the kids, but on the other hand it also means that kids kicked out of one school will never land there. This is another form of selection.

Now that I've said my two examples I realize they both come from Success Academy. There really aren't that many of them, as you can see on this map, but they are a politically potent force in the charter school movement.

Also, to be clear, I am not against charter schools as a concept. I love the idea of experimentation, and to the extent that charter schools perform experiments that can inform how public schools run, that's interesting and worthwhile.

Anyhoo, let's get to the analysis. I got my data from this DOE website, down at the bottom where I clicked "citywide results" and grabbed the following excel file:

With that data, I built an iPython Notebook which is on github here so you can take a look, reproduce my results with the above data (I removed the first line after turning it into a CSV file), or do more.

From talking to friends of mine who run NYC schools, I learned of two proxies for difficult students. One is 'Percent Students with Disabilities' and the other is 'Percent English Language Learners' (I also learned that charter schools' DBN code starts with 84). Equipped with that information, I was able to build the following histograms:

Percent Students with Disabilities, non-Charter:

 

Percent Students with Disabilities, Charter:

Percent English Language Learners, non-Charter:

 

Percent English Language Learners, Charter. Please note that the x-axis differs from above:

 

I also computed statistics which you can look at on the iPython notebook. Finally, I put it all together with a single scatterplot:

 

The blue dots to the left and all the way down on the x-axis are mostly test schools and "screened" schools, which are actually constructed to cherrypick their students.

The main conclusion of this analysis is to say that, generally speaking, charter schools don't have as many kids with disabilities or poor language skills, and so when we compare their performance to non-charter schools, we need to somehow take this into account.
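For anyone who wants to redo the charter versus non-charter split without the notebook, here's a minimal sketch. It relies on the fact mentioned above that charter DBN codes start with 84, and on the two column names quoted above. The "DBN" column name, the sample rows, and the assumption that percentages are stored as plain numbers are mine, so check them against the real file.

```python
import csv
import io
from statistics import mean

# Made-up sample rows standing in for the DOE file.
SAMPLE_CSV = """DBN,Percent Students with Disabilities,Percent English Language Learners
01M015,25.0,10.0
02M047,20.0,14.0
84K125,12.0,4.0
84X345,16.0,6.0
"""

def split_by_charter(rows, column):
    """Return (charter_values, non_charter_values) for one numeric column."""
    charter, other = [], []
    for row in rows:
        # Charter schools are identified by a DBN beginning with "84".
        target = charter if row["DBN"].startswith("84") else other
        target.append(float(row[column]))
    return charter, other

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
charter, other = split_by_charter(rows, "Percent Students with Disabilities")
print(f"charter mean={mean(charter)}, non-charter mean={mean(other)}")
```

The two lists returned here are what you'd feed into a histogram or a summary statistic for each population.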

A final caveat: we can see just by looking at the above scatter plot that there are plenty of charter schools that are well inside the middle of the blue cloud. So this is not an indictment of any specific charter school, but rather a statistical statement about wanting to compare apples to apples.

Update: I've now added t-tests to test the hypothesis that this data comes from the same distribution. The answer is no.

Those very small numbers are the p-values which are much smaller than 0.05. Other t-tests give similar results (but go ahead and try them yourself):

[Screenshot: output of the additional t-tests]
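If you do want to try a test yourself, here's a minimal sketch of a two-sample t statistic. It uses Welch's variant, which doesn't assume equal variances and may not be the exact test in the notebook. The function name and the sample percentages are made up for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical "percent students with disabilities" values.
charter = [12.0, 14.0, 16.0, 13.0, 15.0]
non_charter = [20.0, 22.0, 25.0, 21.0, 24.0]

t_stat, dof = welch_t(charter, non_charter)
print(f"t={t_stat:.2f}, df={dof:.1f}")
```

Turning the statistic into a p-value needs the t distribution, which isn't in the standard library; if scipy is installed, scipy.stats.ttest_ind(a, b, equal_var=False) runs the same test and reports the p-value directly.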