ProPublica's COMPAS Data Revisited
Published 2019-06-11, Version 1
In this paper I re-examine the COMPAS recidivism score and criminal history data collected by ProPublica in 2016, which has fueled intense debate and research in the nascent field of "algorithmic fairness" or "fair machine learning" over the past three years. ProPublica's COMPAS data is used in an ever-increasing number of studies to test various definitions and methodologies of algorithmic fairness. This paper takes a closer look at the actual datasets put together by ProPublica. In doing so, I find that ProPublica made an important data processing mistake when it created some of the key datasets most often used by other researchers, in particular the datasets built to study the likelihood of recidivism within two years of the original COMPAS screening date. As I show in this paper, ProPublica implemented the two-year sample cutoff rule incorrectly for recidivists in these datasets, whereas it implemented an appropriate two-year sample cutoff rule for non-recidivists. As a result, ProPublica incorrectly kept a disproportionate share of recidivists. This data processing mistake leads to biased two-year recidivism datasets with artificially high recidivism rates, and it also affects the positive and negative predictive values. On the other hand, it does not impact some of the key statistical measures highlighted by ProPublica and other researchers, such as the false positive and false negative rates, or the overall accuracy.
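
The asymmetry described above, where predictive values shift while error rates stay put, can be illustrated with a small arithmetic sketch. The counts below are hypothetical and are not taken from the COMPAS data. The names `Confusion` and `oversample_recidivists` are invented for this illustration, and the sketch assumes that the recidivists kept in error have the same mix of high and low scores as the correctly kept ones. It does not attempt to reproduce the accuracy finding, which depends on the actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # recidivists scored high risk
    fn: int  # recidivists scored low risk
    fp: int  # non-recidivists scored high risk
    tn: int  # non-recidivists scored low risk

    def rates(self) -> dict:
        total = self.tp + self.fn + self.fp + self.tn
        return {
            "recidivism_rate": (self.tp + self.fn) / total,
            "fpr": self.fp / (self.fp + self.tn),  # uses non-recidivists only
            "fnr": self.fn / (self.fn + self.tp),  # uses recidivists only
            "ppv": self.tp / (self.tp + self.fp),  # mixes both groups
            "npv": self.tn / (self.tn + self.fn),  # mixes both groups
        }


def oversample_recidivists(c: Confusion, factor: int) -> Confusion:
    """Keep `factor` times as many recidivists, with the same score mix.

    Assumption for illustration: the extra recidivists are scored like
    the original ones, so tp and fn grow by the same factor.
    """
    return Confusion(tp=c.tp * factor, fn=c.fn * factor, fp=c.fp, tn=c.tn)


# Hypothetical counts, NOT taken from the COMPAS data.
correct = Confusion(tp=300, fn=200, fp=250, tn=750)
biased = oversample_recidivists(correct, factor=2)

correct_rates = correct.rates()
biased_rates = biased.rates()
for name in correct_rates:
    print(f"{name:16s} correct={correct_rates[name]:.3f} "
          f"biased={biased_rates[name]:.3f}")
```

Under these assumptions the false positive and false negative rates are unchanged, because each is computed within a single outcome group. The recidivism rate and positive predictive value rise and the negative predictive value falls, because those measures combine both groups.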