Introduction

This is a problem set from Chapter 2 of the online textbook FAIRNESS AND MACHINE LEARNING by Solon Barocas, Moritz Hardt, Arvind Narayanan.

Risk assessment is an important component of the criminal justice system. In the United States, judges set bail and decide pre-trial detention based on their assessment of the risk that a released defendant would fail to appear at trial or cause harm to the public. While actuarial risk assessment is not new in this domain, there is increasing support for the use of learned risk scores to guide human judges in their decisions. Proponents argue that machine learning could lead to greater efficiency and less biased decisions compared with human judgment. Critical voices raise the concern that such scores can perpetuate inequalities found in historical data, and systematically harm historically disadvantaged groups.

In this case study, we’ll begin to scratch the surface of the complex criminal justice domain. Our starting point is ProPublica’s investigation of a proprietary risk score, the COMPAS score. These scores are intended to assess the risk that a defendant will re-offend, a task often called recidivism prediction. Within the academic community, the ProPublica article drew much attention to the trade-off between separation and sufficiency.

We’ll use data obtained and released by ProPublica as a result of a public records request in Broward County, Florida, concerning the COMPAS recidivism prediction system. The data is available here. Following ProPublica’s analysis, we’ll filter out rows where days_b_screening_arrest is over 30 or under -30, leaving us with 6,172 rows.
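A minimal sketch of this filtering step, using only the Python standard library. It assumes the data has been saved locally as a CSV (the file name compas-scores-two-years.csv below is an assumption) and that missing days_b_screening_arrest values appear as empty strings or "NA" and should be dropped; the helper names are hypothetical.

```python
import csv

def filter_screening_window(rows, window=30):
    """Keep rows whose days_b_screening_arrest lies in [-window, window].

    Rows with a missing value (empty string or "NA") are dropped; this
    handling of missing data is an assumption, not part of the original text.
    """
    kept = []
    for row in rows:
        raw = (row.get("days_b_screening_arrest") or "").strip()
        if raw in ("", "NA"):
            continue
        if -window <= float(raw) <= window:
            kept.append(row)
    return kept

def load_filtered(path="compas-scores-two-years.csv"):
    """Read the CSV (hypothetical local path) and apply the window filter."""
    with open(path, newline="") as f:
        return filter_screening_window(csv.DictReader(f))
```

load_filtered returns a list of dicts whose values are strings, so later steps convert fields such as decile_score with int().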

Calibration/sufficiency

Let’s plot the fraction of defendants recidivating within two years (two_year_recid == 1) as a function of risk score (decile_score), for Black defendants (race == “African-American”) and White defendants (race == “Caucasian”).
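The plotted quantity is a per-score recidivism rate for each group. A standard-library sketch, assuming rows is a list of string-valued dicts (as csv.DictReader yields) over the filtered data; recid_rate_by_score is a hypothetical helper name.

```python
from collections import defaultdict

def recid_rate_by_score(rows, race):
    """Map each decile_score to the share of `race` defendants with two_year_recid == 1."""
    totals = defaultdict(int)
    recidivists = defaultdict(int)
    for row in rows:
        if row["race"] != race:
            continue
        score = int(row["decile_score"])
        totals[score] += 1
        recidivists[score] += int(row["two_year_recid"])
    # Only scores that actually occur for this group appear in the result.
    return {score: recidivists[score] / totals[score] for score in sorted(totals)}
```

Calling it once with "African-American" and once with "Caucasian" gives two {score: rate} dicts, each of which can be drawn as one line of the plot.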

Based on the plot, the risk score does seem to roughly satisfy sufficiency across racial groups: for most scores, Black and White defendants with the same score recidivate at similar rates. However, the score seems to be biased against White defendants in the 9 to 10 range. White defendants with a score of 9 or 10 re-offend at a lower rate than White defendants with a score of 8, so for them the higher score overstates their risk rather than reflecting it.

Error rates/separation

Next, let’s plot the distribution of scores received by the positive class (recidivists) and by the negative class (non-recidivists), separately for Black and White defendants.
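Each of these distributions is the share of each decile score within one race and one outcome class. A minimal sketch, again assuming string-valued row dicts from the filtered CSV; score_distribution is a hypothetical name.

```python
from collections import Counter

def score_distribution(rows, race, recidivist):
    """Share of each decile_score (1-10) within one race and one outcome class."""
    target = 1 if recidivist else 0
    counts = Counter(
        int(row["decile_score"])
        for row in rows
        if row["race"] == race and int(row["two_year_recid"]) == target
    )
    n = sum(counts.values())
    # Every score 1..10 is present so the four histograms share an x-axis.
    return {score: counts[score] / n if n else 0.0 for score in range(1, 11)}
```

Normalizing within each class, rather than plotting raw counts, keeps the Black and White histograms comparable despite the different group sizes.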

Based on the above plots, COMPAS does not satisfy separation: conditional on the actual outcome, the distribution of risk scores still differs by race. Among recidivists (actual positives), White defendants are more likely to be labeled low risk (a false negative). Among non-recidivists (actual negatives), Black defendants are more likely to be labeled high risk (a false positive).

Now let’s calculate the exact Positive Predictive Value (PPV), False Positive Rate (FPR), and False Negative Rate (FNR) for Black and White defendants using a risk threshold of 4, i.e., defendants with decile_score >= 4 are classified as high risk (predicted recidivists). The counts are:

race	prediction	outcome	count
African-American	Non-recidivist	Recidivist	315
African-American	Non-recidivist	Non-recidivist	694
African-American	Recidivist	Recidivist	1346
African-American	Recidivist	Non-recidivist	820
Caucasian	Non-recidivist	Recidivist	310
Caucasian	Non-recidivist	Non-recidivist	854
Caucasian	Recidivist	Recidivist	512
Caucasian	Recidivist	Non-recidivist	427

PPV

Positive Predictive Value (recidivists given >=4 score) for Black defendants: 1346/(820+1346)=0.62
Positive Predictive Value (recidivists given >=4 score) for White defendants: 512/(512+427)=0.55

FPR

False Positive Rate (false positives given all actual negatives) for Black defendants: 820/(820+694)=0.54
False Positive Rate (false positives given all actual negatives) for White defendants: 427/(427+854)=0.33

FNR

False Negative Rate (false negatives given all actual positives) for Black defendants: 315/(315+1346)=0.19
False Negative Rate (false negatives given all actual positives) for White defendants: 310/(310+512)=0.38
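All three rates follow directly from the four counts per group in the threshold-4 table. A small Python check recomputes the figures above; the function name rates is hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Return (PPV, FPR, FNR) from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # recidivists among defendants flagged high risk
    fpr = fp / (fp + tn)  # non-recidivists wrongly flagged high risk
    fnr = fn / (fn + tp)  # recidivists wrongly labeled low risk
    return ppv, fpr, fnr

# Counts from the threshold-4 table, ordered (TP, FP, FN, TN).
counts = {
    "African-American": (1346, 820, 315, 694),
    "Caucasian": (512, 427, 310, 854),
}
for race, c in counts.items():
    ppv, fpr, fnr = rates(*c)
    print(f"{race}: PPV={ppv:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
# African-American: PPV=0.62 FPR=0.54 FNR=0.19
# Caucasian: PPV=0.55 FPR=0.33 FNR=0.38
```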

Let’s now find two thresholds (one for Black defendants, one for White defendants) such that FPR and FNR are roughly equal for the two groups.
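A brute-force sketch of this search: try every pair of thresholds on the 1 to 10 decile scale, compute each group’s rates, and sort by the relative FPR gap, expressed as a percentage of the Black group’s rate. Judging from the numbers in the table, this appears to be how its difference columns were computed and ordered, but that is an inference. The function names and the (scores, labels) input format are hypothetical.

```python
import math
from itertools import product

def group_rates(scores, labels, threshold):
    """PPV, FPR and FNR for one group, flagging score >= threshold as high risk."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    nan = float("nan")
    ppv = tp / (tp + fp) if tp + fp else nan
    fpr = fp / (fp + tn) if fp + tn else nan
    fnr = fn / (fn + tp) if fn + tp else nan
    return ppv, fpr, fnr

def rel_gap(b, w):
    """|w - b| as a percentage of b; infinite when b is 0 or undefined."""
    if math.isnan(b) or math.isnan(w) or b == 0:
        return math.inf
    return abs(w - b) / b * 100

def search_thresholds(black, white, top=10):
    """Rank every (Black, White) threshold pair in 1..10 by relative FPR gap.

    `black` and `white` are (scores, labels) tuples of equal-length lists.
    """
    results = []
    for tb, tw in product(range(1, 11), repeat=2):
        ppv_b, fpr_b, fnr_b = group_rates(*black, tb)
        ppv_w, fpr_w, fnr_w = group_rates(*white, tw)
        results.append({
            "threshold_black": tb, "threshold_white": tw,
            "PPV_black": ppv_b, "PPV_white": ppv_w,
            "FPR_black": fpr_b, "FPR_white": fpr_w,
            "FNR_black": fnr_b, "FNR_white": fnr_w,
            "FPR_diff_percent": rel_gap(fpr_b, fpr_w),
            "FNR_diff_percent": rel_gap(fnr_b, fnr_w),
        })
    results.sort(key=lambda r: r["FPR_diff_percent"])  # stable sort
    return results[:top]
```

Here black and white would hold integer decile_score values and 0/1 two_year_recid labels for each group. Sorting on the FPR gap alone can rank a pair with a large FNR gap highly (the fourth row of the table is an example), so a combined key such as the larger of the two gaps may be preferable.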

Top 10 Thresholds for most similar FPR and FNR
threshold_black	threshold_white	PPV_black	PPV_white	FPR_black	FPR_white	FNR_black	FNR_white	FPR_diff_percent	FNR_diff_percent
9	7	0.770	0.685	0.083	0.083	0.748	0.720	0.000	3.743
8	6	0.750	0.651	0.139	0.135	0.618	0.607	2.878	1.780
7	5	0.710	0.595	0.228	0.220	0.492	0.496	3.509	0.813
3	2	0.597	0.463	0.656	0.628	0.114	0.156	4.268	36.842
6	4	0.684	0.545	0.314	0.333	0.380	0.377	6.051	0.789
5	3	0.650	0.505	0.423	0.455	0.285	0.277	7.565	2.807
4	2	0.621	0.463	0.542	0.628	0.190	0.156	15.867	17.895
4	3	0.621	0.505	0.542	0.455	0.190	0.277	16.052	45.789
10	9	0.837	0.709	0.024	0.029	0.886	0.891	20.833	0.564
5	4	0.650	0.545	0.423	0.333	0.285	0.377	21.277	32.281

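The above table shows the ten threshold pairs with the smallest relative FPR difference between the two groups; both difference columns are expressed as a percentage of the Black group’s rate. The best combinations are the top three rows, where both the FPR and FNR differences stay below 4%. In each of them, the threshold for White defendants has to be set two points lower than for Black defendants, and White PPV is substantially lower than Black PPV (by about 0.09 to 0.12). Equalizing error rates thus comes at the cost of unequal PPV, reflecting the trade-off between separation and sufficiency mentioned in the introduction.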
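The PPV gap is not specific to these thresholds. Writing TP, FP, FN, TN for a group’s confusion-matrix counts and p = (TP + FN) / (TP + FP + FN + TN) for its base rate (share of recidivists), the definitions of PPV, FPR and FNR used above combine into an exact identity:

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\left(1-\mathrm{FNR}\right),
\qquad\text{because}\qquad
\frac{TP+FN}{FP+TN}\cdot\frac{FP}{TP}\cdot\frac{TP}{TP+FN} \;=\; \frac{FP}{FP+TN}.
```

For two groups with equal FPR and equal FNR but different base rates, the factor (1 - PPV)/PPV must differ, so PPV must differ too, and the group with the higher base rate ends up with the higher PPV. In the threshold-4 table, the base rate is 1661/3175 ≈ 0.52 for Black defendants and 822/2103 ≈ 0.39 for White defendants, consistent with the lower White PPV in every row above. As a sanity check, plugging the Black counts into the identity gives (1661/1514)·(820/1346)·(1346/1661) = 820/1514 ≈ 0.54, the FPR computed earlier.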

Risk factors and interventions

Let’s look at the recidivism rate of defendants aged 25 or younger and of defendants aged 50 or older. Note the stark difference between the two: younger defendants recidivate at about 1.8 times the rate of older defendants (0.55 vs. 0.30).

age_label	recidivism_rate
<=25	0.5514706
>=50	0.3046964
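These two rates are simple group means of two_year_recid. A minimal sketch, assuming the filtered rows carry an age column (present in ProPublica’s data, but not named in this write-up) as string-valued dicts; recid_rate is a hypothetical name.

```python
def recid_rate(rows, in_group):
    """Share of defendants whose age satisfies `in_group` with two_year_recid == 1."""
    group = [row for row in rows if in_group(int(row["age"]))]
    if not group:
        raise ValueError("no defendants in this age group")
    return sum(int(row["two_year_recid"]) for row in group) / len(group)
```

The two rows of the table correspond to recid_rate(rows, lambda age: age <= 25) and recid_rate(rows, lambda age: age >= 50). Passing a predicate rather than fixed cut-offs makes it easy to try other age bands.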

Discussion

Suppose we are interested in taking a data-driven approach to changing the criminal justice system. Under a theory of incarceration as incapacitation (preventing future crimes by removing individuals from society), we may want to keep more younger defendants locked up because we expect them to re-offend at a higher rate.

Under a rehabilitative approach to justice, however, we seek interventions that minimize a defendant’s risk of recidivism. We may believe that younger people are more malleable than older ones, and therefore more responsive to interventions. As a result, we may want to give younger defendants more resources and help so that they can reshape their lives.

Under a retributive theory of justice, punishment is based in part on culpability, or blameworthiness; this in turn depends on how much control the defendant had over their actions. Under such a theory, we may believe that younger people are less mature and therefore less capable of controlling their actions. As a result, we should not punish them as harshly as older adults.

Fairness and Machine Learning - COMPAS Criminal Justice Case Study

Eric Fan

11/14/2021