Chapter 37 Investigating Disparate Impact
In this chapter, we learn about how to detect evidence of disparate impact by using tests like the 4/5ths Rule (80% Rule), \(Z_{D}\) test, \(Z_{IR}\) test, \(\chi^2\) test of independence (i.e., chi-square test of independence), and Fisher exact test.
37.1 Conceptual Overview
Disparate impact, which is also referred to as adverse impact, refers to situations in which organizations “use seemingly neutral criteria that have a discriminatory effect on a protected group” (Bauer et al. 2025). There are various tests – both statistical and nonstatistical – that may be used to evaluate whether there is evidence of disparate impact, such as the 4/5ths Rule (80% Rule), \(Z_{Difference}\) test (i.e., \(Z_{D}\) test), \(Z_{Impact Ratio}\) test (i.e., \(Z_{IR}\) test), \(\chi^2\) test of independence, and Fisher exact test. In the United States, it is legally advisable to begin with testing the 4/5ths Rule followed by a statistical test like the \(Z_{IR}\) test, \(\chi^2\) test of independence, or Fisher exact test.
If you would like to learn more about how to evaluate disparate impact along with empirical evidence regarding under which conditions various tests perform best, I recommend checking out the following resources:
Collins, M. W., & Morris, S. B. (2008). Testing for adverse impact when sample size is small. Journal of Applied Psychology, 93(2), 463-471.
Dunleavy, E., Morris, S., & Howard, E. (2015). Measuring adverse impact in employee selection decisions. In C. Hanvey & K. Sady (Eds.), Practitioner’s guide to legal issues in organizations (pp. 1-26). Switzerland: Springer, Cham.
Finch, D. M., Edwards, B. D., & Wallace, J. C. (2009). Multistage selection strategies: Simulating the effects on adverse impact and expected performance for various predictor combinations. Journal of Applied Psychology, 94(2), 318-340.
Morris, S. B. (2001). Sample size required for adverse impact analysis. Applied HRM Research, 6(1-2), 13-32.
Morris, S. B., & Lobsenz, R. E. (2000). Significance tests and confidence intervals for the adverse impact ratio. Personnel Psychology, 53(1), 89-111.
Office of Federal Contract Compliance Programs. (1993). Federal contract compliance manual (SUDOC L 36.8: C 76/993). Washington, DC: U. S. Department of Labor, Employment Standards Administration.
Roth, P. L., Bobko, P., & Switzer, F. S. III. (2006). Modeling the behavior of the 4/5ths rule for determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91(3), 507-522.
37.2 Tutorial
This chapter’s tutorial demonstrates how to compute an (adverse) impact ratio to test the 4/5ths Rule (80% Rule) and how to estimate the \(Z_{D}\) test, \(Z_{IR}\) test, \(\chi^2\) test of independence, and Fisher exact test.
37.2.1 Video Tutorial
As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorials below.
Link to video tutorial: https://youtu.be/1e-ZW-AC_6o
Link to video tutorial: https://youtu.be/oddQdDH73oQ
37.2.2 Functions & Packages Introduced
Function | Package |
---|---|
xtabs | base R |
print | base R |
addmargins | base R |
proportions | base R |
chisq.test | base R |
phi | psych |
fisher.test | base R |
sum | base R |
prop.table | base R |
sqrt | base R |
abs | base R |
pnorm | base R |
log | base R |
exp | base R |
37.2.3 Initial Steps
If you haven’t already, save the file called “DisparateImpact.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., `"H:/RWorkshop"`). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.
Next, using the `setwd` function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.
Next, read in the .csv data file called “DisparateImpact.csv” using your choice of read function. In this example, I use the `read_csv` function from the `readr` package (Wickham, Hester, and Bryan 2024). If you choose to use the `read_csv` function, be sure that you have installed and accessed the `readr` package using the `install.packages` and `library` functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
df <- read_csv("DisparateImpact.csv")
## Rows: 274 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Cognitive_Test, Gender
## dbl (1): ID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "ID" "Cognitive_Test" "Gender"
## spc_tbl_ [274 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:274] 1 2 3 4 5 6 7 8 9 10 ...
## $ Cognitive_Test: chr [1:274] "Pass" "Pass" "Pass" "Pass" ...
## $ Gender : chr [1:274] "Man" "Man" "Man" "Man" ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Cognitive_Test = col_character(),
## .. Gender = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## # A tibble: 6 × 3
## ID Cognitive_Test Gender
## <dbl> <chr> <chr>
## 1 1 Pass Man
## 2 2 Pass Man
## 3 3 Pass Man
## 4 4 Pass Man
## 5 5 Pass Man
## 6 6 Pass Man
There are 3 variables and 274 cases (i.e., applicants) in the `df` data frame: `ID`, `Cognitive_Test`, and `Gender`. `ID` is the unique applicant identifier variable. Imagine that these data were collected as part of a selection process, and this data frame contains information about the applicants. The `Cognitive_Test` variable is a categorical (nominal) variable that indicates whether an applicant passed (`Pass`) or failed (`Fail`) a cognitive ability selection test. `Gender` is also a categorical (nominal) variable that provides each applicant’s self-reported gender identity, which in this sample happens to include `Man` and `Woman`.
37.2.4 4/5ths Rule
Evaluating the 4/5ths Rule involves simple arithmetic. As an initial step, we will create a cross-tabulation (i.e., cross-tabs, contingency) table using the `xtabs` function from base R. As the first argument in the `xtabs` function, type the tilde (`~`) operator followed by the protected class variable (`Gender`), followed by the addition (`+`) operator, followed by the selection tool variable containing the pass/fail information (`Cognitive_Test`). As the second argument, type `data=` followed by the name of the data frame object (`df`) in which the two variables mentioned in the first argument “live.” Next, to the left of the `xtabs` function, type any name you’d like for the new cross-tabs table object we’re creating, followed by the left arrow (`<-`) operator; I call this cross-tabs table object `observed` because it will contain our observed frequency/count data for the number of people from each gender who passed and failed this cognitive ability test.
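The steps just described can be sketched as follows. So that the sketch runs without the chapter’s data file, it rebuilds a stand-in data frame from the cell counts reported in this chapter’s output (90, 70, 72, and 42); that stand-in is an assumption for illustration, and with the real data you would use the `df` object read in earlier.

```r
# Stand-in for the chapter's data (assumption): one row per applicant,
# rebuilt from the cell counts reported in this chapter's output
df <- data.frame(
  Gender = rep(c("Man", "Man", "Woman", "Woman"), times = c(90, 70, 72, 42)),
  Cognitive_Test = rep(c("Fail", "Pass", "Fail", "Pass"), times = c(90, 70, 72, 42))
)

# Create cross-tabs table with Gender as rows and Cognitive_Test as columns
observed <- xtabs(~ Gender + Cognitive_Test, data = df)

# Print the cross-tabs table
print(observed)
```

Because `xtabs` orders categories alphabetically, `Fail` lands in column 1 and `Pass` in column 2, which matters for the bracket notation used later.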
Let’s take a look at the table object we created called `observed` by entering its name as the sole parenthetical argument in the `print` function from base R.
## Cognitive_Test
## Gender Fail Pass
## Man 90 70
## Woman 72 42
As you can see, the `observed` object contains the raw counts/frequencies of those men and women who passed and failed the cognitive ability test.
Now we’re ready to compute the selection ratios (i.e., selection rates, pass rates) for the two groups we wish to compare. In this example, we will focus on the protected class variable `Gender` and the two `Gender` categories available in these data: `Man` and `Woman`. The approach we’ll use is not the most efficient or elegant way to compute selection ratios, but it is perhaps more transparent than other approaches – and thus good for learning.
Let’s begin by calculating the total number of men who participated in this cognitive ability test. To do so, we will use matrix/bracket notation to “pull” specific values from our `observed` table object that correspond to the number of men who passed and the number of men who failed; if we add those two values together, we will get the total sample size for men who participated in this cognitive ability test. To illustrate how matrix/bracket notation works, let’s practice by just pulling the number of men who passed. Simply type the name of the table object (`observed`), followed by brackets (`[ ]`). Within the brackets, a number that comes before the comma represents the row number in the table or matrix object, and the number that comes after the comma represents the column number. To pull the number of men who passed the test, we would reference the cell from our table that corresponds to row number 1 and column number 2 (as shown below).
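As a minimal sketch of bracket notation, the block below rebuilds the `observed` table directly from the counts reported in this chapter’s output (an assumption made so the example is self-contained) and pulls values by position and, as an alternative not used elsewhere in this chapter, by name.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Row 1 (Man), column 2 (Pass): number of men who passed
observed[1, 2]

# The same cell referenced by row and column names
observed["Man", "Pass"]

# Leaving an index blank returns the whole row or column
observed[1, ]   # all counts for men
observed[, 2]   # all "Pass" counts
```

Referencing cells by name is a little more verbose but protects against pulling the wrong cell if the row or column order ever changes.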
## [1] 70
Now we’re ready to pull the pass and fail counts for men and assign them to object names of our choosing. For clarity, I’m labeling these objects as `pass_men` and `fail_men`, and we use the left arrow (`<-`) operator to assign the values to these new objects.
# Create pass and fail objects for men containing the raw frequencies/counts
pass_men <- observed[1,2]
fail_men <- observed[1,1]
If you’d like, you can print the `pass_men` and `fail_men` objects to verify that the correct table values were assigned to each object.
By adding our `pass_men` and `fail_men` values together, we can compute the total number of men who participated in this cognitive ability test. Let’s call this new object `N_men`.
# Create an object that contains the total number of men who participated in the test
N_men <- pass_men + fail_men
Let’s now proceed with doing the same for the number of women who passed and failed, as well as the total number of women.
# Create object containing number of women who passed, number of women who failed,
# and the total number of women who participated in the test
pass_women <- observed[2,2]
fail_women <- observed[2,1]
N_women <- pass_women + fail_women
We now have the ingredients necessary to compute the selection ratios for men and women. To calculate the selection ratio for men, simply divide the number of men who passed (`pass_men`) by the total number of men (`N_men`). Using the left arrow (`<-`) operator, we will assign this quotient to an object that I’m arbitrarily calling `SR_men` (for “selection ratio for men”).
Let’s do the same to create an object called `SR_women` that contains the selection ratio for women.
As the final computational step, we’ll create an object called `IR` containing the impact ratio value. Simply divide our `SR_women` object by the `SR_men` object, and assign that quotient to an object called `IR` using the left arrow (`<-`) operator. In most cases, it is customary to set the group with the lower selection ratio as the numerator and the group with the higher selection ratio as the denominator. There can be exceptions, though, such as when a law is directional. For example, the Age Discrimination in Employment Act of 1967 stipulates at the federal level that workers 40 years of age and older are protected; in this particular case, we would typically set the selection ratio for workers 40 and older as the numerator and the selection ratio for workers younger than 40 as the denominator.
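The selection ratio and impact ratio arithmetic described above can be sketched as follows. To keep the block self-contained, the pass and fail counts are typed in directly from the cross-tabs output reported earlier in this chapter rather than pulled from the `observed` object; the `meets_four_fifths` name is a hypothetical helper introduced only for illustration.

```r
# Pass/fail counts taken from the cross-tabs output reported in this chapter
pass_men   <- 70
fail_men   <- 90
pass_women <- 42
fail_women <- 72

# Total number of applicants in each group
N_men   <- pass_men + fail_men
N_women <- pass_women + fail_women

# Selection ratios (pass rates) for each group
SR_men   <- pass_men / N_men
SR_women <- pass_women / N_women

# Impact ratio: lower selection ratio (women) over higher selection ratio (men)
IR <- SR_women / SR_men
print(IR)

# Hypothetical helper flag: TRUE when the impact ratio meets the 4/5ths (.80) threshold
meets_four_fifths <- IR >= 0.80
print(meets_four_fifths)
```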
To view the `IR` object, use the `print` function.
## [1] 0.8421053
Because the impact ratio (IR) value is .84 and thus greater than .80 (80%, 4/5ths), we would conclude based on this test that there is no evidence of disparate impact on this cognitive ability test when comparing the selection ratios of men and women. If the impact ratio had been less than .80, then we would have concluded that, based on the 4/5ths Rule, there was evidence of disparate impact.
It is sometimes recommended that we apply the “flip-flop rule”, which is essentially a sensitivity test for the 4/5ths Rule test. The resulting value is often called the impact ratio adjusted (`IR_adj`). To compute the impact ratio adjusted, we re-apply the impact ratio formula; however, this time we add in a hypothetical woman who passed and subtract a hypothetical man who passed.
# Apply "flip-flop rule" (impact ratio adjusted)
IR_adj <- ((pass_women + 1) / N_women) / ((pass_men - 1) / N_men)
print(IR_adj)
## [1] 0.8746504
If the impact ratio adjusted (`IR_adj`) value is less than 1.0, then the original interpretation of the impact ratio stands; in this example, the impact ratio adjusted value is less than 1.0, and thus we continue on with our original interpretation that there is no violation of the 4/5ths Rule and thus no evidence of disparate impact.
Finally, it’s important to note that the impact ratio associated with the 4/5ths Rule is an effect size and, thus, can be compared across samples and tests. With that being said, on its own, the 4/5ths Rule doesn’t produce a test of statistical significance. For a test of adverse impact that yields a statistical significance estimate, we should turn to the \(\chi^2\) test of independence, Fisher exact test, \(Z_{D}\) test, or \(Z_{IR}\) test.
37.2.4.1 Optional: Alternative Approach to Computing 4/5ths Rule
If you’d like to practice your R skills and learn some additional functions, you’re welcome to apply this mathematically and operationally equivalent approach to the 4/5ths Rule.
As a side note, the `addmargins` function from base R can be used to automatically compute the row and column marginals for a table object. This can be a handy function for simplifying some operations, and matrix/bracket notation can still be used on a new table object that is created using the `addmargins` function.
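A minimal sketch of `addmargins` follows. The table is rebuilt from the counts reported in this chapter so that the block runs on its own, and `observed_margins` is a hypothetical object name chosen for illustration.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Add row and column totals ("Sum") to the table
observed_margins <- addmargins(observed)
print(observed_margins)

# Bracket notation still works: total number of men is row 1, column 3
observed_margins[1, 3]
```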
## Cognitive_Test
## Gender Fail Pass Sum
## Man 90 70 160
## Woman 72 42 114
## Sum 162 112 274
Now, check out the following steps to see an alternative approach to computing an impact ratio. Note that the `proportions` function from base R is introduced and that we’re referencing the same table object (`observed`) as above.
# Convert table object values to proportions within each Gender (rows)
prop_observed <- proportions(observed, "Gender")
# Extract selection ratio for men (row 1, column 2)
SR_men <- prop_observed[1,2]
# Extract selection ratio for women (row 2, column 2)
SR_women <- prop_observed[2,2]
# Compute impact ratio (IR)
IR <- SR_women / SR_men
# Print impact ratio (IR)
print(IR)
## [1] 0.8421053
37.2.5 Chi-Square (\(\chi^2\)) Test of Independence
The chi-square (\(\chi^2\)) test of independence is a statistical test that can be applied to evaluate whether there is evidence of disparate impact. It’s relatively simple to compute. For background information on the \(\chi^2\) test of independence, please refer to this chapter.
Using the `chisq.test` function from base R, as the first argument, type the name of the table object containing the raw counts/frequencies (`observed`). As the second argument, enter the argument `correct=FALSE`, which stipulates that we wish for our \(\chi^2\) test to be computed the traditional way without a continuity correction.
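The call described above can be sketched as follows, with the `observed` table rebuilt from this chapter’s reported counts so the block is self-contained; `chisq_result` is a hypothetical object name used only to make the pieces of the output easy to reference.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Compute chi-square test of independence (without continuity correction)
chisq_result <- chisq.test(observed, correct = FALSE)
print(chisq_result)

# The test statistic and p-value can also be pulled out individually
chisq_result$statistic
chisq_result$p.value
```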
##
## Pearson's Chi-squared test
##
## data: observed
## X-squared = 1.3144, df = 1, p-value = 0.2516
In the output, we should focus our attention on the p-value and whether it is less than the conventional two-tailed alpha level of .05. In this case, the p-value is greater than .05, so we would conclude that there appears to be no statistical association between whether someone passes or fails the cognitive ability test and whether they identify as a man or woman. In other words, gender does not appear to have an effect on whether someone passes or fails this particular test, and thus there is no evidence of disparate impact based on this statistical test. This corroborates what we found when we tested the 4/5ths Rule above. If the p-value had been less than .05, then we would have concluded there is statistical evidence of disparate impact.
For fun, we can append `$observed` and `$expected` to our function to see the observed and expected counts/frequencies that were used to compute the \(\chi^2\) test of independence behind the scenes.
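A short sketch of pulling the observed and expected counts follows. The table is rebuilt from the reported counts so the block runs on its own, and the last lines show, as an added illustration, how an expected count arises from the row and column totals.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Observed counts used by the chi-square test
chisq.test(observed, correct = FALSE)$observed

# Expected counts under independence
expected <- chisq.test(observed, correct = FALSE)$expected
print(expected)

# Each expected count equals (row total * column total) / grand total
expected_man_fail <- sum(observed[1, ]) * sum(observed[, 1]) / sum(observed)
print(expected_man_fail)
```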
## Cognitive_Test
## Gender Fail Pass
## Man 90 70
## Woman 72 42
## Cognitive_Test
## Gender Fail Pass
## Man 94.59854 65.40146
## Woman 67.40146 46.59854
As an aside, to apply the Yates continuity correction, we would simply flip `correct=FALSE` to `correct=TRUE`. This continuity correction is available when we are evaluating data from a 2x2 table, which is the case in this example. There is a bit of a debate regarding whether to apply this continuity correction. Briefly, this continuity correction was introduced to account for the fact that in the specific case of 2x2 contingency tables (like ours), \(\chi^2\) values tend to be upwardly biased. Others, however, counter that this test is too strict. For a more conservative test, apply the continuity correction.
# Compute chi-square test of independence (with Yates continuity correction)
chisq.test(observed, correct=TRUE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: observed
## X-squared = 1.0441, df = 1, p-value = 0.3069
37.2.5.1 Optional: Compute Phi (\(\phi\)) Coefficient (Effect Size) for Chi-Square Test of Independence
The p-value we computed for the 2x2 chi-square (\(\chi^2\)) test of independence is a test of statistical significance, which informs whether we should treat the association between two categorical variables as statistically significant. The p-value, however, does not give us any indication of practical significance or effect size, where an effect size is a standardized metric that can be compared across tests and samples. As an indicator of effect size, the phi (\(\phi\)) coefficient can be computed for a 2x2 \(\chi^2\) test of independence, where both variables have two levels (i.e., categories). Important note: Typically, we would only compute the phi (\(\phi\)) coefficient in the event we observed a statistically significant \(\chi^2\) test of independence. As shown in the previous section, we did not find evidence that the \(\chi^2\) test of independence was statistically significant, so we would not usually go ahead and compute an effect size. That being said, we will compute the phi (\(\phi\)) coefficient just for demonstration purposes.
To compute the phi (\(\phi\)) coefficient, we will use the `phi` function from the `psych` package (Revelle 2023), which means we first need to install (if not already installed) the `psych` package using the `install.packages` function from base R and then access the package using the `library` function from base R.
Now we’re ready to deploy the `phi` function.
## [1] 0.06926145
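As a cross-check that does not depend on the `psych` package, the sketch below computes the phi (\(\phi\)) coefficient by hand using the standard formula for a 2x2 table, and also recovers its magnitude from the \(\chi^2\) statistic. This is an illustrative hand computation under the counts reported in this chapter, not the `phi` function itself; the object names are hypothetical.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Cell counts: n11 = men/fail, n12 = men/pass, n21 = women/fail, n22 = women/pass
n11 <- observed[1, 1]; n12 <- observed[1, 2]
n21 <- observed[2, 1]; n22 <- observed[2, 2]

# Signed phi: (n11*n22 - n12*n21) / sqrt(product of the four marginal totals)
phi_signed <- (n11 * n22 - n12 * n21) /
  sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
print(phi_signed)

# Magnitude of phi recovered from the uncorrected chi-square statistic
chisq_stat <- unname(chisq.test(observed, correct = FALSE)$statistic)
phi_magnitude <- sqrt(chisq_stat / sum(observed))
print(phi_magnitude)
```

With this row and column ordering, the signed value is negative, and its magnitude matches the value printed above; the sign simply reflects how the categories happen to be ordered in the table.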
The phi (\(\phi\)) coefficient is -.07. Because this is an effect size, we can describe it qualitatively (e.g., small, medium, large). For a 2x2 contingency table, the phi (\(\phi\)) coefficient is equivalent to a Pearson product-moment correlation (r). Like a correlation (r) coefficient, the phi (\(\phi\)) coefficient can range from -1.00 (perfect negative association) to 1.00 (perfect positive association), with .00 indicating no association. Further, we can use the conventional correlation thresholds for describing the magnitude of the effect, which I display in the table below.
\(\phi\) | Description |
---|---|
.10 | Small |
.30 | Medium |
.50 | Large |
Because the absolute value of the phi (\(\phi\)) coefficient is .07 and thus falls below the .10 threshold for “small,” we might describe the effect as “negligible” or “very small.” Better yet, because the p-value associated with the \(\chi^2\) test of independence indicated that the association was nonsignificant, we should just conclude that there is no statistical association between whether someone passes or fails the cognitive ability test and whether they identify as a man or woman. In other words, we treat the effect size as though it were zero.
Finally, one limitation of the phi (\(\phi\)) coefficient is that the maximum attainable magnitude of the observed coefficient will be attenuated to the extent that the two categorical variables don’t have a 50/50 distribution (Dunleavy, Morris, and Howard 2015), which is often the case in the specific context of selection ratios and adverse impact. For example, if the proportion of applicants who passed the selection test is anything but .50, then the estimated \(\phi\) value will not have the potential to reach the limits of -1.00 or 1.00. Similarly, if the proportion of applicants from one group relative to another group is anything but .50, then the estimated \(\phi\) value will not have the potential to reach the limits of -1.00 or 1.00.
37.2.6 Fisher Exact Test
In instances in which we have a small sample size (e.g., N < 30) and/or one of the 2x2 cells has an expected frequency that is less than 5, the Fisher exact test is a more appropriate test than other tests like the chi-square (\(\chi^2\)) test of independence or the \(Z_{D}\) test (Office of Federal Contract Compliance Programs 1993). The Fisher exact test is very simple to compute when using the `fisher.test` function from base R. Simply enter the name of the table object (`observed`) as the sole parenthetical argument in the function.
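The call described above can be sketched as follows, again with the `observed` table rebuilt from this chapter’s reported counts so that the block is self-contained; `fisher_result` is a hypothetical object name used for illustration.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Compute Fisher exact test (two-sided by default)
fisher_result <- fisher.test(observed)
print(fisher_result)

# Pull out the p-value and the estimated odds ratio individually
fisher_result$p.value
fisher_result$estimate
```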
##
## Fisher's Exact Test for Count Data
##
## data: observed
## p-value = 0.2643
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.4441286 1.2622356
## sample estimates:
## odds ratio
## 0.7507933
In the output, we should focus our attention on the p-value and whether it is less than the conventional two-tailed alpha level of .05. In this case, the p-value is greater than .05, so we would conclude that there appears to be no statistical association between whether someone passes or fails the cognitive ability test and whether they identify as a man or woman. In other words, gender does not appear to have an effect on whether someone passes or fails this particular test, and thus there is no evidence of disparate impact based on this statistical test. This corroborates what we found when we tested the 4/5ths Rule above. If the p-value had been less than .05, then we would have concluded there is statistical evidence of disparate impact based on this statistical test.
If we’d like to compute an effect size as an indicator of practical significance for a significant Fisher exact test, then we can compute the phi (\(\phi\)) coefficient (see previous section) – or another effect size like an odds ratio.
37.2.7 \(Z_{D}\) Test
The \(Z_{D}\) test – or just \(Z\) test – offers another method for evaluating whether evidence of disparate impact exists; more specifically, it tests whether there is a significant difference between two selection ratios. The \(Z_{D}\) test is also referred to as the Two Standard Deviations Test. Of note, the \(Z_{D}\) test is statistically equivalent to a 2x2 chi-square (\(\chi^2\)) test of independence computed without a continuity correction, which means that they will yield the same two-tailed p-value.
To compute the \(Z_{D}\) test, we’ll use the contingency table (cross-tabulation) object called `observed` that we created previously. To begin, we’ll create objects called `N_men` and `N_women` that represent the total number of men and women in our sample, respectively. First, we’ll use matrix notation to reference row 1 of our contingency table (`observed[1,]`), which contains the number of men who passed and failed the cognitive ability test. Second, we’ll “wrap” row 1 from the matrix in the `sum` function from base R.
We’ll do the same thing to compute the total number of women in our sample, whose counts reside in row 2 of our contingency table.
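The two group totals described above can be sketched as follows, using a copy of the `observed` table rebuilt from this chapter’s reported counts so that the block runs on its own.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Total number of men: sum of row 1 (men who failed + men who passed)
N_men <- sum(observed[1, ])

# Total number of women: sum of row 2 (women who failed + women who passed)
N_women <- sum(observed[2, ])

print(N_men)
print(N_women)
```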
Now it’s time to compute the total selection ratio for the sample, irrespective of gender identity, and we’ll assign the total selection ratio to an object that we’ll name `SR_total` by using the `<-` operator. To the right of the `<-`, we will specify the ratio. First, we’ll specify the numerator (i.e., total number of individuals who passed the cognitive ability test) by computing the sum of column 2 in the `observed` contingency table (`sum(observed[,2])`). Second, we’ll divide the numerator (using the `/` operator) by the denominator (i.e., total number of individuals in the sample), which we obtain by computing the sum of all counts in our contingency table (`sum(observed)`). We’ll set aside the resulting `SR_total` object until we’re ready to plug it into the \(Z_{D}\) test formula.
# Calculate marginal (overall, total) selection ratio/rate
SR_total <- sum(observed[,2]) / sum(observed)
Our next objective is to convert the counts in the `observed` contingency table to proportions by row. Using the `prop.table` function from base R, we will type the name of the contingency object as the first argument, and type 1 as the second argument, where the latter argument requests proportions by row. Using the `<-` operator, we’ll assign the table with row proportions to an object that we’ll name `prop_observed`.
Now we’re ready to assign the selection ratios for men and women, respectively, to objects that we’ll call `SR_men` and `SR_women`, and we’ll accomplish this by using the `<-` operator. Our selection ratios for each gender identity have already been computed in the row proportions contingency table object that we created above (`prop_observed`). The selection ratio for men appears in the cell corresponding to row 1 and column 2 (i.e., the proportion of men who passed the test), and the selection ratio for women appears in the cell corresponding to row 2 and column 2 (i.e., the proportion of women who passed the test). We’ll grab these values from the `prop_observed` object using matrix notation and then assign them to their respective objects.
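The row-proportion and selection-ratio steps described above can be sketched as follows. As elsewhere, the `observed` table is rebuilt from this chapter’s reported counts so that the block is self-contained.

```r
# Rebuild the 2x2 table from the reported counts (stand-in for the xtabs result)
observed <- as.table(matrix(
  c(90, 70,
    72, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Man", "Woman"),
                  Cognitive_Test = c("Fail", "Pass"))
))

# Convert counts to proportions by row (margin 1), so each row sums to 1
prop_observed <- prop.table(observed, 1)
print(prop_observed)

# Selection ratio for men: row 1, column 2 (proportion of men who passed)
SR_men <- prop_observed[1, 2]

# Selection ratio for women: row 2, column 2 (proportion of women who passed)
SR_women <- prop_observed[2, 2]

print(SR_men)
print(SR_women)
```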
The time has come to apply the objects we created above to the \(Z_{D}\) test formula, which is shown below.
\(Z_{D} = \frac{SR_{women} - SR_{men}} {\sqrt{SR_{total}(1-SR_{total})(\frac{1}{N_{women}} + \frac{1}{N_{men}})}}\)
Let’s apply the objects we created using the mathematical operations detailed in the formula, and let’s assign the resulting \(Z\)-value to an object we’ll call `Z`.
# Compute Z-value
Z <- (SR_women - SR_men) /
sqrt(
(SR_total*(1 - SR_total) * ((1/N_women) + (1/N_men)))
)
Using the `abs` function from base R, let’s compute the absolute value of the `Z` object so that we can compare it to the critical \(Z\)-values for statistical significance.
## [1] 1.146481
The absolute value of the \(Z\)-value is 1.15, which falls below the critical value for significance of 1.96 for a two-tailed test with an alpha of .05, where alpha is our p-value cutoff for statistical significance. Based on this critical value, we can conclude that there is not a statistically significant difference between the selection ratios for men and women for the cognitive ability test based on this sample.
With that being said, some argue that a more liberal one-tailed test with alpha equal to .05 might be more appropriate, so as to avoid false negatives (Type II Error). The critical value for a one-tailed test with an alpha of .05 is 1.645. We can see that our observed \(Z\)-value of 1.15 is less than that critical value as well. Thus, at least with respect to the \(Z_{D}\) test, we can conclude that the two selection ratios are not statistically different from one another. More specifically, failure to exceed the critical values indicates that the difference between the two selection ratios for men and women is less than two standard deviations, which is where the other name for this test comes from (i.e., Two Standard Deviations Test).
In many cases, we might also want to report the exact p-value for the \(Z_{D}\) test. To calculate the exact p-value for a one-tailed test, we will plug the absolute value of our \(Z\)-value into the first argument of the `pnorm` function from base R; as the second argument, we will specify `lower.tail=FALSE` to indicate that we only want the upper tail of the distribution, thereby requesting a one-tailed test.
## [1] 0.1257981
To compute a two-tailed p-value, we will take the `pnorm` function from above and multiply the resulting p-value by 2 using the `*` (multiplication) operator.
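The absolute \(Z\)-value and the one- and two-tailed p-values described above can be sketched end to end as follows. The counts are typed in from this chapter’s reported cross-tabs output so that the block is self-contained, and `p_one_tailed` and `p_two_tailed` are hypothetical object names used for illustration.

```r
# Counts taken from the cross-tabs output reported in this chapter
pass_men <- 70; N_men <- 160
pass_women <- 42; N_women <- 114

# Selection ratios for each group and for the total sample
SR_men   <- pass_men / N_men
SR_women <- pass_women / N_women
SR_total <- (pass_men + pass_women) / (N_men + N_women)

# Compute Z-value for the Z_D test
Z <- (SR_women - SR_men) /
  sqrt(SR_total * (1 - SR_total) * ((1 / N_women) + (1 / N_men)))

# Absolute value of Z for comparison to critical values
abs(Z)

# One-tailed p-value: area in the upper tail beyond |Z|
p_one_tailed <- pnorm(abs(Z), lower.tail = FALSE)
print(p_one_tailed)

# Two-tailed p-value: double the one-tailed p-value
p_two_tailed <- 2 * pnorm(abs(Z), lower.tail = FALSE)
print(p_two_tailed)
```

Squaring `Z` reproduces the uncorrected \(\chi^2\) statistic reported earlier for these data, which is why the two-tailed p-value matches the one from the \(\chi^2\) test of independence.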
## [1] 0.2515962
Both the one- and two-tailed p-values are above the conventional alpha of .05, and thus both indicate that we should fail to reject the null hypothesis that there is zero difference between the population selection ratios for these two gender identities.
37.2.8 \(Z_{IR}\) Test
The \(Z_{IR}\) test is a more direct test of the 4/5ths Rule and its associated impact ratio (which, if you recall, is the selection ratio of one group divided by the selection ratio of another group) than, say, the chi-square (\(\chi^2\)) test of independence or the \(Z_{D}\) test. The \(Z_{IR}\) test also tends to be a more appropriate test when sample sizes are small, when the selection ratios for specific groups are low, and when the proportion of the focal group (e.g., women) relative to the overall sample is small (Finch, Edwards, and Wallace 2009; Morris 2001; Morris and Lobsenz 2000).
The formula for the \(Z_{IR}\) test is as follows.
\(Z_{IR} = \frac{ln(\frac{SR_{women}}{SR_{men}})} {\sqrt{(\frac{ 1-SR_{total}}{SR_{total}})(\frac{1}{N_{women}} + \frac{1}{N_{men}})}}\)
Fortunately, when we prepared to compute the \(Z_{D}\) test above, we created all of the necessary components (i.e., objects) needed to compute the \(Z_{IR}\) test. So let’s plug them into our formula and assign the resulting \(Z_{IR}\) to an object called `Z_IR`. Note that the `log` function from base R computes the natural logarithm.
# Compute Z-value
Z_IR <- log(SR_women/SR_men) /
sqrt(((1 - SR_total)/SR_total) * (1/N_women + 1/N_men))
Using the `abs` function from base R, let’s compute the absolute value of the `Z_IR` object so that we can compare it to the critical \(Z\)-values for statistical significance.
## [1] 1.16584
The absolute value of the \(Z\)-value is 1.17, which falls below the critical value for significance of 1.96 for a two-tailed test with an alpha of .05, where alpha is our p-value cutoff for statistical significance. Based on this critical value, we can conclude that the ratio of the selection ratios for men and women (i.e., the impact ratio) does not differ significantly from an impact ratio of 1.0, where the latter would indicate that the selection ratios are equal.
Like we did with the \(Z_{D}\) test, we can apply the more liberal one-tailed test with alpha equal to .05, where the critical value for a one-tailed test with an alpha of .05 is 1.645. We can see that our observed \(Z\)-value of 1.17 is also less than that critical value. Thus, with respect to the \(Z_{IR}\) test, we can conclude that there is no statistically significant relative difference between the selection ratios for men and women.
We might also want to report the exact p-value for the \(Z_{IR}\) test. To calculate the exact p-value for a one-tailed test, we will plug the absolute value of our \(Z\)-value into the first argument of the `pnorm` function from base R; as the second argument, we will specify `lower.tail=FALSE` to indicate that we only want the upper tail of the distribution, thereby requesting a one-tailed test.
## [1] 0.1218396
To compute a two-tailed p-value, we will take the pnorm function from above and multiply the resulting p-value by 2 using the * (multiplication) operator.
# Compute two-tailed p-value for Z-value
2 * pnorm(abs(Z_IR), lower.tail=FALSE)
## [1] 0.2436793
Both the one- and two-tailed p-values are above the conventional alpha of .05, and thus both indicate that we should fail to reject the null hypothesis that there is zero relative difference between the population selection ratios for men and women with respect to the cognitive ability test.
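The steps above can be gathered into a single function, sketched below under stated assumptions. The function name z_ir_test, its argument names, and the applicant counts are all hypothetical and are not the data used in this chapter. The sketch also assumes that the total selection ratio is the number of people who passed divided by the number who applied, pooled across both groups.

```r
# Hypothetical helper (not from this chapter): Z_IR test from pass and applicant counts
z_ir_test <- function(pass_focal, n_focal, pass_ref, n_ref) {
  sr_focal <- pass_focal / n_focal  # selection ratio for the focal group
  sr_ref   <- pass_ref / n_ref      # selection ratio for the reference group
  # Assumed definition: pooled selection ratio across both groups
  sr_total <- (pass_focal + pass_ref) / (n_focal + n_ref)

  # Same formula as the Z_IR chunk above
  z <- log(sr_focal / sr_ref) /
    sqrt(((1 - sr_total) / sr_total) * (1 / n_focal + 1 / n_ref))

  # One-tailed p-value from the upper tail; two-tailed is twice that
  p_one <- pnorm(abs(z), lower.tail = FALSE)

  list(impact_ratio = sr_focal / sr_ref,
       z = z,
       p_one_tailed = p_one,
       p_two_tailed = 2 * p_one)
}

# Made-up counts for illustration only: 40 of 100 women and 75 of 150 men pass
result <- z_ir_test(pass_focal = 40, n_focal = 100, pass_ref = 75, n_ref = 150)
result
```

Returning the impact ratio alongside the \(Z\)-value and both p-values keeps the effect size and the significance test together, which is convenient when reporting results against the 4/5ths Rule.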
37.2.8.1 Optional: Compute Confidence Intervals for \(Z_{IR}\) Test
Given that tests of disparate impact often involve small sample sizes and thus are susceptible to low statistical power, some have recommended that effect sizes and confidence intervals be used instead of interpreting statistical significance using a p-value (see Morris and Lobsenz 2000). Confidence intervals around an impact ratio estimate reflect sampling error and provide a range of possible values in which the population parameter (i.e., population impact ratio) may fall – or put differently, how the impact ratio will likely vary across different samples drawn from the same population.
To compute the confidence intervals for the \(Z_{IR}\) test, we’ll first need to compute the natural log of the impact ratio using the log function from base R. We’ll assign the natural log of the impact ratio to an object that we’ll call IR_log.
# Compute natural log of impact ratio (IR_log)
IR_log <- log(SR_women/SR_men)
Next, we will compute the standard error of the impact ratio (on the natural log scale) using the formula shown in the code chunk below, and we will assign it to an object that we’ll call SE_IR.
# Compute standard error of IR (SE_IR)
SE_IR <- sqrt(
((1 - SR_women) /
(N_women * SR_women) + (1 - SR_men)/(N_men * SR_men))
)
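Written in the same notation as the \(Z_{IR}\) formula above, the code chunk computes the quantity below. The subscript \(\ln(IR)\) is a label introduced here to emphasize that this standard error applies to the natural log of the impact ratio, which is why the confidence limits are built on the log scale before being converted back.
\(SE_{\ln(IR)} = \sqrt{\frac{1 - SR_{women}}{N_{women} \times SR_{women}} + \frac{1 - SR_{men}}{N_{men} \times SR_{men}}}\)
Unlike the denominator of the \(Z_{IR}\) formula, which uses the total selection ratio \(SR_{total}\), this standard error uses the separate selection ratios for each group.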
Using the standard error of the impact ratio object (SE_IR) and the natural log of the impact ratio object (IR_log), we can compute the lower and upper limits of a 95% confidence interval by subtracting and adding, respectively, the product of the standard error of the impact ratio and 1.96 from/to the natural log of the impact ratio.
# Compute bounds of 95% confidence interval for natural log
# of impact ratio
LCL_log <- IR_log - 1.96 * SE_IR # lower
UCL_log <- IR_log + 1.96 * SE_IR # upper
Finally, we can convert the lower and upper limits of the 95% confidence interval to the original scale of the impact ratio metric by exponentiating the lower and upper limits of the natural log of the impact ratio.
# Convert to scale of original IR metric
LCL <- exp(LCL_log) # lower
UCL <- exp(UCL_log) # upper
# Print the 95% confidence interval
print(LCL)
## [1] 0.6252696
print(UCL)
## [1] 1.134137
Thus, we can be 95% confident that the true population impact ratio falls somewhere between .63 and 1.13 (95% CI[.63, 1.13]), and we already know that the effect size (i.e., impact ratio) for this cognitive ability test with respect to gender is .84 based on our test of the 4/5ths Rule (see above). Note that if the population parameter were to fall at the lower limit of the confidence interval (.63), then we would conclude that there is disparate impact based on the 4/5ths Rule; however, if the population parameter were to fall at the upper limit of the confidence interval (1.13), we would conclude that there is no evidence of disparate impact based on the 4/5ths Rule. Thus, the confidence interval indicates that if we were to sample from the same population many more times, we would sometimes find a violation of the 4/5ths Rule and sometimes not.
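The reasoning in the preceding paragraph can be expressed as a few logical comparisons, sketched below. The confidence limits are the values reported above; the object names threshold, below_threshold, above_threshold, and includes_one are hypothetical and introduced only for illustration.

```r
# 95% confidence limits reported above
LCL <- 0.6252696
UCL <- 1.134137

# 4/5ths Rule cutoff for the impact ratio
threshold <- 0.80

# Does the interval extend below the 4/5ths Rule cutoff?
below_threshold <- LCL < threshold

# Does the interval also contain values at or above the cutoff?
above_threshold <- UCL >= threshold

# Does the interval contain 1.0 (i.e., equal selection ratios)?
includes_one <- LCL <= 1 & UCL >= 1

c(below_threshold = below_threshold,
  above_threshold = above_threshold,
  includes_one = includes_one)
```

All three comparisons are TRUE for these limits: the interval spans values on both sides of the .80 cutoff and also contains an impact ratio of 1.0.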
If we want, we can also compute the 90% confidence intervals for the impact ratio by swapping the 1.96 critical \(Z\)-value for a two-tailed test (alpha = .05) with the 1.645 critical \(Z\)-value for a one-tailed test (alpha = .05). Everything else will remain the same as when we computed the 95% confidence interval.
# Compute bounds of 90% confidence interval for natural log
# of impact ratio
LCL_log <- IR_log - 1.645 * SE_IR # lower
UCL_log <- IR_log + 1.645 * SE_IR # upper
# Convert to scale of original IR metric
LCL <- exp(LCL_log) # lower
UCL <- exp(UCL_log) # upper
# Print the 90% confidence interval
print(LCL)
## [1] 0.655915
print(UCL)
## [1] 1.081148
Based on these calculations, we can be 90% confident that the true population impact ratio falls somewhere between .66 and 1.08 (90% CI[.66, 1.08]).
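Because the 95% and 90% intervals differ only in the critical \(Z\)-value, the calculation can be wrapped in one function that takes the confidence level as an argument. This is a minimal sketch, not the chapter’s own code: the function name impact_ratio_ci, its arguments, and the selection ratios and sample sizes passed to it are hypothetical. It obtains the critical value from qnorm rather than hard-coding it, so the 90% interval uses roughly 1.6449 instead of the rounded 1.645.

```r
# Hypothetical helper (not from this chapter): confidence interval for the impact ratio
impact_ratio_ci <- function(sr_focal, n_focal, sr_ref, n_ref, level = 0.95) {
  # Natural log of the impact ratio
  ir_log <- log(sr_focal / sr_ref)

  # Standard error on the log scale (same formula as SE_IR above)
  se <- sqrt((1 - sr_focal) / (n_focal * sr_focal) +
               (1 - sr_ref) / (n_ref * sr_ref))

  # Two-sided critical Z-value for the requested confidence level
  z_crit <- qnorm(1 - (1 - level) / 2)

  # Build the limits on the log scale, then exponentiate back to the IR metric
  c(estimate = exp(ir_log),
    lower = exp(ir_log - z_crit * se),
    upper = exp(ir_log + z_crit * se))
}

# Made-up inputs for illustration only
ci_95 <- impact_ratio_ci(sr_focal = 0.40, n_focal = 100,
                         sr_ref = 0.50, n_ref = 150, level = 0.95)
ci_90 <- impact_ratio_ci(sr_focal = 0.40, n_focal = 100,
                         sr_ref = 0.50, n_ref = 150, level = 0.90)
ci_95
ci_90
```

With the same inputs, the 90% interval is narrower than the 95% interval and sits inside it, mirroring the pattern in the two intervals reported above.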