# Chapter 63 Identifying Pay Determinants & Evaluating Pay Equity Using Hierarchical Linear Regression

In this chapter, we will learn how to identify pay determinants using hierarchical linear regression. Identifying pay determinants is an important part of pay equity analyses.

## 63.1 Conceptual Overview

**Pay determinants** refer to any characteristics, behaviors, or conditions that are associated with the amount of pay employees receive. For example, in an organization that monetarily rewards those with longer lengths of service, length of service will likely be a pay determinant. Ideally, pay determinants should be legally defensible and should be related to job performance and other characteristics or behaviors that are valued by the organization. We can use hierarchical linear regression to identify pay determinants, evaluate their incremental validity, and determine whether there is evidence of pay inequity based on protected class variables.

### 63.1.1 Review of Hiearchical Linear Regression

**Hierarchical linear regression** is an extension of multiple linear regression, where the latter was introduced in the chapter on incremental validity. The approach simply means that we are building up to a full model in what are called **steps** or **blocks**, wherein each step or block we add more predictor variables to our model until we eventually reach whatever we have specified as our full model. There are different reasons why we might do this. For instance, we might want to know if, collectively, two or more additional predictor variables explain incremental variance in the outcome variable after accounting for the other predictor variables already in the model. In doing so, we can compare what are referred to as nested models. A **nested model** model is a model that is missing one or more of the predictor variables of a full model (i.e., larger model) that contains the same predictor variables as the nested model as well as additional predictor variables. In practice, it is fairly simple to implement hierarchical linear regression as long as you are already comfortable with multiple linear regression.

#### 63.1.1.1 Statistical Assumptions

Because we will be performing hierarchical linear regression using multiple linear regression, to estimate and interpret the associated models, we need to make sure we have satisfied the statistical assumptions of multiple linear regression, which are described here.

#### 63.1.1.2 Statistical Significance

For information regarding statistical significance in the context of multiple linear regression, please refer to the corresponding section from this chapter on incremental validity.

#### 63.1.1.3 Practical Significance

For information regarding practical significance in the context of multiple linear regression, please refer to the corresponding section from this chapter on incremental validity.

### 63.1.2 Conceptual Videos

For a more in-depth review of pay determinants, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/EMC2aBvjB2s

For an introduction to pay equity and how hierarchical linear regression can be applied to identify potential pay inequity, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/L4HysNZn97U

For a more in-depth review of hierarchical linear regression, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/Ney5uM6sWgY

#### 63.1.2.1 Sample Write-Up

Organizational leaders were interested in determining whether the pay raise percentages that Customer Service Representatives (CSRs) received showed evidence of pay inequity based on the protect characteristics of sex and race. To identify the pay determinants for the 132 CSRs, we performed followed a two-step hierarchical linear regression process suggested by Sady and colleagues (2017), and as advised, throughout the process we sought guidance from the organization’s legal counsel, an Equal Employment Opportunity (EEO) compliance employee, a senior compensation employee, and a compensation analyst.

In the first step, we estimated a multiple linear regression model in which the percentage pay raise variable served as the outcome variable and education level, job performance ratings, and seniority served as the merit-based predictor variables. We included education level, job performance ratings, and seniority as merit-based predictor variables because the senior compensation employee indicated that these factors are used by the organization to determine the size of CSRs’ pay raises. We found that both education level and job performance ratings were statistically significantly and positively associated with percentage pay raise received when statistically controlling other predictor variables in the model (*b* = .217, *p* < .001 and *b* = .296, *p* < .001, respectively), such that higher education levels and higher job performance ratings were associated with larger percentage pay raises; however, seniority was not not associated with percentage pay raise to a statistically significant extent (*b* = .217, *p* = .101). In terms of model fit, the unadjusted *R*^{2} and adjusted *R*^{2} values were .384 and .374, respectively, where the unadjusted *R*^{2} value indicates that approximately 38% of the variance in percentage pay raise received by CSRs was explained by their education level, job performance ratings, and seniority. As evidenced by the *F*-test, the model fit the data better than a model containing no predictor variables (*F* = 40.162, *p* < .001). In terms of practical significance, we can conclude that the first-step model’s fit to the data is large, as evidence by the unadjusted *R*^{2}.

In the second step, we added sex and race as protected-characteristics predictor variables to a multiple linear regression model that also included the same merit-based predictor variables as our first model (i.e., education level, job performance ratings, seniority); that is, our first-step model is nested within our second-step model. Both education level and job performance ratings remained statistically significantly and positively associated with the percentage pay raise received when controlling for other predictor variables in the model (*b* = .219, *p* < .001 and *b* = .283, *p* = .009, respectively), and seniority remained nonsignificant when controlling for the other predictor variables in the model (*b* = .205, *p* = .119). Regarding the protected class variables, we re-leveled both the sex and race variables to designate our chosen reference groups. For sex, we designated male employees as the reference group because the company employed more male employees than female employees and because, historically within the United States, males have tended to be the advantaged group in terms of pay. Similarly, for race, we designated white employees as the reference group because the company employed more white employees than other race groups and because, historically within the United States, white individuals tended to be the advantaged group in terms of pay. On the one hand, sex was not statistically significantly associated with the percentage pay raise received when controlling for the other variables in the model (*b* = .162, *p* = .167), which indicates that female CSRs were not, on average, paid differently than male CSRs when holding constant other predictor variables in the model. On the other hand, employee race was significantly associated with the percentage pay raise received, when statistically controlling for other predictor variables; specifically, white employees received significantly higher percentage pay raises, on average, than both Asian employees (*b* = -1.182, *p* < .001) and Black CSRs (*b* = -.595, *p* < .001). In terms of model fit, the unadjusted *R*^{2} and adjusted *R*^{2} values were .592 and .576, respectively, where the unadjusted *R*^{2} value indicates that approximately 59% of the variance in percentage pay raise received by CSRs was explained by their education level, job performance ratings, seniority, sex, and race. As evidenced by the *F*-test, the model fit the data better than a model containing no predictor variables (*F* = 36.613, *p* < .001). In terms of practical significance, we can conclude that the second-step model’s fit to the data is large, as evidence by the unadjusted *R*^{2}.

Because the first-step model is nested within the second-step model, we performed a nested model comparison and found that the second-step model fit the data statistically significantly better than the first-step model (*F* = 21.492, *p* < .001). In terms of practical significance, the difference in unadjusted *R*^{2} was .209, which indicates that the second-step model explained an additional 21% of the variance in percentage pay raise when compared to the first-step model, which can be considered a medium-to-large difference.

In sum, when statistically controlling for the merit-based predictor variables of education level, job performance ratings, and seniority, the protected characteristic variable of sex did not show evidence of incremental validity, whereas the protected variable of race did; these findings provide evidence of pay inequity based on race, but not based on sex, in terms of CSRs’ percentage pay raises. Moreover, collectively, sex and race explained an additional 21% of the variability in percentage pay raises.

## 63.2 Tutorial

This chapter’s tutorial demonstrates how to perform hierarchical linear regression in R.

### 63.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/OGXX3ldUMoI

### 63.2.2 Functions & Packages Introduced

Function | Package |
---|---|

`lm` |
base R |

`summary` |
base R |

`nrow` |
base R |

`model.frame` |
base R |

`anova` |
base R |

`tapply` |
base R |

`mean` |
base R |

### 63.2.3 Initial Steps

If you haven’t already, save the file called **“PayDeterminants.csv”** into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., `"H:/RWorkshop"`

). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from the my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the `setwd`

function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to *Session > Set Working Directory > Choose Directory…*. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

Next, read in the .csv data file called **“PayDeterminants.csv”** using your choice of read function. In this example, I use the `read_csv`

function from the `readr`

package (Wickham, Hester, and Bryan 2024). If you choose to use the `read_csv`

function, be sure that you have installed and accessed the `readr`

package using the `install.packages`

and `library`

functions. *Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once ever 1-3 months.* For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

```
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
```

```
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
pd <- read_csv("PayDeterminants.csv")
```

```
## Rows: 132 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (3): raise, educ, perf
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`## [1] "raise" "educ" "perf" "sex" "race"`

```
## spc_tbl_ [132 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ raise: num [1:132] 2.4 1.7 2.6 1.1 3.6 2.4 1 0.6 2.2 2.5 ...
## $ educ : num [1:132] 3 3 5 5 4 2 2 2 4 4 ...
## $ perf : num [1:132] 7.1 3.8 10 4 10 8.9 9.9 5.4 7.4 7.5 ...
## $ sex : chr [1:132] "female" "female" "male" "female" ...
## $ race : chr [1:132] "asian" "black" "black" "black" ...
## - attr(*, "spec")=
## .. cols(
## .. raise = col_double(),
## .. educ = col_double(),
## .. perf = col_double(),
## .. sex = col_character(),
## .. race = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
```

```
## # A tibble: 6 × 5
## raise educ perf sex race
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 2.4 3 7.1 female asian
## 2 1.7 3 3.8 female black
## 3 2.6 5 10 male black
## 4 1.1 5 4 female black
## 5 3.6 4 10 female black
## 6 2.4 2 8.9 male black
```

`## [1] 132`

There are 5 variables and 132 cases (i.e., employees) in the `pd`

data frame: `raise`

, `educ`

, `perf`

, `sex`

, and `race`

. Per the output of the `str`

(structure) function above, the `raise`

, `educ`

, and `perf`

variables are of type *numeric*, and the `sex`

and `race`

variables are of type *character*. The `raise`

variable represents the percentage annual raise each employee received. The `educ`

variable is ordinal but numerically coded and can be treated as a continuous variable in this context, where 1 = high school diploma or GED, 2 = some college, 3 = associate’s degree, 4 = bachelor’s degree, and 5 = graduate certificate or degree. The `perf`

variable contains the supervisor-rated annual performance rating scores, which can range from 1-10, with 10 indicating very high performance. The `sex`

variable refers to employee sex (female, male). The `race`

variable contains three level (asian, black, white).

### 63.2.4 Perform Hierarchical Linear Regression

I will demonstrate one approach to running hierarchical linear regression models, which is fairly straightforward to implement. The approach requires the use of the `lm`

and `anova`

functions from base R. Specifically, I will demonstrate hierarchical linear regression using a two-step (i.e., two-block) example. For both steps/blocks, the percentage pay raise (`raise`

) will be the outcome variable. In the first step/block, we will specify a model in which education level (`educ`

) and performance level (`perf`

) are the predictor variables, as these are expected to be acceptable determinants of pay raise percentage. In the second step/block, we will specify the full model in which sex (`sex`

) and race/ethnicity (`race`

) are added to the model as additional predictor variables, where `sex`

and `race`

are considered unacceptable determinants of pay raise percentage. Our goal will be two determine if the `sex`

and `race`

variables collectively explain a statistically significant amount of incremental variance in the outcome variable `raise`

after accounting for the variance explained by `educ`

and `perf`

. If we find significant incremental variance explained in `raise`

by `sex`

and `race`

, then we potentially have evidence of managers exhibiting implicit or explicit bias when determining pay raise percentages for employees.

*Note:* To save space, I don’t review the statistical assumption testing that should be conducted when estimating a multiple linear regression model. When analyzing real data, be sure to carry out those diagnostic tests, and pay close attention to collinearity. To learn how to test the statistical assumptions for a multiple linear regression model, please refer to the chapter on estimating the incremental validity of a selection tool using multiple linear regression.

#### 63.2.4.1 Step One (Block One)

In the first step (first block), we will specify a multiple linear regression model with `raise`

as the outcome variable and `educ`

and `perf`

as the predictor variables. We are going to use the `lm`

(linear model) function from base R instead of the `Regression`

or `reg.brief`

functions from `lessR`

because the models we create using the `lm`

function are a bit easier to work with (in my opinion) when it comes to hierarchical linear regression. Using the `<-`

assignment operator, let’s name the model object `step1`

to reference the fact that it is our first step in the hierarchical linear regression process. Next, type the name of the `lm`

function. As the first parenthetical argument, specify the following regression model: `raise ~ educ + perf`

. Note that the tilde `~`

symbol separates the outcome variable (left of the tilde) from the predictor variables (right of the tilde). As the second argument, type `data=`

followed by the name of the data frame object (`pd`

) in which our model variables can be found. On a subsequent line, type the name of the `summary`

function from base R, with the name of our regression model object (`step1`

) as the sole parenthetical argument. Finally, to determine the number of cases retained for the analysis (which becomes relevant if any data were missing on the focal variables), use the `nrow`

function from base R with the `model.frame`

function from base R as an argument. As the sole argument for the `model.frame`

function, type the name of the regression model object (`step1`

).

```
# Step 1 Model: Regress raise on educ and perf
step1 <- lm(raise ~ educ + perf, data=pd)
summary(step1)
```

```
##
## Call:
## lm(formula = raise ~ educ + perf, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02060 -0.51293 -0.06175 0.51531 2.32785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59071 0.33450 -1.766 0.079767 .
## educ 0.21700 0.06016 3.607 0.000441 ***
## perf 0.29603 0.03607 8.207 0.000000000000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7999 on 129 degrees of freedom
## Multiple R-squared: 0.3837, Adjusted R-squared: 0.3742
## F-statistic: 40.16 on 2 and 129 DF, p-value: 0.00000000000002752
```

`## [1] 132`

Let’s review the findings from the first step of our hierarchical linear regression. Using a sample of 132 employees, we estimated the first step in a two-step hierarchical linear regression. Specifically, we regressed the percentage pay raise received by employees (`raise`

) on their education level (`educ`

) and supervisor-rated performance (`perf`

). We found that both education level (*b* = .217, *p* < .001) and performance (*b* = .296, *p* < .001) were significantly and positively associated with the percentage raise received when controlling for one another. The (unadjusted multiple) *R*^{2} and adjusted *R*^{2} values are .384 and .374 (*F* = 40.16, *p* < .001); focusing on the unadjusted *R*^{2} value, approximately 38% of the variance in percentage pay raise received by employees is explained by their education level and supervisor-rated performance scores. Because the *F*-value associated with the model is statistically significant (*p* < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

#### 63.2.4.2 Step Two (Block Two)

In the second step (second block), we will specify a multiple linear regression model with `raise`

as the outcome variable and `educ`

and `perf`

as the predictor variables, but we will also add the predictor variables of `sex`

and `race`

. For this model, let’s name the model object `step2`

.

```
# Step 2 Model: Regress raise on educ, perf, sex, and race
step2 <- lm(raise ~ educ + perf + sex + race, data=pd)
summary(step2)
```

```
##
## Call:
## lm(formula = raise ~ educ + perf + sex + race, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65048 -0.42558 -0.00791 0.45156 2.32498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.22478 0.29597 -4.138 0.00006355901993 ***
## educ 0.21890 0.04959 4.414 0.00002159553969 ***
## perf 0.28282 0.02988 9.464 < 0.0000000000000002 ***
## sexmale 0.16160 0.11633 1.389 0.167236
## raceblack 0.58704 0.14852 3.953 0.000128 ***
## racewhite 1.18247 0.15405 7.676 0.00000000000394 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6583 on 126 degrees of freedom
## Multiple R-squared: 0.5923, Adjusted R-squared: 0.5761
## F-statistic: 36.61 on 5 and 126 DF, p-value: < 0.00000000000000022
```

`## [1] 132`

Let’s review the findings from the second step of our hierarchical linear regression. Using the same sample of 132 employees as the first step, we regressed the percentage pay raise received by employees (`raise`

) on their education level (`educ`

), supervisor-rated performance (`perf`

), sex (`sex`

), and race/ethnicity (`race`

). Both education level (*b* = .219, *p* < .001) and performance (*b* = .283, *p* < .001) remained significantly and positively associated with the percentage raise received when controlling for one another and the sex and race/ethnicity variables. On the one hand, employee sex was not significantly associated with the percentage pay raise received when controlling for the other variables in the model (*b* = .162, *p* = .167), indicating that females were not paid differently than males, on average. On the other hand, employee race/ethnicity was significantly associated with percentage pay raise received. Because the employee race/ethnicity variable (`race`

) is categorical and has three or more levels (i.e., asian, black, white), the `lm`

function uses a process called **dummy coding** (**contrast coding**) to make comparisons between the different races/ethnicities with respect to the `raise`

outcome variable. For more information on dummy coding, see the section called Optional: More Information on Dummy Coding below. By default, the `lm`

function orders the `race`

categories alphabetical order, and the first category alphabetically automatically serves as the reference category, which in this case is the *asian* category. The results indicate that *black* employees received significantly higher percentage pay raises than *asian* employees (*b* = .587, *p* < .001) when controlling for other predictor variables in the model, and likewise *white* employees received significantly higher percentage pay raises than *asian* employees (*b* = 1.182, *p* < .001). As you might have noticed, the output does not give us the comparison between *black* and *white* employees.

To compare those two groups, we need to re-level the race variable; in fact, we often re-level categorical protected class variables before estimating the second-step mutliple linear regression model in order to designate specific reference groups that have been historically or systemically advantaged or that have the largest proportion of employees in the organization. To do so, we will overwrite the existing `race`

variable in our `pd`

data frame object using the `<-`

operator. Using the `relevel`

function from base R, we will specify the `as.factor`

function from base R as the first argument; within the `as.factor`

function parentheses, we will specify the name of the data frame object `pd`

followed by the `$`

operator and the name of the `race`

variable. As the second argument in the `relevel`

function, we will specify the `ref=`

argument followed by the name of the category that we wish to set as the new reference group, which in this case is *white*.

```
# Re-level the race variable to make white the reference group
pd$race <- relevel(as.factor(pd$race), ref="white")
```

With the `race`

variable re-leveled, we are ready to re-estimate our multiple linear regression model.

```
# Step 2 Model: Regress raise on educ, perf, sex, and re-leveled race variable
step2 <- lm(raise ~ educ + perf + sex + race, data=pd)
summary(step2)
```

```
##
## Call:
## lm(formula = raise ~ educ + perf + sex + race, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65048 -0.42558 -0.00791 0.45156 2.32498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04231 0.29843 -0.142 0.887
## educ 0.21890 0.04959 4.414 0.00002159553969 ***
## perf 0.28282 0.02988 9.464 < 0.0000000000000002 ***
## sexmale 0.16160 0.11633 1.389 0.167
## raceasian -1.18247 0.15405 -7.676 0.00000000000394 ***
## raceblack -0.59543 0.13203 -4.510 0.00001465492514 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6583 on 126 degrees of freedom
## Multiple R-squared: 0.5923, Adjusted R-squared: 0.5761
## F-statistic: 36.61 on 5 and 126 DF, p-value: < 0.00000000000000022
```

After we re-leveled the race variable to make *white* employees the reference group, *black* employees and *asian* employees both received, no average, statistically significantly lower percentage pay raises than *white* employees when controlling for the other predictor variables in the model (*b* = -.595, *p* < .001 and *b* = -1.182, *p* < .001, respectively). All other results remain the same as the version of the model we estimated prior to re-leveling the race variable, with the exception of the model intercept (which is not of substantive interest) in these analyses.

In sum, there is evidence that employee race/ethnicity influenced manager decisions when it came to determining the pay raise percentages for employees, which could be evidence of implicit or explicit bias. The (unadjusted multiple) *R*^{2} and adjusted *R*^{2} values are .592 and .576 (*F* = 36.61, *p* < .001); focusing on the unadjusted *R*^{2} value, approximately 59% of the variance in percentage pay raise received by employees is explained by their education level, supervisor-rated performance scores, employee sex, and employee race/ethnicity. Because the *F*-value associated with the model is statistically significant (*p* < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

#### 63.2.4.3 Compare Nested Models

As a final step in our hierarchical linear regression, let’s use the `anova`

function from base R to do a nested model comparison between the `step1`

and `step2`

models. This will tell us whether adding `sex`

and `race`

to the model significantly improved model fit (i.e., explained incrementally more variance in the outcome) over and above `educ`

and `perf`

. Type the name of the `anova`

function. As the first argument, type the name of our first step model object (`step1`

), and as the second argument, type the name of our second step model object (`step2`

).

```
## Analysis of Variance Table
##
## Model 1: raise ~ educ + perf
## Model 2: raise ~ educ + perf + sex + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 129 82.541
## 2 126 54.604 3 27.936 21.488 0.00000000002637 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The *F*-value and its associated *p*-value (the latter of which is less than the conventional cutoff of .05) indicate that the second step model fits the data significantly better than the first step model. In other words, `sex`

and `race`

explain incremental variance in `raise`

beyond the variance already explained by `educ`

and `perf`

. In the context of a pay raise, this is of course concerning because an employee’s sex or race/ethnicity should not factor into the percentage raise they receive.

Finally, let’s reference the summary/output for each model and specifically the `r.squared`

values to determine what the change in *R*^{2} was from the first step to the second step. The change in *R*^{2} gives as an idea of how much (in terms of practical significance) the model fit improved when adding the additional variables (i.e., `sex`

, `race`

). Note that we use the `$`

symbol to indicate that we want the `r.squared`

value from the `summary(step2)`

and `summary(step1)`

output, and then we do a simple subtraction to get the change in *R*^{2}. Note that we typically only interpret the change in *R*^{2} if we find that the models fit the data significantly differently from one another based on the *F*-value and its associated *p*-value in the model comparison step.

`## [1] 0.2085752`

As you can see, the change in *R*^{2} is .209, which indicates that `sex`

and `race`

explain approximately 21% additional variance in `raise`

beyond the variance already explained by `educ`

and `perf`

. If we consider our conventional rules of thumb for interpreting *R*^{2} as an effect size (i.e., indicator of practical significance), this is medium-to-large change, which is impressive in terms of magnitude (and deeply concerning in this specific context).

R^{2} |
Description |
---|---|

.01 | Small |

.09 | Medium |

.25 | Large |

#### 63.2.4.4 Optional: More Information on Dummy Coding

**Dummy coding** (**contrast coding**) is used in regression when we have a categorical predictor variable. To illustrate dummy coding, I reference the results from Step 2 above. In a simple two-level/category categorical variable such as `sex`

(*female*, *male*) just one dummy variable is needed to represent `sex`

, and by default, the `lm`

function uses the name of the level/category that comes first alphabetically as the reference group, setting it equal to 0 (e.g., *female* = 0); in turn, the name of the level/category that comes next alphabetically is used as the focal group and is set equal to 1 (e.g., *male* = 1). Using this coding, a positive and significant regression coefficient for *male* (i.e., *sexmale*) would indicate that male employees have higher scores/values on the outcome variable. Things get a little bit trickier when we have a categorical variable with three or more categories/levels. Using the present findings as an example, behind the scenes, the `lm`

function has created two dichotomous “dummy” variables with respect to the `race`

variable, both of which reference the focal level/category (e.g., *raceblack*) as 1 and all other levels/categories (e.g., *asian*, *white*) as 0. The number of dummy variables created to analyze a categorical variable is equal to the number of categories/levels (*k*) minus 1 (*k* - 1); in this example, we have three categories/levels (*asian*, *black*, *white*), and thus two dummy variables were created behind the scenes. Only two dummy variables are needed for a categorical variable with three levels/categories because it can be inferred what a third dummy variable’s values would be if we have two dummy variables’ values. In this example, because the *asian* level/category comes first alphabetically, it is treated as the reference group, which means its mean value is reflected in the intercept/constant value of -1.22478 when controlling for other variables in the model. We can interpret each dummy variable regression coefficient as the difference between the mean of the focal level/category (e.g., *raceblack*) and the intercept (e.g., *asian*) when controlling for other variables in the model. Thus, in the second step model above, *black* employees received significantly higher percentage pay raises than *asian* employees (*b* = .587, *p* < .001) when controlling for other predictor variables in the model, and likewise *white* employees received significantly higher percentage pay raises than *asian* employees (*b* = 1.182, *p* < .001).

Let’s dig deeper into dummy coding and illustrate its connection to analysis of variance (ANOVA). For solely explanatory purposes, let’s take a look at a model that only contains the dummy variables for the `race`

variable in relation to the `raise`

variable. Let’s name this model object `mod`

.

```
# Example model with dummy coding: Regress raise on race
mod <- lm(raise ~ race, data=pd)
summary(mod)
```

```
##
## Call:
## lm(formula = raise ~ race, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78478 -0.59925 0.00645 0.61522 2.31522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8848 0.1313 21.965 < 0.0000000000000002 ***
## raceasian -1.2912 0.2070 -6.238 0.00000000586 ***
## raceblack -0.6684 0.1780 -3.756 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8908 on 129 degrees of freedom
## Multiple R-squared: 0.2358, Adjusted R-squared: 0.224
## F-statistic: 19.9 on 2 and 129 DF, p-value: 0.00000002925
```

Note that in this model, the intercept/constant value of 1.594 represents the mean percentage raise for *asian* employees. The regression coefficient associated with *raceblack* (*b* = .623) represents the mean percentage raise for *black* employees minus the mean percentage raise for *asian* employees; because the values is positive, it means that *black* employees have a descriptively higher mean than *asian* employees. Next, the regression coefficient associated with *racewhite* (*b* = 1.291) represents the mean percentage raise for *black* employees minus the mean percentage raise for *asian* employees; because the values is positive, it means that *white* employees also have a descriptively higher mean than *asian* employees.

We can confirm that the regression coefficient values mean outcome for each level of the categorical variable by using the `tapply`

and `mean`

functions to calculate the mean for the `raise`

variable for each level of the `race`

variable.

```
# Mean values on outcome variable for each level of categorical variable
tapply(pd$raise, pd$race, mean)
```

```
## white asian black
## 2.884783 1.593548 2.216364
```

Note that the intercept/constant value for above, which represents the mean percentage raise for the reference group (*asian* employees), is 1.594; this is the same as the mean we just calculated. In addition, if we add the regression coefficient for the dummy variable associated with *black* employees to the mean of the reference group (*asian* employees), we get 2.216 (1.5935 + .6228 = 2.2164), which is equal to the mean for *black* employees we calculated. Finally, if we add the regression coefficient for the dummy variable associated with *white* employees to the mean of the reference group (*asian* employees), we get 2.885 (1.5935 + 1.2912 = 2.8848), which is equal to the mean for *white* employees we calculated.

Next, using the `anova`

function from base R, we can interpret the regression model (`mod`

) as a one-way ANOVA.

```
## Analysis of Variance Table
##
## Response: raise
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 31.585 15.7927 19.904 0.00000002925 ***
## Residuals 129 102.353 0.7934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The row in the table associated with *race* can be interpreted as an omnibus *F*-test for a one-way ANOVA. As you can see, the `race`

variable explained a significant amount of the variability in `raise`

(*F* = 19.904, *p* < .001). However, like any omnibus test with three or more levels for the categorical predictor variable, we don’t know yet where the significant mean differences exist. The summary of the regression model above, provides us with information about those mean comparisons in the form of regression coefficients. In addition to illustrating how dummy coding works, this also shows that a one-way ANOVA (and other types of ANOVA) can be implemented and interpreted in a regression framework. This just goes to show that ANOVA is a specific type of regression. Please note, however, that this ANOVA model did not include the merit-based predictor variables and the protected characteristic variable of sex, so its results cannot be compared directly to the results of our Step 2 multiple linear regression model above.

Finally, as a note of caution, it is not uncommon for a dummy variable (or multiple dummy variables) to be imbalanced, meaning that one category has disproportionately more representation in the data than the other. As such, it’s generally a good idea to compute the counts/frequencies for a categorical variable to determine if there are any categories that have disproportionately lower representation than others. If you do find that some categories have very few cases, then those categories should probably be filtered out prior to estimating a model wherein dummy variables will be created.

### 63.2.5 Summary

In this chapter, we learned how to perform hierarchical linear regression in order to identify pay determinants. Hierarchical linear regression is a specific application of multiple linear regression in which we build up to a final/full model in what are often called steps or blocks. Specifically, we are comparing nested models. In this tutorial, we learned how to use the `lm`

and `anova`

functions from base R to conduct hierarchical linear regression.

### References

*National Conference on EEO Law*.

*Readr: Read Rectangular Text Data*. https://CRAN.R-project.org/package=readr.