# Chapter 59 Identifying Pay Determinants Using Hierarchical Linear Regression

In this chapter, we will learn how to identify pay determinants using hierarchical linear regression.

## 59.1 Conceptual Overview

Pay determinants refer to any characteristics, behaviors, or conditions that are associated with the amount of pay employees receive. For example, in an organization that monetarily rewards those with longer lengths of service, length of service will likely be a pay determinant. Ideally, pay determinants should be legally defensible and should be related to job performance and other characteristics or behaviors that are valued by the organization. We can use hierarchical linear regression to identify pay determinants and evaluate their incremental validity.

### 59.1.1 Review of Hiearchical Linear Regression

**Hierarchical linear regression** is an extension of multiple linear regression, where the latter was introduced in the chapter on incremental validity. The approach simply means that we are building up to a full model in what are called **steps** or **blocks**, wherein each step or block we add more predictor variables to our model until we eventually reach whatever we have specified as our full model. There are different reasons why we might do this. For instance, we might want to know if, collectively, two or more additional predictor variables explain incremental variance in the outcome variable after accounting for the other predictor variables already in the model. In doing so, we can compare what are referred to as nested models. A **nested model** model is a model that is missing one or more of the predictor variables of a full model (i.e., larger model) that contains the same predictor variables as the nested model as well as additional predictor variables. In practice, it is fairly simple to implement hierarchical linear regression as long as you are already comfortable with multiple linear regression.

#### 59.1.1.1 Statistical Assumptions

Because we will be performing hierarchical linear regression using multiple linear regression, to estimate and interpret the associated models, we need to make sure we have satisfied the statistical assumptions of multiple linear regression, which are described here.

#### 59.1.1.2 Statistical Significance

For information regarding statistical significance in the context of multiple linear regression, please refer to the corresponding section from this chapter on incremental validity.

#### 59.1.1.3 Practical Significance

For information regarding practical significance in the context of multiple linear regression, please refer to the corresponding section from this chapter on incremental validity.

### 59.1.2 Conceptual Videos

For a more in-depth review of pay determinants, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/EMC2aBvjB2s

For a more in-depth review of hierarchical linear regression, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/Ney5uM6sWgY

## 59.2 Tutorial

This chapter’s tutorial demonstrates how to perform hierarchical linear regression in R.

### 59.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/OGXX3ldUMoI

### 59.2.2 Functions & Packages Introduced

Function | Package |
---|---|

`lm` |
base R |

`summary` |
base R |

`nrow` |
base R |

`model.frame` |
base R |

`anova` |
base R |

`tapply` |
base R |

`mean` |
base R |

### 59.2.3 Initial Steps

If you haven’t already, save the file called **“PayDeterminants.csv”** into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., `"H:/RWorkshop"`

). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from the my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the `setwd`

function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to *Session > Set Working Directory > Choose Directory…*. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

```
# Set your working directory
setwd("H:/RWorkshop")
```

Next, read in the .csv data file called **“PayDeterminants.csv”** using your choice of read function. In this example, I use the `read_csv`

function from the `readr`

package (Wickham, Hester, and Bryan 2022). If you choose to use the `read_csv`

function, be sure that you have installed and accessed the `readr`

package using the `install.packages`

and `library`

functions. *Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once ever 1-3 months.* For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

```
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
```

```
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
<- read_csv("PayDeterminants.csv") pd
```

```
## Rows: 132 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (3): raise, educ, perf
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# Print the names of the variables in the data frame (tibble) object
names(pd)
```

`## [1] "raise" "educ" "perf" "sex" "race"`

```
# Print variable type for each variable in data frame (tibble) object
str(pd)
```

```
## spec_tbl_df [132 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ raise: num [1:132] 2.4 1.7 2.6 1.1 3.6 2.4 1 0.6 2.2 2.5 ...
## $ educ : num [1:132] 3 3 5 5 4 2 2 2 4 4 ...
## $ perf : num [1:132] 7.1 3.8 10 4 10 8.9 9.9 5.4 7.4 7.5 ...
## $ sex : chr [1:132] "female" "female" "male" "female" ...
## $ race : chr [1:132] "asian" "black" "black" "black" ...
## - attr(*, "spec")=
## .. cols(
## .. raise = col_double(),
## .. educ = col_double(),
## .. perf = col_double(),
## .. sex = col_character(),
## .. race = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
```

```
# Print first 6 rows of data frame (tibble) object
head(pd)
```

```
## # A tibble: 6 × 5
## raise educ perf sex race
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 2.4 3 7.1 female asian
## 2 1.7 3 3.8 female black
## 3 2.6 5 10 male black
## 4 1.1 5 4 female black
## 5 3.6 4 10 female black
## 6 2.4 2 8.9 male black
```

```
# Print number of rows in data frame (tibble) object
nrow(pd)
```

`## [1] 132`

There are 5 variables and 132 cases (i.e., employees) in the `pd`

data frame: `raise`

, `educ`

, `perf`

, `sex`

, and `race`

. Per the output of the `str`

(structure) function above, the `raise`

, `educ`

, and `perf`

variables are of type *numeric*, and the `sex`

and `race`

variables are of type *character*. The `raise`

variable represents the percentage annual raise each employee received. The `educ`

variable is ordinal but numerically coded and can be treated as a continuous variable in this context, where 1 = high school diploma or GED, 2 = some college, 3 = associate’s degree, 4 = bachelor’s degree, and 5 = graduate certificate or degree. The `perf`

variable contains the supervisor-rated annual performance rating scores, which can range from 1-10, with 10 indicating very high performance. The `sex`

variable refers to employee sex (female, male). The `race`

variable contains three level (asian, black, white).

### 59.2.4 Perform Hierarchical Linear Regression

I will demonstrate one approach to running hierarchical linear regression models, which is fairly straightforward to implement. The approach requires the use of the `lm`

and `anova`

functions from base R. Specifically, I will demonstrate hierarchical linear regression using a two-step (i.e., two-block) example. For both steps/blocks, the percentage pay raise (`raise`

) will be the outcome variable. In the first step/block, we will specify a model in which education level (`educ`

) and performance level (`perf`

) are the predictor variables, as these are expected to be acceptable determinants of pay raise percentage. In the second step/block, we will specify the full model in which sex (`sex`

) and race/ethnicity (`race`

) are added to the model as additional predictor variables, where `sex`

and `race`

are considered unacceptable determinants of pay raise percentage. Our goal will be two determine if the `sex`

and `race`

variables collectively explain a statistically significant amount of incremental variance in the outcome variable `raise`

after accounting for the variance explained by `educ`

and `perf`

. If we find significant incremental variance explained in `raise`

by `sex`

and `race`

, then we potentially have evidence of managers exhibiting implicit or explicit bias when determining pay raise percentages for employees.

*Note: To save space, I don’t review the statistical assumption testing that should be conducted when estimating a multiple linear regression model. When analyzing real data, be sure to carry out those diagnostic tests.*

#### 59.2.4.1 Step One (Block One)

In the first step (first block), we will specify a multiple linear regression model with `raise`

as the outcome variable and `educ`

and `perf`

as the predictor variables. We are going to use the `lm`

(linear model) function from base R instead of the `Regression`

or `reg.brief`

functions from `lessR`

because the models we create using the `lm`

function are a bit easier to work with (in my opinion) when it comes to hierarchical linear regression. Using the `<-`

assignment operator, let’s name the model object `step1`

to reference the fact that it is our first step in the hierarchical linear regression process. Next, type the name of the `lm`

function. As the first parenthetical argument, specify the following regression model: `raise ~ educ + perf`

. Note that the tilde `~`

symbol separates the outcome variable (left of the tilde) from the predictor variables (right of the tilde). As the second argument, type `data=`

followed by the name of the data frame object (`pd`

) in which our model variables can be found. On a subsequent line, type the name of the `summary`

function from base R, with the name of our regression model object (`step1`

) as the sole parenthetical argument. Finally, to determine the number of cases retained for the analysis (which becomes relevant if any data were missing on the focal variables), use the `nrow`

function from base R with the `model.frame`

function from base R as an argument. As the sole argument for the `model.frame`

function, type the name of the regression model object (`step1`

).

```
# Step 1 Model: Regress raise on educ and perf
<- lm(raise ~ educ + perf, data=pd)
step1 summary(step1)
```

```
##
## Call:
## lm(formula = raise ~ educ + perf, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02060 -0.51293 -0.06175 0.51531 2.32785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59071 0.33450 -1.766 0.079767 .
## educ 0.21700 0.06016 3.607 0.000441 ***
## perf 0.29603 0.03607 8.207 0.000000000000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7999 on 129 degrees of freedom
## Multiple R-squared: 0.3837, Adjusted R-squared: 0.3742
## F-statistic: 40.16 on 2 and 129 DF, p-value: 0.00000000000002752
```

```
# Number of cases retained for the analysis
nrow(model.frame(step1))
```

`## [1] 132`

Let’s review the findings from the first step of our hierarchical linear regression. Using a sample of 132 employees, we estimated the first step in a two-step hierarchical linear regression. Specifically, we regressed the percentage pay raise received by employees (`raise`

) on their education level (`educ`

) and supervisor-rated performance (`perf`

). We found that both education level (*b* = .217, *p* < .001) and performance (*b* = .296, *p* < .001) were significantly and positively associated with the percentage raise received when controlling for one another. The (unadjusted multiple) *R*^{2} and adjusted *R*^{2} values are .384 and .374 (*F* = 40.16, *p* < .001); focusing on the unadjusted *R*^{2} value, approximately 38% of the variance in percentage pay raise received by employees is explained by their education level and supervisor-rated performance scores. Because the *F*-value associated with the model is statistically significant (*p* < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

#### 59.2.4.2 Step Two (Block Two)

In the second step (second block), we will specify a multiple linear regression model with `raise`

as the outcome variable and `educ`

and `perf`

as the predictor variables, but we will also add the predictor variables of `sex`

and `race`

. For this model, let’s name the model object `step2`

.

```
# Step 2 Model: Regress raise on educ, perf, sex, and race
<- lm(raise ~ educ + perf + sex + race, data=pd)
step2 summary(step2)
```

```
##
## Call:
## lm(formula = raise ~ educ + perf + sex + race, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65048 -0.42558 -0.00791 0.45156 2.32498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.22478 0.29597 -4.138 0.00006355901993 ***
## educ 0.21890 0.04959 4.414 0.00002159553969 ***
## perf 0.28282 0.02988 9.464 < 0.0000000000000002 ***
## sexmale 0.16160 0.11633 1.389 0.167236
## raceblack 0.58704 0.14852 3.953 0.000128 ***
## racewhite 1.18247 0.15405 7.676 0.00000000000394 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6583 on 126 degrees of freedom
## Multiple R-squared: 0.5923, Adjusted R-squared: 0.5761
## F-statistic: 36.61 on 5 and 126 DF, p-value: < 0.00000000000000022
```

```
# Number of cases retained for the analysis
nrow(model.frame(step2))
```

`## [1] 132`

Let’s review the findings from the second step of our hierarchical linear regression. Using the same sample of 132 employees as the first step, we regressed the percentage pay raise received by employees (`raise`

) on their education level (`educ`

), supervisor-rated performance (`perf`

), sex (`sex`

), and race/ethnicity (`race`

). Both education level (*b* = .219, *p* < .001) and performance (*b* = .283, *p* < .001) remained significantly and positively associated with the percentage raise received when controlling for one another and the sex and race/ethnicity variables. On the one hand, employee sex was not significantly associated with the percentage pay raise received when controlling for the other variables in the model (*b* = .162, *p* = .167). On the other hand, employee race/ethnicity was significantly associated with percentage pay raise received. Because the employee race/ethnicity variable (`race`

) is categorical and has three or more levels (i.e., asian, black, white), the `lm`

function uses a process called **dummy coding** (**contrast coding**) to make comparisons between the different races/ethnicities with respect to the `raise`

outcome variable. For more information on dummy coding, see the section called Optional: More Information on Dummy Coding below. The results indicate that *black* employees received significantly higher percentage pay raises than *asian* employees (*b* = .587, *p* < .001) when controlling for other predictor variables in the model, and likewise *white* employees received significantly higher percentage pay raises than *asian* employees (*b* = 1.182, *p* < .001). Thus, there is evidence that employee race/ethnicity influenced manager decisions when it came to determining the pay raise percentage for employees, which could be evidence of implicit or explicit bias. The (unadjusted multiple) *R*^{2} and adjusted *R*^{2} values are .592 and .576 (*F* = 36.61, *p* < .001); focusing on the unadjusted *R*^{2} value, approximately 59% of the variance in percentage pay raise received by employees is explained by their education level, supervisor-rated performance scores, employee sex, and employee race/ethnicity. Because the *F*-value associated with the model is statistically significant (*p* < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

#### 59.2.4.3 Compare Nested Models

As a final step in our hierarchical linear regression, let’s use the `anova`

function from base R to do a nested model comparison between the `step1`

and `step2`

models. This will tell us whether adding `sex`

and `race`

to the model significantly improved model fit (i.e., explained incrementally more variance in the outcome) over and above `educ`

and `perf`

. Type the name of the `anova`

function. As the first argument, type the name of our first step model object (`step1`

), and as the second argument, type the name of our second step model object (`step2`

).

```
# Nested Model Comparison: Step 1 vs. Step 2
anova(step1, step2)
```

```
## Analysis of Variance Table
##
## Model 1: raise ~ educ + perf
## Model 2: raise ~ educ + perf + sex + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 129 82.541
## 2 126 54.604 3 27.936 21.488 0.00000000002637 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The *F*-value and its associated *p*-value (the latter of which is less than the conventional cutoff of .05) indicate that the second step model fits the data significantly better than the first step model. In other words, `sex`

and `race`

explain incremental variance in `raise`

beyond the variance already explained by `educ`

and `perf`

. In the context of a pay raise, this is of course concerning because an employee’s sex or race/ethnicity should not factor into the percentage raise they receive.

Finally, let’s reference the summary/output for each model and specifically the `r.squared`

values to determine what the change in *R*^{2} was from the first step to the second step. The change in *R*^{2} gives as an idea of how much (in terms of practical significance) the model fit improved when adding the additional variables (i.e., `sex`

, `race`

). Note that we use the `$`

symbol to indicate that we want the `r.squared`

value from the `summary(step2)`

and `summary(step1)`

output, and then we do a simple subtraction to get the change in *R*^{2}. Note that we typically only interpret the change in *R*^{2} if we find that the models fit the data significantly differently from one another based on the *F*-value and its associated *p*-value in the model comparison step.

```
# Change in R-squared
summary(step2)$r.squared - summary(step1)$r.squared
```

`## [1] 0.2085752`

As you can see, the change in *R*^{2} is .209, which indicates that `sex`

and `race`

explain approximately 21% additional variance in `raise`

beyond the variance already explained by `educ`

and `perf`

. If we consider our conventional rules of thumb for interpreting *R*^{2} as an effect size (i.e., indicator of practical significance), this is medium-to-large change, which is impressive in terms of magnitude (and deeply concerning in this specific context).

R^{2} |
Description |
---|---|

.01 | Small |

.09 | Medium |

.25 | Large |

#### 59.2.4.4 Optional: More Information on Dummy Coding

**Dummy coding** (**contrast coding**) is used in regression when we have a categorical predictor variable. To illustrate dummy coding, I reference the results from Step 2 above. In a simple two-level/category categorical variable such as `sex`

(*female*, *male*) just one dummy variable is needed to represent `sex`

, and by default, the `lm`

function uses the name of the level/category that comes first alphabetically as the reference group, setting it equal to 0 (e.g., *female* = 0); in turn, the name of the level/category that comes next alphabetically is used as the focal group and is set equal to 1 (e.g., *male* = 1). Using this coding, a positive and significant regression coefficient for *male* (i.e., *sexmale*) would indicate that male employees have higher scores/values on the outcome variable. Things get a little bit trickier when we have a categorical variable with three or more categories/levels. Using the present findings as an example, behind the scenes, the `lm`

function has created two dichotomous “dummy” variables with respect to the `race`

variable, both of which reference the focal level/category (e.g., *raceblack*) as 1 and all other levels/categories (e.g., *asian*, *white*) as 0. The number of dummy variables created to analyze a categorical variable is equal to the number of categories/levels (*k*) minus 1 (*k* - 1); in this example, we have three categories/levels (*asian*, *black*, *white*), and thus two dummy variables were created behind the scenes. Only two dummy variables are needed for a categorical variable with three levels/categories because it can be inferred what a third dummy variable’s values would be if we have two dummy variables’ values. In this example, because the *asian* level/category comes first alphabetically, it is treated as the reference group, which means its mean value is reflected in the intercept/constant value of -1.22478 when controlling for other variables in the model. We can interpret each dummy variable regression coefficient as the difference between the mean of the focal level/category (e.g., *raceblack*) and the intercept (e.g., *asian*) when controlling for other variables in the model. Thus, in the second step model above, *black* employees received significantly higher percentage pay raises than *asian* employees (*b* = .587, *p* < .001) when controlling for other predictor variables in the model, and likewise *white* employees received significantly higher percentage pay raises than *asian* employees (*b* = 1.182, *p* < .001).

Let’s dig deeper into dummy coding and illustrate its connection to analysis of variance (ANOVA). For solely explanatory purposes, let’s take a look at a model that only contains the dummy variables for the `race`

variable in relation to the `raise`

variable. Let’s name this model object `mod`

.

```
# Example model with dummy coding: Regress raise on race
<- lm(raise ~ race, data=pd)
mod summary(mod)
```

```
##
## Call:
## lm(formula = raise ~ race, data = pd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78478 -0.59925 0.00645 0.61522 2.31522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5935 0.1600 9.961 < 0.0000000000000002 ***
## raceblack 0.6228 0.2001 3.113 0.00228 **
## racewhite 1.2912 0.2070 6.238 0.00000000586 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8908 on 129 degrees of freedom
## Multiple R-squared: 0.2358, Adjusted R-squared: 0.224
## F-statistic: 19.9 on 2 and 129 DF, p-value: 0.00000002925
```

Note that in this model, the intercept/constant value of 1.594 represents the mean percentage raise for *asian* employees. The regression coefficient associated with *raceblack* (*b* = .623) represents the mean percentage raise for *black* employees minus the mean percentage raise for *asian* employees; because the values is positive, it means that *black* employees have a descriptively higher mean than *asian* employees. Next, the regression coefficient associated with *racewhite* (*b* = 1.291) represents the mean percentage raise for *black* employees minus the mean percentage raise for *asian* employees; because the values is positive, it means that *white* employees also have a descriptively higher mean than *asian* employees.

We can confirm that the regression coefficient values mean outcome for each level of the categorical variable by using the `tapply`

and `mean`

functions to calculate the mean for the `raise`

variable for each level of the `race`

variable.

```
# Mean values on outcome variable for each level of categorical variable
tapply(pd$raise, pd$race, mean)
```

```
## asian black white
## 1.593548 2.216364 2.884783
```

Note that the intercept/constant value for above, which represents the mean percentage raise for the reference group (*asian* employees), is 1.594; this is the same as the mean we just calculated. In addition, if we add the regression coefficient for the dummy variable associated with *black* employees to the mean of the reference group (*asian* employees), we get 2.216 (1.5935 + .6228 = 2.2164), which is equal to the mean for *black* employees we calculated. Finally, if we add the regression coefficient for the dummy variable associated with *white* employees to the mean of the reference group (*asian* employees), we get 2.885 (1.5935 + 1.2912 = 2.8848), which is equal to the mean for *white* employees we calculated.

Next, using the `anova`

function from base R, we can interpret the regression model (`mod`

) as a one-way ANOVA.

```
# Display ANOVA results
anova(mod)
```

```
## Analysis of Variance Table
##
## Response: raise
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 31.585 15.7927 19.904 0.00000002925 ***
## Residuals 129 102.353 0.7934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The row in the table associated with *race* can be interpreted as an omnibus *F*-test for a one-way ANOVA. As you can see, the `race`

variable explained a significant amount of the variability in `raise`

(*F* = 19.904, *p* < .001). However, like any omnibus test with three or more levels for the categorical predictor variable, we don’t know yet where the significant mean differences exist. The summary of the regression model above, provides us with information about those mean comparisons in the form of regression coefficients. In addition to illustrating how dummy coding works, this also shows that a one-way ANOVA (and other types of ANOVA) can be implemented and interpreted in a regression framework. This just goes to show that ANOVA is a specific type of regression.

### 59.2.5 Summary

In this chapter, we learned how to perform hierarchical linear regression in order to identify pay determinants. Hierarchical linear regression is a specific application of multiple linear regression in which we build up to a final/full model in what are often called steps or blocks. Specifically, we are comparing nested models. In this tutorial, we learned how to use the `lm`

and `anova`

functions from base R to conduct hierarchical linear regression.

### References

*Readr: Read Rectangular Text Data*. https://CRAN.R-project.org/package=readr.