Chapter 59 Identifying Pay Determinants Using Hierarchical Linear Regression

In this chapter, we will learn how to identify pay determinants using hierarchical linear regression.

59.1 Conceptual Overview

Pay determinants refer to any characteristics, behaviors, or conditions that are associated with the amount of pay employees receive. For example, in an organization that monetarily rewards those with longer lengths of service, length of service will likely be a pay determinant. Ideally, pay determinants should be legally defensible and should be related to job performance and other characteristics or behaviors that are valued by the organization. We can use hierarchical linear regression to identify pay determinants and evaluate their incremental validity.

59.1.1 Review of Hierarchical Linear Regression

Hierarchical linear regression is an extension of multiple linear regression, which was introduced in the chapter on incremental validity. The approach simply means that we build up to a full model in what are called steps or blocks, where in each step or block we add more predictor variables to our model until we reach whatever we have specified as our full model. There are different reasons why we might do this. For instance, we might want to know whether, collectively, two or more additional predictor variables explain incremental variance in the outcome variable after accounting for the predictor variables already in the model. In doing so, we compare what are referred to as nested models. A nested model is a model that is missing one or more of the predictor variables of a full (i.e., larger) model, where the full model contains the same predictor variables as the nested model plus additional predictor variables. In practice, hierarchical linear regression is fairly simple to implement as long as you are already comfortable with multiple linear regression.
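To make the steps/blocks logic concrete, here is a minimal sketch in R using simulated placeholder data; the object names (dat, outcome, x1-x4) are illustrative only and are not used elsewhere in this chapter.

# Illustrative sketch with simulated placeholder data
set.seed(123)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                  x3 = rnorm(100), x4 = rnorm(100))
dat$outcome <- dat$x1 + dat$x2 + .5*dat$x3 + rnorm(100)

# Step 1 (block 1): baseline predictors only
m1 <- lm(outcome ~ x1 + x2, data = dat)

# Step 2 (block 2): full model adds the predictors of interest
m2 <- lm(outcome ~ x1 + x2 + x3 + x4, data = dat)

# Nested model comparison: do x3 and x4 explain incremental variance?
anova(m1, m2)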

59.1.1.1 Statistical Assumptions

Because we will be performing hierarchical linear regression using a series of multiple linear regression models, to estimate and interpret those models we need to make sure we have satisfied the statistical assumptions of multiple linear regression, which are described here.

59.1.1.2 Statistical Significance

For information regarding statistical significance in the context of multiple linear regression, please refer to the corresponding section from the chapter on incremental validity.

59.1.1.3 Practical Significance

For information regarding practical significance in the context of multiple linear regression, please refer to the corresponding section from the chapter on incremental validity.

59.1.2 Conceptual Videos

For a more in-depth review of pay determinants, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/EMC2aBvjB2s

For a more in-depth review of hierarchical linear regression, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/Ney5uM6sWgY

59.2 Tutorial

This chapter’s tutorial demonstrates how to perform hierarchical linear regression in R.

59.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/OGXX3ldUMoI

59.2.2 Functions & Packages Introduced

Function     Package
lm           base R
summary      base R
nrow         base R
model.frame  base R
anova        base R
tapply       base R
mean         base R

59.2.3 Initial Steps

If you haven’t already, save the file called “PayDeterminants.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different from the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data file called “PayDeterminants.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2023). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
pd <- read_csv("PayDeterminants.csv")
## Rows: 132 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (3): raise, educ, perf
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(pd)
## [1] "raise" "educ"  "perf"  "sex"   "race"
# Print variable type for each variable in data frame (tibble) object
str(pd)
## spc_tbl_ [132 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ raise: num [1:132] 2.4 1.7 2.6 1.1 3.6 2.4 1 0.6 2.2 2.5 ...
##  $ educ : num [1:132] 3 3 5 5 4 2 2 2 4 4 ...
##  $ perf : num [1:132] 7.1 3.8 10 4 10 8.9 9.9 5.4 7.4 7.5 ...
##  $ sex  : chr [1:132] "female" "female" "male" "female" ...
##  $ race : chr [1:132] "asian" "black" "black" "black" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   raise = col_double(),
##   ..   educ = col_double(),
##   ..   perf = col_double(),
##   ..   sex = col_character(),
##   ..   race = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Print first 6 rows of data frame (tibble) object
head(pd)
## # A tibble: 6 × 5
##   raise  educ  perf sex    race 
##   <dbl> <dbl> <dbl> <chr>  <chr>
## 1   2.4     3   7.1 female asian
## 2   1.7     3   3.8 female black
## 3   2.6     5  10   male   black
## 4   1.1     5   4   female black
## 5   3.6     4  10   female black
## 6   2.4     2   8.9 male   black
# Print number of rows in data frame (tibble) object
nrow(pd)
## [1] 132

There are 5 variables and 132 cases (i.e., employees) in the pd data frame: raise, educ, perf, sex, and race. Per the output of the str (structure) function above, the raise, educ, and perf variables are of type numeric, and the sex and race variables are of type character. The raise variable represents the percentage annual raise each employee received. The educ variable is ordinal but numerically coded and can be treated as a continuous variable in this context, where 1 = high school diploma or GED, 2 = some college, 3 = associate’s degree, 4 = bachelor’s degree, and 5 = graduate certificate or degree. The perf variable contains the supervisor-rated annual performance rating scores, which can range from 1-10, with 10 indicating very high performance. The sex variable refers to employee sex (female, male). The race variable contains three levels (asian, black, white).
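Before modeling, it can also be helpful to take a quick descriptive look at the data. The following is a supplemental check (not a step from the video tutorial) using base R functions.

# Supplemental: quick descriptives for the focal variables
summary(pd$raise)       # distribution of percentage raises
table(pd$sex, pd$race)  # cell counts for sex by race/ethnicity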

59.2.4 Perform Hierarchical Linear Regression

I will demonstrate one approach to running hierarchical linear regression models, which is fairly straightforward to implement. The approach requires the use of the lm and anova functions from base R. Specifically, I will demonstrate hierarchical linear regression using a two-step (i.e., two-block) example. For both steps/blocks, the percentage pay raise (raise) will be the outcome variable. In the first step/block, we will specify a model in which education level (educ) and performance level (perf) are the predictor variables, as these are expected to be acceptable determinants of pay raise percentage. In the second step/block, we will specify the full model in which sex (sex) and race/ethnicity (race) are added to the model as additional predictor variables, where sex and race are considered unacceptable determinants of pay raise percentage. Our goal will be to determine whether the sex and race variables collectively explain a statistically significant amount of incremental variance in the outcome variable raise after accounting for the variance explained by educ and perf. If we find that sex and race explain significant incremental variance in raise, then we potentially have evidence of managers exhibiting implicit or explicit bias when determining pay raise percentages for employees.

Note: To save space, I don’t review the statistical assumption testing that should be conducted when estimating a multiple linear regression model. When analyzing real data, be sure to carry out those diagnostic tests.
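As a pointer, base R’s plot method for lm objects produces the standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). The minimal sketch below assumes you run it after estimating one of the model objects created later in this tutorial (e.g., step2).

# Supplemental: diagnostic plots for a fitted lm model
# (run after estimating the step2 model below)
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(step2)           # residuals vs. fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # reset the plotting grid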

59.2.4.1 Step One (Block One)

In the first step (first block), we will specify a multiple linear regression model with raise as the outcome variable and educ and perf as the predictor variables. We are going to use the lm (linear model) function from base R instead of the Regression or reg.brief functions from lessR because the models we create using the lm function are a bit easier to work with (in my opinion) when it comes to hierarchical linear regression. Using the <- assignment operator, let’s name the model object step1 to reference the fact that it is our first step in the hierarchical linear regression process. Next, type the name of the lm function. As the first parenthetical argument, specify the following regression model: raise ~ educ + perf. Note that the tilde ~ symbol separates the outcome variable (left of the tilde) from the predictor variables (right of the tilde). As the second argument, type data= followed by the name of the data frame object (pd) in which our model variables can be found. On a subsequent line, type the name of the summary function from base R, with the name of our regression model object (step1) as the sole parenthetical argument. Finally, to determine the number of cases retained for the analysis (which becomes relevant if any data were missing on the focal variables), use the nrow function from base R with the model.frame function from base R as an argument. As the sole argument for the model.frame function, type the name of the regression model object (step1).

# Step 1 Model: Regress raise on educ and perf
step1 <- lm(raise ~ educ + perf, data=pd)
summary(step1)
## 
## Call:
## lm(formula = raise ~ educ + perf, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02060 -0.51293 -0.06175  0.51531  2.32785 
## 
## Coefficients:
##             Estimate Std. Error t value          Pr(>|t|)    
## (Intercept) -0.59071    0.33450  -1.766          0.079767 .  
## educ         0.21700    0.06016   3.607          0.000441 ***
## perf         0.29603    0.03607   8.207 0.000000000000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7999 on 129 degrees of freedom
## Multiple R-squared:  0.3837, Adjusted R-squared:  0.3742 
## F-statistic: 40.16 on 2 and 129 DF,  p-value: 0.00000000000002752
# Number of cases retained for the analysis
nrow(model.frame(step1))
## [1] 132

Let’s review the findings from the first step of our hierarchical linear regression. Using a sample of 132 employees, we estimated the first step in a two-step hierarchical linear regression. Specifically, we regressed the percentage pay raise received by employees (raise) on their education level (educ) and supervisor-rated performance (perf). We found that both education level (b = .217, p < .001) and performance (b = .296, p < .001) were significantly and positively associated with the percentage raise received when controlling for one another. The (unadjusted multiple) R2 and adjusted R2 values are .384 and .374 (F = 40.16, p < .001); focusing on the unadjusted R2 value, approximately 38% of the variance in percentage pay raise received by employees is explained by their education level and supervisor-rated performance scores. Because the F-value associated with the model is statistically significant (p < .001), we can conclude that this model fits the data better than a model containing no predictor variables.
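If you would also like interval estimates to accompany these coefficients, base R’s confint function will compute them from the fitted model object; this is a supplemental sketch rather than a step from the video tutorial.

# Supplemental: 95% confidence intervals for the Step 1 coefficients
confint(step1, level = .95)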

59.2.4.2 Step Two (Block Two)

In the second step (second block), we will specify a multiple linear regression model with raise as the outcome variable and educ and perf as the predictor variables, but we will also add the predictor variables of sex and race. For this model, let’s name the model object step2.

# Step 2 Model: Regress raise on educ, perf, sex, and race
step2 <- lm(raise ~ educ + perf + sex + race, data=pd)
summary(step2)
## 
## Call:
## lm(formula = raise ~ educ + perf + sex + race, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65048 -0.42558 -0.00791  0.45156  2.32498 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.22478    0.29597  -4.138     0.00006355901993 ***
## educ         0.21890    0.04959   4.414     0.00002159553969 ***
## perf         0.28282    0.02988   9.464 < 0.0000000000000002 ***
## sexmale      0.16160    0.11633   1.389             0.167236    
## raceblack    0.58704    0.14852   3.953             0.000128 ***
## racewhite    1.18247    0.15405   7.676     0.00000000000394 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6583 on 126 degrees of freedom
## Multiple R-squared:  0.5923, Adjusted R-squared:  0.5761 
## F-statistic: 36.61 on 5 and 126 DF,  p-value: < 0.00000000000000022
# Number of cases retained for the analysis
nrow(model.frame(step2))
## [1] 132

Let’s review the findings from the second step of our hierarchical linear regression. Using the same sample of 132 employees as the first step, we regressed the percentage pay raise received by employees (raise) on their education level (educ), supervisor-rated performance (perf), sex (sex), and race/ethnicity (race). Both education level (b = .219, p < .001) and performance (b = .283, p < .001) remained significantly and positively associated with the percentage raise received when controlling for one another and the sex and race/ethnicity variables. On the one hand, employee sex was not significantly associated with the percentage pay raise received when controlling for the other variables in the model (b = .162, p = .167). On the other hand, employee race/ethnicity was significantly associated with percentage pay raise received. Because the employee race/ethnicity variable (race) is categorical and has three or more levels (i.e., asian, black, white), the lm function uses a process called dummy coding (contrast coding) to make comparisons between the different races/ethnicities with respect to the raise outcome variable. For more information on dummy coding, see the section called Optional: More Information on Dummy Coding below. The results indicate that black employees received significantly higher percentage pay raises than asian employees (b = .587, p < .001) when controlling for other predictor variables in the model, and likewise white employees received significantly higher percentage pay raises than asian employees (b = 1.182, p < .001). Thus, there is evidence that employee race/ethnicity influenced manager decisions when it came to determining the pay raise percentage for employees, which could be evidence of implicit or explicit bias. The (unadjusted multiple) R2 and adjusted R2 values are .592 and .576 (F = 36.61, p < .001); focusing on the unadjusted R2 value, approximately 59% of the variance in percentage pay raise received by employees is explained by their education level, supervisor-rated performance scores, employee sex, and employee race/ethnicity. Because the F-value associated with the model is statistically significant (p < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

59.2.4.3 Compare Nested Models

As a final step in our hierarchical linear regression, let’s use the anova function from base R to do a nested model comparison between the step1 and step2 models. This will tell us whether adding sex and race to the model significantly improved model fit (i.e., explained incrementally more variance in the outcome) over and above educ and perf. Type the name of the anova function. As the first argument, type the name of our first step model object (step1), and as the second argument, type the name of our second step model object (step2).

# Nested Model Comparison: Step 1 vs. Step 2
anova(step1, step2)
## Analysis of Variance Table
## 
## Model 1: raise ~ educ + perf
## Model 2: raise ~ educ + perf + sex + race
##   Res.Df    RSS Df Sum of Sq      F           Pr(>F)    
## 1    129 82.541                                         
## 2    126 54.604  3    27.936 21.488 0.00000000002637 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and its associated p-value (the latter of which is less than the conventional cutoff of .05) indicate that the second step model fits the data significantly better than the first step model. In other words, sex and race explain incremental variance in raise beyond the variance already explained by educ and perf. In the context of a pay raise, this is of course concerning because an employee’s sex or race/ethnicity should not factor into the percentage raise they receive.

Finally, let’s reference the summary output for each model, and specifically the r.squared values, to determine what the change in R2 was from the first step to the second step. The change in R2 gives us an idea of how much (in terms of practical significance) the model fit improved when adding the additional variables (i.e., sex, race). Note that we use the $ symbol to indicate that we want the r.squared value from the summary(step2) and summary(step1) output, and then we do a simple subtraction to get the change in R2. Note that we typically only interpret the change in R2 if the model comparison indicates that the two models fit the data significantly differently from one another, based on the F-value and its associated p-value.

# Change in R-squared
summary(step2)$r.squared - summary(step1)$r.squared
## [1] 0.2085752

As you can see, the change in R2 is .209, which indicates that sex and race explain approximately 21% additional variance in raise beyond the variance already explained by educ and perf. If we consider our conventional rules of thumb for interpreting R2 as an effect size (i.e., indicator of practical significance), shown in the table below, this is a medium-to-large change, which is impressive in terms of magnitude (and deeply concerning in this specific context).

R2    Description
.01   Small
.09   Medium
.25   Large
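As a supplemental effect-size metric (not part of the original tutorial), Cohen’s f-squared for the incremental variance explained can be computed directly from the two models’ R2 values using the formula f2 = (R2_full - R2_reduced) / (1 - R2_full).

# Supplemental: Cohen's f-squared for the incremental variance explained
r2_full <- summary(step2)$r.squared     # R-squared from the Step 2 model
r2_reduced <- summary(step1)$r.squared  # R-squared from the Step 1 model
(r2_full - r2_reduced) / (1 - r2_full)  # f-squared for the added predictors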

59.2.4.4 Optional: More Information on Dummy Coding

Dummy coding (contrast coding) is used in regression when we have a categorical predictor variable. To illustrate dummy coding, I reference the results from Step 2 above. For a simple two-level categorical variable such as sex (female, male), just one dummy variable is needed to represent sex. By default, the lm function uses the level/category that comes first alphabetically as the reference group, setting it equal to 0 (e.g., female = 0); in turn, the level/category that comes next alphabetically is used as the focal group and is set equal to 1 (e.g., male = 1). Using this coding, a positive and significant regression coefficient for male (i.e., sexmale) would indicate that male employees have higher scores/values on the outcome variable. Things get a little trickier when we have a categorical variable with three or more categories/levels. Using the present findings as an example, behind the scenes the lm function has created two dichotomous “dummy” variables for the race variable, each of which codes its focal level/category (e.g., black for raceblack) as 1 and all other levels/categories (e.g., asian, white) as 0. The number of dummy variables created for a categorical variable is equal to the number of categories/levels (k) minus 1 (k - 1); in this example, we have three categories/levels (asian, black, white), and thus two dummy variables were created behind the scenes. Only two dummy variables are needed for a categorical variable with three levels/categories because the third dummy variable’s values can be inferred from the other two. In this example, because the asian level/category comes first alphabetically, it is treated as the reference group, which means it is absorbed into the intercept/constant value of -1.22478; note that in a model with other predictor variables, the intercept represents the model-implied outcome value for the reference group(s) when all other predictor variables equal zero. We can interpret each dummy variable regression coefficient as the difference between the focal level/category (e.g., black) and the reference level/category (e.g., asian) on the outcome variable when controlling for other variables in the model. Thus, in the second step model above, black employees received significantly higher percentage pay raises than asian employees (b = .587, p < .001) when controlling for other predictor variables in the model, and likewise white employees received significantly higher percentage pay raises than asian employees (b = 1.182, p < .001).
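If you want a different reference group than the alphabetical default, base R’s relevel function lets you choose it. Here is a minimal sketch; race_rl is a hypothetical variable name created purely for illustration.

# Supplemental: change the reference group for dummy coding
# (race_rl is a hypothetical variable name created for illustration)
pd$race_rl <- relevel(factor(pd$race), ref = "white")  # make white the reference
summary(lm(raise ~ educ + perf + sex + race_rl, data = pd))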

Let’s dig deeper into dummy coding and illustrate its connection to analysis of variance (ANOVA). For solely explanatory purposes, let’s take a look at a model that only contains the dummy variables for the race variable in relation to the raise variable. Let’s name this model object mod.

# Example model with dummy coding: Regress raise on race
mod <- lm(raise ~ race, data=pd)
summary(mod)
## 
## Call:
## lm(formula = raise ~ race, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78478 -0.59925  0.00645  0.61522  2.31522 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   1.5935     0.1600   9.961 < 0.0000000000000002 ***
## raceblack     0.6228     0.2001   3.113              0.00228 ** 
## racewhite     1.2912     0.2070   6.238        0.00000000586 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8908 on 129 degrees of freedom
## Multiple R-squared:  0.2358, Adjusted R-squared:  0.224 
## F-statistic:  19.9 on 2 and 129 DF,  p-value: 0.00000002925

Note that in this model, the intercept/constant value of 1.594 represents the mean percentage raise for asian employees. The regression coefficient associated with raceblack (b = .623) represents the mean percentage raise for black employees minus the mean percentage raise for asian employees; because the value is positive, it means that black employees have a descriptively higher mean than asian employees. Next, the regression coefficient associated with racewhite (b = 1.291) represents the mean percentage raise for white employees minus the mean percentage raise for asian employees; because the value is positive, it means that white employees also have a descriptively higher mean than asian employees.

We can confirm that the regression coefficients map onto the mean outcome values for each level of the categorical variable by using the tapply and mean functions to calculate the mean of the raise variable for each level of the race variable.

# Mean values on outcome variable for each level of categorical variable
tapply(pd$raise, pd$race, mean)
##    asian    black    white 
## 1.593548 2.216364 2.884783

Note that the intercept/constant value from above, which represents the mean percentage raise for the reference group (asian employees), is 1.594; this is the same as the mean we just calculated. In addition, if we add the regression coefficient for the dummy variable associated with black employees to the mean of the reference group (asian employees), we get approximately 2.216 (1.5935 + .6228 = 2.2163), which matches (within rounding) the mean for black employees we calculated. Finally, if we add the regression coefficient for the dummy variable associated with white employees to the mean of the reference group (asian employees), we get approximately 2.885 (1.5935 + 1.2912 = 2.8847), which matches (within rounding) the mean for white employees we calculated.

Next, using the anova function from base R, we can interpret the regression model (mod) as a one-way ANOVA.

# Display ANOVA results
anova(mod)
## Analysis of Variance Table
## 
## Response: raise
##            Df  Sum Sq Mean Sq F value        Pr(>F)    
## race        2  31.585 15.7927  19.904 0.00000002925 ***
## Residuals 129 102.353  0.7934                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The row in the table associated with race can be interpreted as an omnibus F-test for a one-way ANOVA. As you can see, the race variable explained a significant amount of the variability in raise (F = 19.904, p < .001). However, as with any omnibus test involving a categorical predictor variable with three or more levels, we don’t yet know where the significant mean differences exist. The summary of the regression model above provides us with information about those mean comparisons in the form of regression coefficients. In addition to illustrating how dummy coding works, this also shows that a one-way ANOVA (and other types of ANOVA) can be implemented and interpreted in a regression framework. This just goes to show that ANOVA is a specific type of regression.
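To see the equivalence directly, base R’s aov function estimates the same one-way ANOVA; this supplemental sketch should reproduce the omnibus F-test shown above.

# Supplemental: the same one-way ANOVA estimated with aov
summary(aov(raise ~ race, data = pd))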

59.2.5 Summary

In this chapter, we learned how to perform hierarchical linear regression in order to identify pay determinants. Hierarchical linear regression is a specific application of multiple linear regression in which we build up to a final/full model in what are often called steps or blocks. Specifically, we are comparing nested models. In this tutorial, we learned how to use the lm and anova functions from base R to conduct hierarchical linear regression.

References

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2023. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.