Chapter 59 Identifying Pay Determinants Using Hierarchical Linear Regression

In this chapter, we will learn how to identify pay determinants using hierarchical linear regression.

59.1 Conceptual Overview

Pay determinants refer to any characteristics, behaviors, or conditions that are associated with the amount of pay employees receive. For example, in an organization that monetarily rewards those with longer lengths of service, length of service will likely be a pay determinant. Ideally, pay determinants should be legally defensible and should be related to job performance and other characteristics or behaviors that are valued by the organization. We can use hierarchical linear regression to identify pay determinants and evaluate their incremental validity.

59.1.1 Review of Hierarchical Linear Regression

Hierarchical linear regression is an extension of multiple linear regression, which was introduced in the chapter on incremental validity. The approach simply means that we build up to a full model in what are called steps or blocks, where in each step or block we add more predictor variables to our model until we reach whatever we have specified as our full model. There are different reasons why we might do this. For instance, we might want to know whether, collectively, two or more additional predictor variables explain incremental variance in the outcome variable after accounting for the predictor variables already in the model. In doing so, we compare what are referred to as nested models. A nested model is a model that is missing one or more of the predictor variables of a full (i.e., larger) model, where the full model contains the same predictor variables as the nested model plus additional predictor variables. In practice, hierarchical linear regression is fairly simple to implement as long as you are already comfortable with multiple linear regression.
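To make the steps/blocks logic concrete, here is a minimal sketch in R using simulated placeholder data; the object names (dat, outcome, x1-x4) are illustrative only and are not used elsewhere in this chapter.

# Illustrative sketch with simulated placeholder data
set.seed(123)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                  x3 = rnorm(100), x4 = rnorm(100))
dat$outcome <- dat$x1 + dat$x2 + .5*dat$x3 + rnorm(100)

# Step 1 (block 1): baseline predictors only
m1 <- lm(outcome ~ x1 + x2, data = dat)

# Step 2 (block 2): full model adds the predictors of interest
m2 <- lm(outcome ~ x1 + x2 + x3 + x4, data = dat)

# Nested model comparison: do x3 and x4 explain incremental variance?
anova(m1, m2)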

59.1.1.1 Statistical Assumptions

Because we will be performing hierarchical linear regression using a series of multiple linear regression models, to estimate and interpret those models we need to make sure we have satisfied the statistical assumptions of multiple linear regression, which are described here.

59.1.1.2 Statistical Significance

For information regarding statistical significance in the context of multiple linear regression, please refer to the corresponding section from the chapter on incremental validity.

59.1.1.3 Practical Significance

For information regarding practical significance in the context of multiple linear regression, please refer to the corresponding section from the chapter on incremental validity.

59.1.2 Conceptual Videos

For a more in-depth review of pay determinants, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/EMC2aBvjB2s

For a more in-depth review of hierarchical linear regression, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/Ney5uM6sWgY

59.2 Tutorial

This chapter’s tutorial demonstrates how to perform hierarchical linear regression in R.

59.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/OGXX3ldUMoI

59.2.2 Functions & Packages Introduced

Function     Package
lm           base R
summary      base R
nrow         base R
model.frame  base R
anova        base R
tapply       base R
mean         base R

59.2.3 Initial Steps

If you haven’t already, save the file called “PayDeterminants.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different from the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data file called “PayDeterminants.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2023). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
pd <- read_csv("PayDeterminants.csv")
## Rows: 132 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (3): raise, educ, perf
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(pd)
## [1] "raise" "educ"  "perf"  "sex"   "race"
# Print variable type for each variable in data frame (tibble) object
str(pd)
## spc_tbl_ [132 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ raise: num [1:132] 2.4 1.7 2.6 1.1 3.6 2.4 1 0.6 2.2 2.5 ...
##  $ educ : num [1:132] 3 3 5 5 4 2 2 2 4 4 ...
##  $ perf : num [1:132] 7.1 3.8 10 4 10 8.9 9.9 5.4 7.4 7.5 ...
##  $ sex  : chr [1:132] "female" "female" "male" "female" ...
##  $ race : chr [1:132] "asian" "black" "black" "black" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   raise = col_double(),
##   ..   educ = col_double(),
##   ..   perf = col_double(),
##   ..   sex = col_character(),
##   ..   race = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Print first 6 rows of data frame (tibble) object
head(pd)
## # A tibble: 6 × 5
##   raise  educ  perf sex    race 
##   <dbl> <dbl> <dbl> <chr>  <chr>
## 1   2.4     3   7.1 female asian
## 2   1.7     3   3.8 female black
## 3   2.6     5  10   male   black
## 4   1.1     5   4   female black
## 5   3.6     4  10   female black
## 6   2.4     2   8.9 male   black
# Print number of rows in data frame (tibble) object
nrow(pd)
## [1] 132

There are 5 variables and 132 cases (i.e., employees) in the pd data frame: raise, educ, perf, sex, and race. Per the output of the str (structure) function above, the raise, educ, and perf variables are of type numeric, and the sex and race variables are of type character. The raise variable represents the percentage annual raise each employee received. The educ variable is ordinal but numerically coded and can be treated as a continuous variable in this context, where 1 = high school diploma or GED, 2 = some college, 3 = associate’s degree, 4 = bachelor’s degree, and 5 = graduate certificate or degree. The perf variable contains the supervisor-rated annual performance rating scores, which can range from 1-10, with 10 indicating very high performance. The sex variable refers to employee sex (female, male). The race variable contains three levels (asian, black, white).
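Before modeling, it can also be helpful to take a quick descriptive look at the data. The following is a supplemental check (not a step from the video tutorial) using base R functions.

# Supplemental: quick descriptives for the focal variables
summary(pd$raise)       # distribution of percentage raises
table(pd$sex, pd$race)  # cell counts for sex by race/ethnicity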

59.2.4 Perform Hierarchical Linear Regression

I will demonstrate one approach to running hierarchical linear regression models, which is fairly straightforward to implement. The approach requires the use of the lm and anova functions from base R. Specifically, I will demonstrate hierarchical linear regression using a two-step (i.e., two-block) example. For both steps/blocks, the percentage pay raise (raise) will be the outcome variable. In the first step/block, we will specify a model in which education level (educ) and performance level (perf) are the predictor variables, as these are expected to be acceptable determinants of pay raise percentage. In the second step/block, we will specify the full model in which sex (sex) and race/ethnicity (race) are added to the model as additional predictor variables, where sex and race are considered unacceptable determinants of pay raise percentage. Our goal will be to determine whether the sex and race variables collectively explain a statistically significant amount of incremental variance in the outcome variable raise after accounting for the variance explained by educ and perf. If we find that sex and race explain significant incremental variance in raise, then we potentially have evidence of managers exhibiting implicit or explicit bias when determining pay raise percentages for employees.

Note: To save space, I don’t review the statistical assumption testing that should be conducted when estimating a multiple linear regression model. When analyzing real data, be sure to carry out those diagnostic tests.
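As a pointer, base R’s plot method for lm objects produces the standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). The minimal sketch below assumes you run it after estimating one of the model objects created later in this tutorial (e.g., step2).

# Supplemental: diagnostic plots for a fitted lm model
# (run after estimating the step2 model below)
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(step2)           # residuals vs. fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # reset the plotting grid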

59.2.4.1 Step One (Block One)

In the first step (first block), we will specify a multiple linear regression model with raise as the outcome variable and educ and perf as the predictor variables. We are going to use the lm (linear model) function from base R instead of the Regression or reg.brief functions from lessR because the models we create using the lm function are a bit easier to work with (in my opinion) when it comes to hierarchical linear regression. Using the <- assignment operator, let’s name the model object step1 to reference the fact that it is our first step in the hierarchical linear regression process. Next, type the name of the lm function. As the first parenthetical argument, specify the following regression model: raise ~ educ + perf. Note that the tilde ~ symbol separates the outcome variable (left of the tilde) from the predictor variables (right of the tilde). As the second argument, type data= followed by the name of the data frame object (pd) in which our model variables can be found. On a subsequent line, type the name of the summary function from base R, with the name of our regression model object (step1) as the sole parenthetical argument. Finally, to determine the number of cases retained for the analysis (which becomes relevant if any data were missing on the focal variables), use the nrow function from base R with the model.frame function from base R as an argument. As the sole argument for the model.frame function, type the name of the regression model object (step1).

# Step 1 Model: Regress raise on educ and perf
step1 <- lm(raise ~ educ + perf, data=pd)
summary(step1)
## 
## Call:
## lm(formula = raise ~ educ + perf, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02060 -0.51293 -0.06175  0.51531  2.32785 
## 
## Coefficients:
##             Estimate Std. Error t value          Pr(>|t|)    
## (Intercept) -0.59071    0.33450  -1.766          0.079767 .  
## educ         0.21700    0.06016   3.607          0.000441 ***
## perf         0.29603    0.03607   8.207 0.000000000000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7999 on 129 degrees of freedom
## Multiple R-squared:  0.3837, Adjusted R-squared:  0.3742 
## F-statistic: 40.16 on 2 and 129 DF,  p-value: 0.00000000000002752
# Number of cases retained for the analysis
nrow(model.frame(step1))
## [1] 132

Let’s review the findings from the first step of our hierarchical linear regression. Using a sample of 132 employees, we estimated the first step in a two-step hierarchical linear regression. Specifically, we regressed the percentage pay raise received by employees (raise) on their education level (educ) and supervisor-rated performance (perf). We found that both education level (b = .217, p < .001) and performance (b = .296, p < .001) were significantly and positively associated with the percentage raise received when controlling for one another. The (unadjusted multiple) R2 and adjusted R2 values are .384 and .374 (F = 40.16, p < .001); focusing on the unadjusted R2 value, approximately 38% of the variance in percentage pay raise received by employees is explained by their education level and supervisor-rated performance scores. Because the F-value associated with the model is statistically significant (p < .001), we can conclude that this model fits the data better than a model containing no predictor variables.
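If you would also like interval estimates to accompany these coefficients, base R’s confint function will compute them from the fitted model object; this is a supplemental sketch rather than a step from the video tutorial.

# Supplemental: 95% confidence intervals for the Step 1 coefficients
confint(step1, level = .95)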

59.2.4.2 Step Two (Block Two)

In the second step (second block), we will specify a multiple linear regression model with raise as the outcome variable and educ and perf as the predictor variables, but we will also add the predictor variables of sex and race. For this model, let’s name the model object step2.

# Step 2 Model: Regress raise on educ, perf, sex, and race
step2 <- lm(raise ~ educ + perf + sex + race, data=pd)
summary(step2)
## 
## Call:
## lm(formula = raise ~ educ + perf + sex + race, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65048 -0.42558 -0.00791  0.45156  2.32498 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1.22478    0.29597  -4.138     0.00006355901993 ***
## educ         0.21890    0.04959   4.414     0.00002159553969 ***
## perf         0.28282    0.02988   9.464 < 0.0000000000000002 ***
## sexmale      0.16160    0.11633   1.389             0.167236    
## raceblack    0.58704    0.14852   3.953             0.000128 ***
## racewhite    1.18247    0.15405   7.676     0.00000000000394 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6583 on 126 degrees of freedom
## Multiple R-squared:  0.5923, Adjusted R-squared:  0.5761 
## F-statistic: 36.61 on 5 and 126 DF,  p-value: < 0.00000000000000022
# Number of cases retained for the analysis
nrow(model.frame(step2))
## [1] 132

Let’s review the findings from the second step of our hierarchical linear regression. Using the same sample of 132 employees as the first step, we regressed the percentage pay raise received by employees (raise) on their education level (educ), supervisor-rated performance (perf), sex (sex), and race/ethnicity (race). Both education level (b = .219, p < .001) and performance (b = .283, p < .001) remained significantly and positively associated with the percentage raise received when controlling for one another and the sex and race/ethnicity variables. On the one hand, employee sex was not significantly associated with the percentage pay raise received when controlling for the other variables in the model (b = .162, p = .167). On the other hand, employee race/ethnicity was significantly associated with percentage pay raise received. Because the employee race/ethnicity variable (race) is categorical and has three or more levels (i.e., asian, black, white), the lm function uses a process called dummy coding (contrast coding) to make comparisons between the different races/ethnicities with respect to the raise outcome variable. For more information on dummy coding, see the section called Optional: More Information on Dummy Coding below. The results indicate that black employees received significantly higher percentage pay raises than asian employees (b = .587, p < .001) when controlling for other predictor variables in the model, and likewise white employees received significantly higher percentage pay raises than asian employees (b = 1.182, p < .001). Thus, there is evidence that employee race/ethnicity influenced manager decisions when it came to determining the pay raise percentage for employees, which could be evidence of implicit or explicit bias. The (unadjusted multiple) R2 and adjusted R2 values are .592 and .576 (F = 36.61, p < .001); focusing on the unadjusted R2 value, approximately 59% of the variance in percentage pay raise received by employees is explained by their education level, supervisor-rated performance scores, employee sex, and employee race/ethnicity. Because the F-value associated with the model is statistically significant (p < .001), we can conclude that this model fits the data better than a model containing no predictor variables.

59.2.4.3 Compare Nested Models

As a final step in our hierarchical linear regression, let’s use the anova function from base R to do a nested model comparison between the step1 and step2 models. This will tell us whether adding sex and race to the model significantly improved model fit (i.e., explained incrementally more variance in the outcome) over and above educ and perf. Type the name of the anova function. As the first argument, type the name of our first step model object (step1), and as the second argument, type the name of our second step model object (step2).

# Nested Model Comparison: Step 1 vs. Step 2
anova(step1, step2)
## Analysis of Variance Table
## 
## Model 1: raise ~ educ + perf
## Model 2: raise ~ educ + perf + sex + race
##   Res.Df    RSS Df Sum of Sq      F           Pr(>F)    
## 1    129 82.541                                         
## 2    126 54.604  3    27.936 21.488 0.00000000002637 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and its associated p-value (the latter of which is less than the conventional cutoff of .05) indicate that the second step model fits the data significantly better than the first step model. In other words, sex and race explain incremental variance in raise beyond the variance already explained by educ and perf. In the context of a pay raise, this is of course concerning because an employee’s sex or race/ethnicity should not factor into the percentage raise they receive.

Finally, let’s reference the summary output for each model, and specifically the r.squared values, to determine what the change in R2 was from the first step to the second step. The change in R2 gives us an idea of how much (in terms of practical significance) the model fit improved when adding the additional variables (i.e., sex, race). Note that we use the $ symbol to indicate that we want the r.squared value from the summary(step2) and summary(step1) output, and then we do a simple subtraction to get the change in R2. Note that we typically only interpret the change in R2 if the model comparison indicates that the two models fit the data significantly differently from one another, based on the F-value and its associated p-value.

# Change in R-squared
summary(step2)$r.squared - summary(step1)$r.squared
## [1] 0.2085752

As you can see, the change in R2 is .209, which indicates that sex and race explain approximately 21% additional variance in raise beyond the variance already explained by educ and perf. If we consider our conventional rules of thumb for interpreting R2 as an effect size (i.e., indicator of practical significance), shown in the table below, this is a medium-to-large change, which is impressive in terms of magnitude (and deeply concerning in this specific context).

R2    Description
.01   Small
.09   Medium
.25   Large
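As a supplemental effect-size metric (not part of the original tutorial), Cohen’s f-squared for the incremental variance explained can be computed directly from the two models’ R2 values using the formula f2 = (R2_full - R2_reduced) / (1 - R2_full).

# Supplemental: Cohen's f-squared for the incremental variance explained
r2_full <- summary(step2)$r.squared     # R-squared from the Step 2 model
r2_reduced <- summary(step1)$r.squared  # R-squared from the Step 1 model
(r2_full - r2_reduced) / (1 - r2_full)  # f-squared for the added predictors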

59.2.4.4 Optional: More Information on Dummy Coding

Dummy coding (contrast coding) is used in regression when we have a categorical predictor variable. To illustrate dummy coding, I reference the results from Step 2 above. For a simple two-level categorical variable such as sex (female, male), just one dummy variable is needed to represent sex. By default, the lm function uses the level/category that comes first alphabetically as the reference group, setting it equal to 0 (e.g., female = 0); in turn, the level/category that comes next alphabetically is used as the focal group and is set equal to 1 (e.g., male = 1). Using this coding, a positive and significant regression coefficient for male (i.e., sexmale) would indicate that male employees have higher scores/values on the outcome variable. Things get a little trickier when we have a categorical variable with three or more categories/levels. Using the present findings as an example, behind the scenes the lm function has created two dichotomous “dummy” variables for the race variable, each of which codes its focal level/category (e.g., black for raceblack) as 1 and all other levels/categories (e.g., asian, white) as 0. The number of dummy variables created for a categorical variable is equal to the number of categories/levels (k) minus 1 (k - 1); in this example, we have three categories/levels (asian, black, white), and thus two dummy variables were created behind the scenes. Only two dummy variables are needed for a categorical variable with three levels/categories because the third dummy variable’s values can be inferred from the other two. In this example, because the asian level/category comes first alphabetically, it is treated as the reference group, which means it is absorbed into the intercept/constant value of -1.22478; note that in a model with other predictor variables, the intercept represents the model-implied outcome value for the reference group(s) when all other predictor variables equal zero. We can interpret each dummy variable regression coefficient as the difference between the focal level/category (e.g., black) and the reference level/category (e.g., asian) on the outcome variable when controlling for other variables in the model. Thus, in the second step model above, black employees received significantly higher percentage pay raises than asian employees (b = .587, p < .001) when controlling for other predictor variables in the model, and likewise white employees received significantly higher percentage pay raises than asian employees (b = 1.182, p < .001).
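If you want a different reference group than the alphabetical default, base R’s relevel function lets you choose it. Here is a minimal sketch; race_rl is a hypothetical variable name created purely for illustration.

# Supplemental: change the reference group for dummy coding
# (race_rl is a hypothetical variable name created for illustration)
pd$race_rl <- relevel(factor(pd$race), ref = "white")  # make white the reference
summary(lm(raise ~ educ + perf + sex + race_rl, data = pd))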

Let’s dig deeper into dummy coding and illustrate its connection to analysis of variance (ANOVA). For solely explanatory purposes, let’s take a look at a model that only contains the dummy variables for the race variable in relation to the raise variable. Let’s name this model object mod.

# Example model with dummy coding: Regress raise on race
mod <- lm(raise ~ race, data=pd)
summary(mod)
## 
## Call:
## lm(formula = raise ~ race, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78478 -0.59925  0.00645  0.61522  2.31522 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   1.5935     0.1600   9.961 < 0.0000000000000002 ***
## raceblack     0.6228     0.2001   3.113              0.00228 ** 
## racewhite     1.2912     0.2070   6.238        0.00000000586 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8908 on 129 degrees of freedom
## Multiple R-squared:  0.2358, Adjusted R-squared:  0.224 
## F-statistic:  19.9 on 2 and 129 DF,  p-value: 0.00000002925

Note that in this model, the intercept/constant value of 1.594 represents the mean percentage raise for asian employees. The regression coefficient associated with raceblack (b = .623) represents the mean percentage raise for black employees minus the mean percentage raise for asian employees; because the value is positive, it means that black employees have a descriptively higher mean than asian employees. Next, the regression coefficient associated with racewhite (b = 1.291) represents the mean percentage raise for white employees minus the mean percentage raise for asian employees; because the value is positive, it means that white employees also have a descriptively higher mean than asian employees.

We can confirm that the regression coefficients map onto the mean outcome values for each level of the categorical variable by using the tapply and mean functions to calculate the mean of the raise variable for each level of the race variable.

# Mean values on outcome variable for each level of categorical variable
tapply(pd$raise, pd$race, mean)
##    asian    black    white 
## 1.593548 2.216364 2.884783

Note that the intercept/constant value from above, which represents the mean percentage raise for the reference group (asian employees), is 1.594; this is the same as the mean we just calculated. In addition, if we add the regression coefficient for the dummy variable associated with black employees to the mean of the reference group (asian employees), we get approximately 2.216 (1.5935 + .6228 = 2.2163), which matches (within rounding) the mean for black employees we calculated. Finally, if we add the regression coefficient for the dummy variable associated with white employees to the mean of the reference group (asian employees), we get approximately 2.885 (1.5935 + 1.2912 = 2.8847), which matches (within rounding) the mean for white employees we calculated.

Next, using the anova function from base R, we can interpret the regression model (mod) as a one-way ANOVA.

# Display ANOVA results
anova(mod)
## Analysis of Variance Table
## 
## Response: raise
##            Df  Sum Sq Mean Sq F value        Pr(>F)    
## race        2  31.585 15.7927  19.904 0.00000002925 ***
## Residuals 129 102.353  0.7934                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The row in the table associated with race can be interpreted as an omnibus F-test for a one-way ANOVA. As you can see, the race variable explained a significant amount of the variability in raise (F = 19.904, p < .001). However, as with any omnibus test involving a categorical predictor variable with three or more levels, we don’t yet know where the significant mean differences exist. The summary of the regression model above provides us with information about those mean comparisons in the form of regression coefficients. In addition to illustrating how dummy coding works, this also shows that a one-way ANOVA (and other types of ANOVA) can be implemented and interpreted in a regression framework. This just goes to show that ANOVA is a specific type of regression.
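To see the equivalence directly, base R’s aov function estimates the same one-way ANOVA; this supplemental sketch should reproduce the omnibus F-test shown above.

# Supplemental: the same one-way ANOVA estimated with aov
summary(aov(raise ~ race, data = pd))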

59.2.5 Summary

In this chapter, we learned how to perform hierarchical linear regression in order to identify pay determinants. Hierarchical linear regression is a specific application of multiple linear regression in which we build up to a final/full model in what are often called steps or blocks. Specifically, we are comparing nested models. In this tutorial, we learned how to use the lm and anova functions from base R to conduct hierarchical linear regression.

References

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2023. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.