Chapter 21 Centering & Standardizing Variables
In this chapter, we will learn how to center and standardize variables.
21.1 Conceptual Overview
Centering or standardizing variables can be a useful data preparation step. For example, we often center predictor variables prior to specifying a product term (i.e., interaction term) when estimating moderation effects in a multiple linear regression model.
21.1.1 Review of Centering Variables
Centering is the process of subtracting the variable mean (average) from each of the values of that same variable; in other words, it’s a linear rescaling of a variable. Centering variables is sometimes completed prior to including those variables as predictors in a regression model, and it is generally done for one or both of the following purposes: (a) to make the intercept value more interpretable, and (b) to reduce collinearity between two or more predictor variables that are subsequently multiplied to create an interaction term (product term) when estimating a moderated multiple linear regression model or polynomial regression model, for example. Regarding the first purpose, centering to enhance the interpretability of the intercept (constant) value in a regression model is relevant to the extent that we wish to interpret the intercept value. In an ordinary least squares (OLS) regression model, the intercept value represents the mean of the outcome variable when all predictor variables are set to zero. If the scaling of any of the predictor variables in our model does not include zero (e.g., predictor variables are on a 1-10 scale), then interpreting the intercept value when the predictors are zero doesn’t make much sense. Regarding the second purpose, it is important to center predictor variables prior to using them to create a multiplicative term, such as an interaction term (i.e., product term) or a polynomial term (e.g., quadratic term, cubic term). In the context of a moderated multiple linear regression model, centering does not affect the significance of the interaction term, but a lack of centering will affect the interpretation of the main effects.
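To make the last point concrete, here is a minimal sketch using simulated data. The variable names (x, z, y) and the data-generating values are hypothetical and are not part of this chapter's data file. The sketch fits the same moderated regression with raw and with grand-mean centered predictors so that the coefficients can be compared side by side.

```r
# Simulated data; variable names and values are hypothetical
set.seed(123)
n <- 200
x <- rnorm(n, mean = 6, sd = 1.5)   # predictor that never takes a value near zero
z <- rnorm(n, mean = 40, sd = 8)    # moderator
y <- 2 + 0.5 * x + 0.1 * z + 0.05 * x * z + rnorm(n)

# Grand-mean center both predictors
xc <- x - mean(x)
zc <- z - mean(z)

# Fit the moderated regression with raw and with centered predictors
m_raw <- lm(y ~ x * z)
m_cen <- lm(y ~ xc * zc)

# The product-term coefficient is the same in both models
coef(m_raw)["x:z"]
coef(m_cen)["xc:zc"]

# The intercept and lower-order coefficients differ between the models
coef(m_raw)
coef(m_cen)

# Correlation of a predictor with the product term, before and after centering
cor(x, x * z)
cor(xc, xc * zc)
```

In the centered model, the intercept is the predicted outcome at the means of both predictors, and each lower-order coefficient is the slope of that predictor when the other predictor is at its mean.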
Thus far, I have been mostly referring to what is known as grand-mean centering. To grand-mean center a variable, we simply subtract the overall (grand) mean of the entire sample for that variable from each value of that variable, thereby creating a new variable in which the mean is zero and the standard deviation is the same as it was before centering. For example, let’s assume that we have a variable called Age for a sample of individuals, where Age is measured in years. In its raw format, the mean of Age is 36.2 years with a standard deviation of 8.1. If we grand-mean center Age, then for each individual in our sample, we create a new variable in which we take their Age and subtract the mean Age of 36.2. If an individual has an Age of 40.0, then their centered Age would be 3.8 (40.0 - 36.2 = 3.8). By centering each individual’s Age relative to the grand mean, we end up with a variable that has a mean of 0.0 and a standard deviation of 8.1, the same as the standard deviation of the original Age variable. Why is this the case? Well, we have just performed a linear shift of Age, which affects only the mean and not the standard deviation.
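The following sketch illustrates the same idea with a handful of made-up ages. The values are hypothetical and were chosen only so that their mean is 36.2, matching the example above; the standard deviation of this small sample will not equal 8.1.

```r
# Hypothetical ages in years (made-up values for illustration)
age <- c(25.0, 31.5, 36.0, 40.0, 48.5)

# Grand-mean center: subtract the sample mean from every value
c_age <- age - mean(age)

mean(age)    # mean of the raw values
mean(c_age)  # mean of the centered values (zero, up to rounding error)
sd(age)      # standard deviation of the raw values
sd(c_age)    # standard deviation is unchanged by the shift

c_age[4]     # the person aged 40.0 is 3.8 years above the mean
```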
In the context of multilevel models, group-mean centering becomes relevant and important. In short, group-mean centering refers to the process of subtracting the respective group mean for a particular variable (based on another variable that acts as a grouping or clustering variable) from each case’s score on that same variable. For example, if Employee A belongs to Work Team A, and Work Team A consists of 10 other employees, we would first calculate the mean of Work Team A employees’ scores on a continuous variable of interest, and second, we would subtract that group mean from each Work Team A employee’s score on that continuous variable. We would then repeat this process for all employees relative to their respective work teams. In the context of multilevel modeling, both grand-mean centering and group-mean centering can have pronounced effects on the estimated coefficients and the interpretation of those coefficients. For instance, group-mean centering can be used in multilevel models to separate out the within-group effects and the between-groups effects, if that is of interest. A full discussion of grand-mean centering and group-mean centering in the context of multilevel modeling is beyond the scope of this tutorial; for a more complete overview, please check out Professor Jason Newsom’s handout: http://web.pdx.edu/~newsomj/mlrclass/ho_centering.pdf.
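As a rough illustration of the difference between the two kinds of centering, the sketch below uses a small made-up data frame (the object name teams and its columns are hypothetical). It relies on the ave function from base R, which is not the approach used later in this chapter, simply because it needs no additional packages.

```r
# Hypothetical data: six employees nested in two work teams
teams <- data.frame(
  team  = c("A", "A", "A", "B", "B", "B"),
  score = c(4, 5, 6, 8, 9, 10)
)

# ave() returns each row's group mean (its default FUN is mean)
teams$team_mean <- ave(teams$score, teams$team)

# Group-mean centered score: deviation from the employee's own team mean
teams$gpmc_score <- teams$score - teams$team_mean

# Grand-mean centered score, for comparison
teams$c_score <- teams$score - mean(teams$score)

teams
```

Note that two employees with the same group-mean centered score (e.g., the first employee on each team) can have very different grand-mean centered scores, because the former ignores differences between teams.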
21.1.2 Review of Standardizing Variables
Like centering variables, when standardizing (or scaling) variables, we center the variables around a mean of zero. However, when standardizing a variable, we are actually converting the variable to a z-score, which means we set the mean to 0 and the variance to 1; because the standard deviation is just the square root of the variance, the standard deviation is also set to 1. So how do you interpret a variable that is standardized? Let’s assume that we standardized a variable called Age for a sample of individuals, where Age is measured in years. In its raw format, the mean of Age is 36.2 years with a standard deviation of 8.1, which would mean, for example, that a person who is 44.3 years old has an Age that is exactly 1 standard deviation higher than the mean (44.3 - 8.1 = 36.2). If we standardize the Age variable, then the mean becomes 0.0 and the standard deviation becomes 1.0. Accordingly, the standardized score for the person who has an Age of 44.3 years (which was 1 standard deviation above the mean) would become 1.0. If a person has an Age of 20.0, then that means they have a standardized score of -2.0, which represents 2.0 standard deviations below the mean (20.0 - 36.2 = -16.2 and -16.2 / 8.1 = -2.0).
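The arithmetic above can be sketched in a few lines of R. The mean and standard deviation come from the example in the text; the small age vector is hypothetical and is included only to show the same calculation applied to a whole variable.

```r
# Values taken from the example in the text
age_mean <- 36.2
age_sd   <- 8.1

# z-score = (raw value - mean) / standard deviation
(44.3 - age_mean) / age_sd   # 1 standard deviation above the mean
(20.0 - age_mean) / age_sd   # 2 standard deviations below the mean

# Hypothetical sample to show the same calculation on a whole vector
age   <- c(25.0, 31.5, 36.0, 40.0, 48.5)
z_age <- (age - mean(age)) / sd(age)

mean(z_age)  # zero, up to rounding error
sd(z_age)    # 1
```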
21.2 Tutorial
This chapter’s tutorial demonstrates how to center and standardize variables in R.
21.2.1 Video Tutorial
As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below. Please note that the video shows how to grand-mean center and standardize variables – but not how to group-mean center variables. If your goal is to group-mean center variables, then check out the corresponding section below.
Link to video tutorial: https://youtu.be/2_TxnvZGtV0
21.2.3 Initial Steps
If you haven’t already, save the file called “DiffPred.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.
Next, using the setwd
function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.
Next, read in the .csv data file called “DiffPred.csv” using your choice of read function. In this example, I use the read_csv
function from the readr
package (Wickham, Hester, and Bryan 2024). If you choose to use the read_csv
function, be sure that you have installed and accessed the readr
package using the install.packages
and library
functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)
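# Read data and name data frame (tibble) object
df <- read_csv("DiffPred.csv")
# Print variable names in data frame (tibble) object
names(df)
# Print variable names and types
str(df)
# Print first 6 rows of data frame (tibble) object
head(df)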
## Rows: 377 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): emp_id, gender, race
## dbl (3): perf_eval, interview, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "emp_id" "perf_eval" "interview" "age" "gender" "race"
## spc_tbl_ [377 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ emp_id : chr [1:377] "MA322" "MA323" "MA324" "MA325" ...
## $ perf_eval: num [1:377] 4.2 4.9 4.2 5.3 4.2 6.9 3.4 5.8 4.4 5.6 ...
## $ interview: num [1:377] 7.5 9.3 7.5 8 9.3 6.8 6.7 7 8.2 6.4 ...
## $ age : num [1:377] 29.7 31.7 29.4 37.9 30.9 46.2 43.9 47.8 31.7 44.6 ...
## $ gender : chr [1:377] "woman" "man" "woman" "woman" ...
## $ race : chr [1:377] "asian" "asian" "asian" "asian" ...
## - attr(*, "spec")=
## .. cols(
## .. emp_id = col_character(),
## .. perf_eval = col_double(),
## .. interview = col_double(),
## .. age = col_double(),
## .. gender = col_character(),
## .. race = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## # A tibble: 6 × 6
## emp_id perf_eval interview age gender race
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 MA322 4.2 7.5 29.7 woman asian
## 2 MA323 4.9 9.3 31.7 man asian
## 3 MA324 4.2 7.5 29.4 woman asian
## 4 MA325 5.3 8 37.9 woman asian
## 5 MA326 4.2 9.3 30.9 man black
## 6 MA327 6.9 6.8 46.2 woman asian
There are 6 variables and 377 cases (i.e., employees) in the DiffPred
data frame: emp_id
, perf_eval
, interview
, age
, gender
, and race
. Per the output of the str
(structure) function above, the variables perf_eval
, interview
, and age
are of type numeric (continuous: interval/ratio), and the variables emp_id
, gender
, and race
are of type character (nominal/categorical). The variable emp_id
is the unique employee identifier. Imagine that these data were collected as part of a criterion-related validation study - specifically, a concurrent validation design in which job incumbents were administered a rated structured interview (interview
) 90 days after entering the organization. The structured interview (interview
) variable was designed to assess individuals’ interpersonal skills, and ratings can range from 1 (very weak interpersonal skills) to 10 (very strong interpersonal skills). The interviews were scored by untrained raters who were often the hiring managers but not always. The perf_eval
variable is the criterion (outcome) of interest, and it is a 90-day-post-hire measure of supervisor-rated job performance, with possible ratings ranging from 1-7, with 7 indicating high performance. The age
variable represents the job incumbents’ ages (in years). The gender
variable represents the job incumbents’ gender identity and is defined by two levels/categories/values: man and woman. Finally, the race
variable represents the job incumbents’ race/ethnicity and is defined by three levels/categories/values: asian, black, and white.
21.2.4 Grand-Mean Center Variables
We only center variables that are of type numeric and that we conceptualize as having a continuous (interval/ratio) measurement scale. Further, if we’re centering variables prior to inclusion in a regression model, we often only center those variables that we plan on using as predictor variables (and not outcome variables). Thus, in our current data frame, we will grand-mean center just the interview
and age
variables; for more information on which variables to center, check out the chapter on estimating moderation effects in a multiple linear regression model. I will demonstrate three approaches, and you can try all three or just one, as any of the three will work.
21.2.4.1 Option 1: Basic Arithmetic and mean
Function from Base R
We’ll start with what is arguably the most intuitive approach for grand-mean centering. We must begin by coming up with a name for our soon-to-be grand-mean centered variable, and in this example, we will center the interview variable. I typically like to name the centered variable by simply adding the c_ prefix to the existing variable’s name (e.g., c_interview
). Type the name of the data frame object to which the new centered variable will be attached (df
), followed by the $
operator and the name of the new variable we are creating (c_interview
). Next, add the <-
operator to indicate what values you will assign to this new variable. To create a vector of values to assign to this new c_interview
variable, we will subtract the mean (average) score for the original variable (interview
) from each case’s value on the variable. Specifically, enter the name of the data frame object, followed by the $
operator and the name of the original variable (interview
). After that, enter the subtraction symbol (-
). And finally, type the name of the mean
function from base R. As the first argument in the mean
function, enter the name of the data frame object (df
), followed by the $
operator and the name of the original variable (interview
). As the second argument, enter na.rm=TRUE
to indicate that you wish to drop missing values when calculating the grand mean for the sample.
# Grand-mean centering: Using basic arithmetic and the mean function from base R
df$c_interview <- df$interview - mean(df$interview, na.rm=TRUE)
To admire your work, take a look at the first six rows of your data frame object to inspect the new grand-mean centered variable called c_interview
.
## # A tibble: 6 × 7
## emp_id perf_eval interview age gender race c_interview
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245
## 2 MA323 4.9 9.3 31.7 man asian 2.04
## 3 MA324 4.2 7.5 29.4 woman asian 0.245
## 4 MA325 5.3 8 37.9 woman asian 0.745
## 5 MA326 4.2 9.3 30.9 man black 2.04
## 6 MA327 6.9 6.8 46.2 woman asian -0.455
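One quick way to check this kind of centering, and to see why the na.rm=TRUE argument matters, is sketched below. The toy data frame and its values are hypothetical and stand in for the chapter's data so that the example runs on its own.

```r
# Hypothetical data frame with one missing interview score
toy <- data.frame(interview = c(7.5, 9.3, NA, 8.0, 6.8))

# Without na.rm = TRUE the mean is NA, so every centered value would be NA
mean(toy$interview)

# With na.rm = TRUE the mean is computed from the four observed scores
toy$c_interview <- toy$interview - mean(toy$interview, na.rm = TRUE)

# The centered scores average to zero (up to rounding error);
# the row with the missing score stays NA
mean(toy$c_interview, na.rm = TRUE)
toy
```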
21.2.4.2 Option 2: scale
Function from Base R
An alternative approach to grand-mean centering is to use the scale
function from base R. For some, this function might be preferable to the approach described above, but again, it’s really a matter of preference. As an initial step, come up with a name for our soon-to-be grand-mean centered variable, and in this example, we will center the interview variable. As I mentioned above, I typically like to name the centered variable by simply adding the c_ prefix to the existing variable’s name - for example: c_interview
. Now, enter the name of the data frame object to which the new centered variable will be attached (df
), followed by the $
operator and the name of the new variable we are creating (c_interview
). Next, add the <-
operator to indicate what values you will assign to this new variable. To create a vector of values to assign to this new c_interview
variable, begin by typing the name of the scale
function. As the first argument, type the name of the data frame object (df
), followed by the $
operator and the name of the original variable (interview
). As the second argument, type center=TRUE
which instructs the function to grand-mean center the values. As the third argument, type scale=FALSE
to inform the function that you do not wish to scale or standardize the variable you are centering.
# Grand-mean centering: Using scale function from base R
df$c_interview <- scale(df$interview, center=TRUE, scale=FALSE)
Take a look at the first six rows of your data frame object to inspect the new grand-mean centered variable called c_interview
.
## # A tibble: 6 × 7
## emp_id perf_eval interview age gender race c_interview[,1]
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245
## 2 MA323 4.9 9.3 31.7 man asian 2.04
## 3 MA324 4.2 7.5 29.4 woman asian 0.245
## 4 MA325 5.3 8 37.9 woman asian 0.745
## 5 MA326 4.2 9.3 30.9 man black 2.04
## 6 MA327 6.9 6.8 46.2 woman asian -0.455
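The [,1] in the column header above appears because the scale function returns a one-column matrix rather than a plain numeric vector. The values are the same as in Option 1, but if a plain vector is preferred, one possibility is to wrap the call in as.numeric. The sketch below uses a hypothetical toy data frame; the as.numeric step is a suggestion of mine and not part of the chapter's original code.

```r
# Hypothetical interview scores
toy <- data.frame(interview = c(7.5, 9.3, 7.5, 8.0, 6.8))

# scale() returns a one-column matrix with attributes attached
centered_matrix <- scale(toy$interview, center = TRUE, scale = FALSE)
is.matrix(centered_matrix)
attr(centered_matrix, "scaled:center")  # the mean that was subtracted

# Wrapping the call in as.numeric() stores an ordinary numeric vector instead
toy$c_interview <- as.numeric(scale(toy$interview, center = TRUE, scale = FALSE))
is.matrix(toy$c_interview)
```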
21.2.4.3 Option 3: mutate
Function from dplyr
and mean
Function from Base R
This third approach to grand-mean centering variables can come in handy when we want to grand-mean center multiple variables in a single step. With that said, I will begin by showing how to grand-mean center a single variable using this approach, and then we will extend the code/script to involve two variables. We will be using the mutate
function from the dplyr
package (Wickham et al. 2023), so if you haven’t already, be sure to install and access the dplyr
package using the functions below.
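The installation and loading steps presumably mirror those shown earlier for the readr package; a sketch is given here. The install.packages call is commented out on the assumption that dplyr is already installed.

```r
# Install dplyr package if you haven't already
# (commented out so that the package is not reinstalled on every run)
# install.packages("dplyr")

# Access dplyr package
library(dplyr)
```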
In this example, I use the pipe (%>%
) operator from the dplyr
package (and by extension, the magrittr
package). For more information on using pipes with the mutate
function, check out the chapter on cleaning data, and for more information on using pipes in general, check out this free eBook by one of the creators of the dplyr
package and RStudio.
- To begin, type the name of the data frame object you wish to create (or overwrite), and in this example, we are going to overwrite our existing data frame (df) by naming our new data frame object the same thing. To create and name the object, we use the <- assignment operator.
- Type the name of the original data frame to which the variable we wish to grand-mean center belongs (df), followed by the pipe (%>%) operator.
- Type the name of the mutate function. As the first and only argument, begin by typing what you would like to name the new grand-mean centered variable (c_interview). After that, type the = operator to assign values to this new variable. Finally, we will specify a formula to inform the function how the new values will be calculated. Specifically, we type the name of the original variable (interview), type the subtraction (-) operator, and finally type the name of the mean function from base R. As the first argument in the mean function, type the name of the variable (interview) for which you would like to calculate the mean, and as the second argument, enter na.rm=TRUE to indicate that you wish to drop missing values when computing this grand mean for the sample.
# Grand-mean centering: Using mutate function from dplyr and mean function from base R
df <- df %>%
mutate(c_interview = interview - mean(interview, na.rm=TRUE))
Take a look at the first six rows of your data frame object to inspect the new grand-mean centered variable called c_interview
.
## # A tibble: 6 × 7
## emp_id perf_eval interview age gender race c_interview
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245
## 2 MA323 4.9 9.3 31.7 man asian 2.04
## 3 MA324 4.2 7.5 29.4 woman asian 0.245
## 4 MA325 5.3 8 37.9 woman asian 0.745
## 5 MA326 4.2 9.3 30.9 man black 2.04
## 6 MA327 6.9 6.8 46.2 woman asian -0.455
One of the advantages of using this approach is that we can center multiple variables in a single step. To do so, we simply specify which additional variable we would like to grand-mean center by adding an additional argument to the mutate
function.
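A sketch of what centering two variables in one step might look like follows. The toy data frame is hypothetical and stands in for the chapter's data; the name c_age follows the same c_ prefix convention and matches the column that appears in the output later in this chapter.

```r
library(dplyr)

# Hypothetical data frame standing in for the chapter's data
toy <- data.frame(
  interview = c(7.5, 9.3, 7.5, 8.0, 6.8),
  age       = c(29.7, 31.7, 29.4, 37.9, 46.2)
)

# Grand-mean center two variables in one mutate() call,
# separating the two arguments with a comma
toy <- toy %>%
  mutate(
    c_interview = interview - mean(interview, na.rm = TRUE),
    c_age       = age - mean(age, na.rm = TRUE)
  )

toy
```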
21.2.5 Group-Mean Center Variables
When estimating multilevel models, there are certain contexts in which group-mean centering should be applied. For more information on centering in general and group-mean centering specifically, please check out this handout created by my colleague Jason Newsom.
Like we did above with Option 3 for grand-mean centering, for group-mean centering we will also use the mutate
function from dplyr
; however, we will also go a step further by applying the group_by
function from dplyr
to group the data by the values of a categorical (nominal, ordinal) grouping variable. For more information on grouping data, check out the chapter on aggregation and segmentation.
Let’s group-mean center interview
scores by race
variable categories.
- To begin, type the name of the data frame object you wish to create (or overwrite), and in this example, we are going to overwrite our existing data frame (df) by naming our new data frame object the same thing. To create and name the object, we use the <- assignment operator.
- Type the name of the original data frame to which the variable we wish to group-mean center belongs (df), followed by the pipe (%>%) operator.
- Type the name of the group_by function, and as the function’s argument(s), specify the name(s) of the grouping variable(s); in this example, we will group by the race variable. Follow this function with the pipe (%>%) operator.
- Type the name of the mutate function. As the first and only argument, begin by typing what you would like to name the new group-mean centered variable (gpmc_interview). After that, type the = operator to assign values to this new variable. Finally, we will specify a formula to inform the function how the new values will be calculated. Specifically, we type the name of the original variable (interview), type the subtraction (-) operator, and finally type the name of the mean function from base R. As the first argument in the mean function, type the name of the variable (interview) for which you would like to calculate the mean, and as the second argument, enter na.rm=TRUE to indicate that you wish to drop missing values when computing the mean for each group. Because the data are grouped at this point, the mean is calculated separately within each race category. Follow the mutate function with the pipe (%>%) operator.
- Finally, type the name of the ungroup function, and don’t specify any arguments within the function’s parentheses. This last step makes sure that we ungroup the grouping that we initially applied to the data frame.
# Group-mean centering by race variable
df <- df %>%
group_by(race) %>%
mutate(gpmc_interview = interview - mean(interview, na.rm=TRUE)) %>%
ungroup()
Take a look at the first six rows of your data frame object to inspect the new group-mean centered variable called gpmc_interview
.
## # A tibble: 6 × 9
## emp_id perf_eval interview age gender race c_interview c_age gpmc_interview
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245 -7.04 0.730
## 2 MA323 4.9 9.3 31.7 man asian 2.04 -5.04 2.53
## 3 MA324 4.2 7.5 29.4 woman asian 0.245 -7.34 0.730
## 4 MA325 5.3 8 37.9 woman asian 0.745 1.16 1.23
## 5 MA326 4.2 9.3 30.9 man black 2.04 -5.84 0.877
## 6 MA327 6.9 6.8 46.2 woman asian -0.455 9.46 0.0305
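If the goal is to separate within-group and between-groups information, as mentioned in the conceptual overview, it can help to store the group mean itself alongside the group-mean centered score. The sketch below does this with a hypothetical toy data frame; the variable name gpm_interview is my own and does not appear in the chapter's data.

```r
library(dplyr)

# Hypothetical data: interview scores for two groups
toy <- data.frame(
  race      = c("asian", "asian", "asian", "black", "black", "black"),
  interview = c(7.5, 9.3, 6.8, 9.3, 8.0, 7.0)
)

toy <- toy %>%
  group_by(race) %>%
  mutate(
    # Group mean (the between-groups part of each score)
    gpm_interview  = mean(interview, na.rm = TRUE),
    # Deviation from the group mean (the within-group part of each score)
    gpmc_interview = interview - gpm_interview
  ) %>%
  ungroup()

# Each group's centered scores should average to zero (up to rounding error)
toy %>%
  group_by(race) %>%
  summarise(mean_gpmc = mean(gpmc_interview))
```

Adding the two new variables back together reproduces the original interview score for every case.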
21.2.6 Standardize Variables
There are different approaches we can use to standardize a variable. I provide two options below. You can try both options or one. Both will get you to the same end. I suggest picking the one that is most intuitive for you.
21.2.6.1 Option 1: scale
Function from Base R
As our first approach to standardizing (or scaling) a variable, we will use the scale
function from base R. In fact, we can even apply this function within the lm
(linear regression) function from base R to get standardized regression coefficients; for more information on standardized regression coefficients, check out the chapters on predicting criterion scores using simple linear regression and estimating incremental validity using multiple linear regression. Using the scale
function on its own is fairly straightforward when the goal is to standardize. Start by coming up with a name for our soon-to-be standardized variable, and in this example, we will standardize the interview variable. I typically like to name the standardized variable by simply adding the st_ prefix to the existing variable’s name (e.g., st_interview). Now, enter the name of the data frame object to which the new standardized variable will be attached (df
), followed by the $
operator and the name of the new variable we are creating (st_interview
). Next, add the <-
operator to indicate what values you will assign to this new variable. To assign values to this new st_interview
variable, begin by typing the name of the scale
function. As the first and only argument, type the name of the data frame object (df
), followed by the $
operator and the name of the original variable (interview
).
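The steps just described can be sketched as follows. The toy data frame is hypothetical and stands in for the chapter's data so that the example is self-contained; the second half illustrates the earlier remark about applying scale within the lm function.

```r
# Hypothetical data frame standing in for the chapter's data
toy <- data.frame(
  interview = c(7.5, 9.3, 7.5, 8.0, 9.3, 6.8),
  perf_eval = c(4.2, 4.9, 4.2, 5.3, 4.2, 6.9)
)

# Standardize: scale() centers and scales by default
toy$st_interview <- scale(toy$interview)

# Same result as computing the z-scores by hand
z_by_hand <- (toy$interview - mean(toy$interview)) / sd(toy$interview)
all.equal(as.numeric(toy$st_interview), z_by_hand)

# scale() applied inside lm() yields a standardized regression coefficient;
# with a single predictor, the slope equals the Pearson correlation
m_std <- lm(scale(perf_eval) ~ scale(interview), data = toy)
coef(m_std)
cor(toy$interview, toy$perf_eval)
```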
Take a look at your data frame object to inspect the new standardized variable called st_interview
.
## # A tibble: 6 × 10
## emp_id perf_eval interview age gender race c_interview c_age gpmc_interview st_interview[,1]
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245 -7.04 0.730 0.203
## 2 MA323 4.9 9.3 31.7 man asian 2.04 -5.04 2.53 1.70
## 3 MA324 4.2 7.5 29.4 woman asian 0.245 -7.34 0.730 0.203
## 4 MA325 5.3 8 37.9 woman asian 0.745 1.16 1.23 0.618
## 5 MA326 4.2 9.3 30.9 man black 2.04 -5.84 0.877 1.70
## 6 MA327 6.9 6.8 46.2 woman asian -0.455 9.46 0.0305 -0.378
21.2.6.2 Option 2: mutate
Function from dplyr
and scale
Function from Base R
This alternative approach to standardizing variables can come in handy when we want to standardize multiple variables in a single step. With that said, I will begin by showing how to standardize a single variable using this approach, and then we will try two variables. We will use the mutate
function from the dplyr
package, so if you haven’t already, be sure to install and access the dplyr
package using the install.packages and library functions.
As the first step, enter the name of the data frame object you wish to create (or overwrite), and in this example, we are going to overwrite our existing data frame (df
) by naming our new data frame object the same thing. To create and name the object, we use the <-
operator. Second, type the name of the original data frame to which the variable we wish to standardize belongs (df
). Third, type the pipe (%>%
) operator. Fourth, type the name of the mutate
function. As the first and only argument, begin by typing what you would like to call the new standardized variable, which in this case is st_interview. After that, type the = operator, followed by the name of the scale
function from base R. As the first and only argument in the scale
function, type the name of the variable (interview
) that you would like to standardize.
# Standardizing: Using mutate function from dplyr and scale function from base R
df <- df %>%
mutate(st_interview = scale(interview))
Take a look at your data frame object to inspect the new standardized variable called st_interview
.
## # A tibble: 6 × 10
## emp_id perf_eval interview age gender race c_interview c_age gpmc_interview st_interview[,1]
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MA322 4.2 7.5 29.7 woman asian 0.245 -7.04 0.730 0.203
## 2 MA323 4.9 9.3 31.7 man asian 2.04 -5.04 2.53 1.70
## 3 MA324 4.2 7.5 29.4 woman asian 0.245 -7.34 0.730 0.203
## 4 MA325 5.3 8 37.9 woman asian 0.745 1.16 1.23 0.618
## 5 MA326 4.2 9.3 30.9 man black 2.04 -5.84 0.877 1.70
## 6 MA327 6.9 6.8 46.2 woman asian -0.455 9.46 0.0305 -0.378
One of the advantages of using this approach is that we can standardize multiple variables in a single step. To do so, we simply specify which additional variable we would like to standardize by adding an additional argument to the mutate
function.
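A sketch of standardizing two variables in one step follows. The toy data frame and the name st_age are hypothetical, and wrapping scale in as.numeric is an optional choice of mine that stores plain numeric vectors instead of one-column matrices.

```r
library(dplyr)

# Hypothetical data frame standing in for the chapter's data
toy <- data.frame(
  interview = c(7.5, 9.3, 7.5, 8.0, 6.8),
  age       = c(29.7, 31.7, 29.4, 37.9, 46.2)
)

# Standardize two variables in one mutate() call,
# separating the two arguments with a comma
toy <- toy %>%
  mutate(
    st_interview = as.numeric(scale(interview)),
    st_age       = as.numeric(scale(age))
  )

# Each standardized variable has a mean of zero and a standard deviation of 1
colMeans(toy[, c("st_interview", "st_age")])
sapply(toy[, c("st_interview", "st_age")], sd)
```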
21.2.7 Summary
In this chapter, we learned how to grand-mean center, group-mean center, and standardize variables. For grand-mean centering variables, we used basic arithmetic and the mean function from base R, the scale function from base R, and a combination of the mutate function from dplyr and the mean function from base R. For group-mean centering variables, we combined the group_by, mutate, and ungroup functions from dplyr with the mean function from base R. For standardizing variables, we used the scale function from base R and a combination of the mutate function from dplyr and the scale function from base R.