Chapter 30 Creating a Composite Variable Based on a Multi-Item Measure
In this chapter, we will learn how to create a composite variable based on scores from a multi-item measure. To justify which items should be included or excluded when creating the composite variable, we will compute Cronbach’s alpha (\(\alpha\)) as an indicator of internal consistency reliability.
30.1 Conceptual Overview
Sometimes it is useful to create a composite variable out of the sum or mean of multiple variables’ scores. This process results in each case (e.g., observation, employee, individual) receiving a composite score based on the sum or mean of their scores on the variables used to create the composite variable. As an example, imagine that a direct supervisor rates each of their team members on five dimensions from a performance evaluation measure, which results in five separate variables corresponding to the five dimensions. A composite variable could be created for each team member by taking the average of the five ratings each individual received, such that the resulting variable represents each team member’s overall level of job performance.
So why do we create composite variables? Well, psychological constructs (e.g., attitudes, behaviors, feelings) are often multi-faceted, which means that a single item will likely not “capture” the full underlying concept (i.e., construct) space or domain. Continuing with the performance evaluation example, job performance is often multi-faceted because, for a given job, the conceptual performance domain typically involves a variety of different behavior types (e.g., customer service, administrative). By creating a composite variable out of variables that are intended to measure the different sub-facets of the conceptual performance domain, we can create an indicator of overall job performance.
An overall scale score variable is a specific type of composite variable in which scores on items from a multi-item scale (i.e., measure) are combined (by computing the sum or the mean) into a new variable. This process results in each case (e.g., observation, employee, individual) receiving an overall scale score based on the sum or mean of their scores on the items from a given measure. For example, imagine that employees respond to an annual survey containing a three-item job satisfaction measure. An overall scale score variable could be created by computing each respondent’s average response scores to the three job satisfaction items and applying these to a new composite variable that represents each respondent’s overall level of job satisfaction.
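To make the idea concrete, here is a minimal sketch of mean-based and sum-based overall scale scores. The data frame `toy` and its values are hypothetical and were made up for illustration; they are not from the survey data used later in this chapter.

```r
# Hypothetical responses from three employees to a three-item measure
toy <- data.frame(
  JobSat1 = c(3, 4, 2),
  JobSat2 = c(3, 4, 3),
  JobSat3 = c(3, 5, 3)
)
items <- c("JobSat1", "JobSat2", "JobSat3")

# Mean-based overall scale score (stays on the 1-5 response metric)
toy$JobSat_Mean <- rowMeans(toy[, items])

# Sum-based overall scale score (can range from 3 to 15 for three 5-point items)
toy$JobSat_Sum <- rowSums(toy[, items])

print(toy)
```

A mean-based score is often easier to interpret because it stays on the same metric as the original response format, whereas a sum-based score depends on the number of items.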
To justify whether it is appropriate to create a composite variable, or to decide which variables should be used to create it, we can estimate internal consistency reliability (via Cronbach’s alpha) for the set of variables (or subsets of those variables). Based on data from a given sample, internal consistency reliability provides an indication of how homogeneous the scores on a set of variables are. If you need a refresher on internal consistency reliability, refer to the previous chapter on estimating internal consistency reliability using Cronbach’s alpha. Finally, as a reminder, the following table includes the qualitative descriptors that can be used to interpret Cronbach’s alpha.
Cronbach’s alpha (\(\alpha\)) | Qualitative Descriptor |
---|---|
.95-1.00 | Excellent |
.90-.94 | Great |
.80-.89 | Good |
.70-.79 | Acceptable |
.60-.69 | Questionable |
.00-.59 | Unacceptable |
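For readers who want to see what goes into the estimate, the following sketch computes Cronbach’s alpha from its standard formula, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_T^2}\right)\), where \(k\) is the number of items, \(s_i^2\) is the variance of item \(i\), and \(s_T^2\) is the variance of the summed scores. The function name `cronbach_alpha` and the data are hypothetical, and the sketch assumes listwise deletion of cases with missing data, which may differ from how a dedicated package handles missing values.

```r
# Hypothetical helper: Cronbach's alpha from a data frame of item scores
cronbach_alpha <- function(items) {
  items <- na.omit(items)              # listwise deletion, for simplicity
  k <- ncol(items)                     # number of items
  item_vars <- apply(items, 2, var)    # variance of each item
  total_var <- var(rowSums(items))     # variance of the summed scores
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical item scores for six respondents
toy <- data.frame(
  Item1 = c(2, 4, 4, 3, 5, 1),
  Item2 = c(1, 4, 4, 3, 5, 2),
  Item3 = c(2, 5, 4, 3, 4, 2)
)

cronbach_alpha(toy)
```

In practice, a function from an established package is preferable because it also reports item-level statistics; this sketch is only meant to show the arithmetic.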
30.2 Tutorial
This chapter’s tutorial demonstrates how to create a composite variable based on scores from a multi-item measure. To justify which items to include in the creation of the composite variable, we will use Cronbach’s alpha (\(\alpha\)) as an indicator of internal consistency reliability.
30.2.1 Video Tutorial
As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.
Link to video tutorial: https://youtu.be/vdmYv0YnWEE
30.2.2 Functions & Packages Introduced
Function | Package |
---|---|
rowMeans | base R |
rowSums | base R |
c | base R |
names | base R |
30.2.3 Initial Steps
If you haven’t already, save the file called “survey.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.
Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.
Next, read in the .csv data file called “survey.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2024). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.
# Set your working directory (your folder will likely differ)
setwd("H:/RWorkshop")
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
df <- read_csv("survey.csv")
## Rows: 156 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): SurveyID, JobSat1, JobSat2, JobSat3, TurnInt1, TurnInt2, TurnInt3, Engage1, Engage2, Engage3, Engage4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(df)
## [1] "SurveyID" "JobSat1" "JobSat2" "JobSat3" "TurnInt1" "TurnInt2" "TurnInt3" "Engage1" "Engage2" "Engage3" "Engage4"
# Print the number of rows in the data frame (tibble) object
nrow(df)
## [1] 156
# Print the top 6 rows of the data frame (tibble) object
head(df)
## # A tibble: 6 × 11
## SurveyID JobSat1 JobSat2 JobSat3 TurnInt1 TurnInt2 TurnInt3 Engage1 Engage2 Engage3 Engage4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 3 3 3 3 3 2 1 2 2
## 2 2 4 4 4 3 3 2 4 4 4 4
## 3 3 4 4 5 2 1 2 4 4 4 4
## 4 4 2 3 3 4 4 4 4 4 4 4
## 5 5 3 3 3 4 3 3 3 3 3 3
## 6 6 3 3 3 3 2 2 4 4 5 3
The data frame includes annual employee survey responses from 156 employees to three Job Satisfaction items (JobSat1, JobSat2, JobSat3), three Turnover Intentions items (TurnInt1, TurnInt2, TurnInt3), and four Engagement items (Engage1, Engage2, Engage3, Engage4). Employees responded to each item using a 5-point response format, ranging from Strongly Disagree (1) to Strongly Agree (5). Assume that higher scores on an item indicate higher levels of that variable; for example, a higher score on TurnInt1 would indicate that the respondent has higher intentions of quitting the organization.
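Before estimating reliability, it can be worth confirming that every item score falls within the 1-5 response format. The sketch below shows one way to do that in base R; the data frame `toy` is a hypothetical stand-in for a few rows of survey data, and the value 7 was added on purpose to illustrate an out-of-range entry.

```r
# Hypothetical stand-in for a few rows of survey data
toy <- data.frame(
  TurnInt1 = c(3, 3, 2, 4),
  TurnInt2 = c(3, 3, 1, 7),   # 7 is an out-of-range entry added for illustration
  TurnInt3 = c(3, 2, 2, NA)
)

# Flag values outside the 1-5 response format (missing values are not flagged)
out_of_range <- !is.na(toy) & (toy < 1 | toy > 5)

# Number of out-of-range values per item
colSums(out_of_range)
```

Any item with a nonzero count would deserve a closer look before it is used in a composite variable.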
30.2.4 Compute Cronbach’s alpha
To justify the creation of a composite variable (i.e., overall scale score variable) for one of the multi-item survey measures, we’ll first estimate internal consistency reliability using Cronbach’s alpha. To do so, we will use the alpha function from the psych package. To get started, install and access the psych package using the install.packages and library functions, respectively (if you haven’t already done so).
Now let’s compute Cronbach’s alpha for the four-item engagement measure.
# Access psych package (install it first if you haven't already)
library(psych)
# Estimate Cronbach's alpha for the four-item Engagement measure
alpha(df[,c("Engage1","Engage2","Engage3","Engage4")])
##
## Reliability analysis
## Call: alpha(x = df[, c("Engage1", "Engage2", "Engage3", "Engage4")])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.84 0.84 0.8 0.56 5.1 0.021 3.5 0.66 0.56
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.79 0.84 0.87
## Duhachek 0.79 0.84 0.88
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Engage1 0.78 0.78 0.71 0.55 3.6 0.030 0.00048 0.54
## Engage2 0.78 0.78 0.71 0.54 3.5 0.030 0.00355 0.53
## Engage3 0.82 0.82 0.75 0.60 4.5 0.025 0.00072 0.60
## Engage4 0.79 0.79 0.72 0.55 3.7 0.030 0.00524 0.54
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Engage1 156 0.83 0.83 0.75 0.69 3.6 0.77
## Engage2 156 0.84 0.84 0.76 0.70 3.4 0.83
## Engage3 156 0.77 0.78 0.66 0.61 3.4 0.76
## Engage4 156 0.84 0.83 0.74 0.68 3.6 0.86
##
## Non missing response frequency for each item
## 1 2 3 4 5 miss
## Engage1 0.01 0.06 0.35 0.49 0.10 0
## Engage2 0.02 0.10 0.42 0.39 0.06 0
## Engage3 0.00 0.10 0.44 0.40 0.06 0
## Engage4 0.01 0.08 0.30 0.47 0.13 0
Note: If you see the following message at the top or bottom of your output, you can often safely ignore it, unless you know that one or more items should have been reverse-coded. If an item needs to be reverse-coded, then you would need to take care of that prior to running the alpha function.
Some items ( [ITEM NAME] ) were negatively correlated with the total scale and probably should be reversed.
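If an item does need to be reverse-coded, one common approach is to subtract each score from the sum of the lowest and highest possible response values. The sketch below assumes a 1-5 response format; the item name `Engage5_neg` is hypothetical and does not exist in the survey data for this chapter.

```r
# Hypothetical responses to a negatively worded item on a 1-5 response format
toy <- data.frame(Engage5_neg = c(1, 2, 3, 4, 5, NA))

# Reverse-code: (minimum + maximum) - score, which is 6 - score for a 1-5 format
scale_min <- 1
scale_max <- 5
toy$Engage5_rev <- (scale_min + scale_max) - toy$Engage5_neg

print(toy)
```

Missing values remain missing after reverse-coding, and the reverse-coded variable (not the original) would then be used when estimating Cronbach’s alpha.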
The raw alpha (raw_alpha) based on all four engagement items exceeds the acceptable threshold of .70, and the Reliability if an item is dropped output table indicates that removing any one item would result in a lower Cronbach’s alpha (i.e., lower internal consistency reliability estimate). Further, let’s imagine that the conceptual definition for engagement is the extent to which a person feels enthusiastic, energized, and driven to perform their work, and that the content of the items is as follows:
- Engage1: “When I’m working, I’m full of energy.”
- Engage2: “I complete my work with enthusiasm.”
- Engage3: “I find inspiration in my work.”
- Engage4: “I have no problem working for long periods of time.”
We will retain all four items when computing the composite variable for engagement because:
- Cronbach’s alpha for all four items is above .70 (and thus acceptable);
- Removing an item would decrease Cronbach’s alpha; and
- The content of all four items seems to fit within the conceptual definition of engagement.
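The second criterion above (whether removing an item would change alpha) can also be checked by hand. The following base R sketch recomputes alpha with each item left out in turn; the helper `cronbach_alpha`, the data frame `toy`, and its values are hypothetical, and the helper assumes complete cases only.

```r
# Hypothetical helper: Cronbach's alpha for complete cases
cronbach_alpha <- function(items) {
  items <- na.omit(items)
  k <- ncol(items)
  (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
}

# Hypothetical item scores; Item4 is deliberately inconsistent with the others
toy <- data.frame(
  Item1 = c(2, 4, 4, 3, 5, 1),
  Item2 = c(1, 4, 4, 3, 5, 2),
  Item3 = c(2, 5, 4, 3, 4, 2),
  Item4 = c(3, 3, 1, 5, 2, 4)
)

# Alpha based on all items
alpha_all <- cronbach_alpha(toy)

# Alpha recomputed with each item left out in turn
alpha_if_dropped <- sapply(names(toy), function(item) {
  cronbach_alpha(toy[, setdiff(names(toy), item)])
})

# Items whose removal would raise alpha are candidates for a closer look
flagged <- names(alpha_if_dropped)[alpha_if_dropped > alpha_all]
flagged
```

A flagged item is not automatically removed; as in the list above, the decision should also weigh whether the item’s content fits the conceptual definition.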
As a reminder, for the purposes of this book, we will consider a scale with an alpha greater than or equal to .70 to demonstrate acceptable internal consistency for the particular sample, whereas an alpha that falls within the range of .60-.69 would be considered questionable, and an alpha below .60 would be deemed unacceptable. Here is a table of more nuanced qualitative descriptors for Cronbach’s alpha:
Cronbach’s alpha (\(\alpha\)) | Qualitative Descriptor |
---|---|
.95-1.00 | Excellent |
.90-.94 | Great |
.80-.89 | Good |
.70-.79 | Acceptable |
.60-.69 | Questionable |
.00-.59 | Unacceptable |
For a more in-depth review of internal consistency reliability and justifying which items (if any) to remove, be sure to check out the previous chapter.
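If you estimate alpha for several measures, it may be convenient to attach the qualitative descriptors automatically. The sketch below uses the cut function from base R; the function name `describe_alpha` is hypothetical, and treating negative alpha values as “Unacceptable” is my assumption, given that the table starts at .00.

```r
# Hypothetical helper: label alpha estimates using the descriptors in the table
describe_alpha <- function(alpha) {
  cut(
    alpha,
    breaks = c(-Inf, .60, .70, .80, .90, .95, Inf),
    labels = c("Unacceptable", "Questionable", "Acceptable",
               "Good", "Great", "Excellent"),
    right = FALSE   # intervals are closed on the left, e.g., [.70, .80)
  )
}

# Label a few example estimates
as.character(describe_alpha(c(.84, .65, .97)))
```

Note that the sketch applies the cutoffs to unrounded values, so an estimate such as .698 would be labeled “Questionable” even though it rounds to .70.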
30.2.5 Create a Composite Variable
To create a composite variable, we will use the rowMeans function from base R, as it offers a straightforward approach. The function also allows us to decide what to do with cases that have missing data on one or more of the variables (e.g., items). Given that we feel justified in creating a composite variable (i.e., overall scale score variable) based on the four engagement items (Engage1, Engage2, Engage3, Engage4), we will include all four of those items in our rowMeans function. First, let’s come up with a name for the composite variable we’re about to create; here, I decided to call the new variable Engage_Overall, as the variable will represent overall engagement. Second, we’ll append the new variable called Engage_Overall to the df data frame using the $ operator to indicate that the new variable will be added to that data frame object. Third, we’ll use the <- operator to indicate that we are assigning the results of the rowMeans function to the new variable. Fourth, we will type the name of the rowMeans function. Fifth, as the first argument, type the name of the data frame object to which the items belong (df). Sixth, following the data frame name, type brackets ([ ]), and within the brackets, type a comma (,) followed by the c (combine) function; by placing a comma in front of the c function, we are indicating that we will be referencing the names of columns (i.e., variables); within the c function, list the name of each item in quotation marks (" "), separated by commas (,). Finally, as the second argument in the rowMeans function, type the na.rm=TRUE argument, which will tell the function to compute the mean for each case that has at least one score for the specified items; in other words, this argument allows for the row means to be computed even if there are missing data. [Note: If you wish to create a composite variable based on the sum of item scores, you can use the rowSums function from base R.]
# Create composite (overall scale score) variable based on Engagement items
df$Engage_Overall <- rowMeans(df[,c("Engage1","Engage2","Engage3","Engage4")],
na.rm=TRUE)
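To see how the na.rm=TRUE argument behaves when responses are missing, consider the following sketch. The data frame `toy` and its values are hypothetical, and the rule requiring at least three of four answered items is an assumption I added for illustration, not a recommendation from this chapter.

```r
# Hypothetical responses with missing data on some items
toy <- data.frame(
  Engage1 = c(4, 3, NA, NA),
  Engage2 = c(4, NA, NA, NA),
  Engage3 = c(5, 3, 2, NA),
  Engage4 = c(3, 4, NA, NA)
)
items <- c("Engage1", "Engage2", "Engage3", "Engage4")

# Mean of whichever items each respondent answered
toy$Engage_Overall <- rowMeans(toy[, items], na.rm = TRUE)

# A row with no valid responses yields NaN; recode it to NA
toy$Engage_Overall[is.nan(toy$Engage_Overall)] <- NA

# Optional rule (an assumption): require at least 3 of 4 items answered
n_answered <- rowSums(!is.na(toy[, items]))
toy$Engage_Overall_Min3 <- ifelse(n_answered >= 3, toy$Engage_Overall, NA)

print(toy)
```

The third respondent answered only one item, so their composite score rests entirely on that single response unless a minimum-items rule like the one sketched here is applied.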
Let’s take a look at the variables in our df data frame object by using the names function from base R.
# Print the names of the variables in the data frame (tibble) object
names(df)
## [1] "SurveyID" "JobSat1" "JobSat2" "JobSat3" "TurnInt1" "TurnInt2" "TurnInt3"
## [8] "Engage1" "Engage2" "Engage3" "Engage4" "Engage_Overall"
Note that we now have a variable called Engage_Overall, which is our composite variable meant to represent overall engagement for each survey respondent.
You can take a closer look at the composite scores for the new Engage_Overall variable by using the View function (e.g., View(df)), which is included with R as part of the utils package.
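For the sum-based alternative (rowSums) mentioned earlier in this section, missing data deserve extra care, because a missing item simply contributes nothing to the sum when na.rm=TRUE is used. The sketch below illustrates the issue with hypothetical data; the prorated sum is one possible workaround that I am assuming for illustration.

```r
# Hypothetical responses; the second respondent skipped JobSat2
toy <- data.frame(
  JobSat1 = c(4, 4),
  JobSat2 = c(4, NA),
  JobSat3 = c(4, 4)
)
items <- c("JobSat1", "JobSat2", "JobSat3")

# With na.rm = TRUE, the skipped item lowers the second respondent's sum
toy$JobSat_Sum <- rowSums(toy[, items], na.rm = TRUE)

# One alternative (an assumption): rescale the mean of answered items
# back to the sum metric
toy$JobSat_Sum_Prorated <- rowMeans(toy[, items], na.rm = TRUE) * length(items)

print(toy)
```

Both respondents answered every item they completed with a 4, yet their raw sums differ; this is one reason a mean-based composite is often the safer default when some responses are missing.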
30.2.6 Summary
In this chapter, we learned how to create a composite variable (e.g., overall scale score variable) based on scores from a multi-item measure. To justify which items to include in the composite variable, we computed Cronbach’s alpha (\(\alpha\)) as an estimate of internal consistency reliability. The rowMeans and rowSums functions from base R are quite useful when it comes to creating composite variables.