Chapter 30 Creating a Composite Variable Based on a Multi-Item Measure

In this chapter, we will learn how to create a composite variable based on scores from a multi-item measure. To justify which items should be included or excluded when creating the composite variable, we will compute Cronbach’s alpha ($\alpha$) as an indicator of internal consistency reliability.

30.1 Conceptual Overview

Sometimes it is useful to create a composite variable out of the sum or mean of multiple variables’ scores. This process results in each case (e.g., observation, employee, individual) receiving a composite score based on the sum or mean of their scores on the variables used to create the composite variable. As an example, imagine that a direct supervisor rates each of their team members on five dimensions from a performance evaluation measure, which results in five separate variables corresponding to the five dimensions. A composite variable could be created for each team member by taking the average of the five ratings each individual received, such that the resulting variable represents each team member’s overall level of job performance.

So why do we create composite variables? Well, psychological constructs (e.g., attitudes, behaviors, feelings) are often multi-faceted, which means that a single item will likely not “capture” the full underlying concept (i.e., construct) space or domain. Continuing with the performance evaluation example, job performance is often multi-faceted because, for a given job, the conceptual performance domain often involves a variety of different behavior types (e.g., customer service, administrative). By creating a composite variable out of variables that are intended to measure the different sub-facets of the conceptual performance domain, we can create an indicator of overall job performance.

An overall scale score variable is a specific type of composite variable in which scores on items from a multi-item scale (i.e., measure) are combined (by computing the sum or the mean) into a new variable. This process results in each case (e.g., observation, employee, individual) receiving an overall scale score based on the sum or mean of their scores on the items from a given measure. For example, imagine that employees respond to an annual survey containing a three-item job satisfaction measure. An overall scale score variable could be created by computing each respondent’s average response scores to the three job satisfaction items and applying these to a new composite variable that represents each respondent’s overall level of job satisfaction.
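As a minimal sketch of the job satisfaction example, the code below computes an overall scale score both ways (mean and sum). The data frame `js`, its item names, and the response values are made up for illustration and do not come from this chapter's data file.

```r
# Hypothetical responses from four employees to three job satisfaction items
# (names and values are invented for illustration; 1-5 response format)
js <- data.frame(
  JS1 = c(3, 4, 2, 5),
  JS2 = c(3, 4, 3, 5),
  JS3 = c(3, 5, 3, 4)
)

# Overall scale score as the mean of the three items
js_mean <- rowMeans(js)

# Overall scale score as the sum of the three items
js_sum <- rowSums(js)

# Show the item scores next to both versions of the composite
data.frame(js, js_mean, js_sum)
```

One reason to prefer the mean is that it stays on the same 1-5 metric as the original items, whereas the sum of three such items can range from 3 to 15.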

To justify whether it is appropriate to create a composite variable or which variables should be used to create the composite variable, we can estimate internal consistency reliability (via Cronbach’s alpha) for the set of variables (or subsets of those variables). Based on data from a given sample, internal consistency reliability provides an indication of how homogeneous a set of variables is in terms of its scores. If you need a refresher on internal consistency reliability, refer to the previous chapter on estimating internal consistency reliability using Cronbach’s alpha. Finally, as a reminder, the following table includes the qualitative descriptors that can be used to interpret Cronbach’s alpha.

Cronbach’s alpha ($\alpha$)	Qualitative Descriptor
.95-1.00	Excellent
.90-.94	Great
.80-.89	Good
.70-.79	Acceptable
.60-.69	Questionable
.00-.59	Unacceptable
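To make the idea of homogeneity more concrete, here is a rough sketch of how raw Cronbach’s alpha can be computed by hand from the item variances and the variance of the summed score, using the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_{total}^2}\right)$, where $k$ is the number of items. The toy data frame `items` is invented for illustration, and the sketch assumes there are no missing responses.

```r
# Hypothetical scores from five respondents on three items (invented data)
items <- data.frame(
  Item1 = c(1, 2, 3, 4, 5),
  Item2 = c(2, 2, 3, 4, 4),
  Item3 = c(1, 3, 3, 5, 5)
)

# Number of items
k <- ncol(items)

# Variance of each item and variance of the summed (total) score
item_vars <- sapply(items, var)
total_var <- var(rowSums(items))

# Raw Cronbach's alpha (assumes complete data)
alpha_by_hand <- (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Print the estimate rounded to two decimals
round(alpha_by_hand, 2)
```

When the items covary strongly, the variance of the total score is large relative to the sum of the item variances, which pushes alpha toward 1.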

30.2 Tutorial

This chapter’s tutorial demonstrates how to create a composite variable based on scores from a multi-item measure. To justify which items to include in the creation of the composite variable, we will use Cronbach’s alpha ($\alpha$) as an indicator of internal consistency reliability.

30.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/vdmYv0YnWEE

30.2.2 Functions & Packages Introduced

Function	Package
`rowMeans`	base R
`rowSums`	base R
`c`	base R
`names`	base R

30.2.3 Initial Steps

If you haven’t already, save the file called “survey.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data file called “survey.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2024). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")

# Access readr package
library(readr)

# Read data and name data frame (tibble) object
df <- read_csv("survey.csv")

## Rows: 156 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): SurveyID, JobSat1, JobSat2, JobSat3, TurnInt1, TurnInt2, TurnInt3, Engage1, Engage2, Engage3, Engage4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Print the names of the variables in the data frame (tibble) object
names(df)

##  [1] "SurveyID" "JobSat1"  "JobSat2"  "JobSat3"  "TurnInt1" "TurnInt2" "TurnInt3" "Engage1"  "Engage2"  "Engage3"  "Engage4"

# Print number of rows in data frame (tibble) object
nrow(df)

## [1] 156

# Print top 6 rows of data frame (tibble) object
head(df)

## # A tibble: 6 × 11
##   SurveyID JobSat1 JobSat2 JobSat3 TurnInt1 TurnInt2 TurnInt3 Engage1 Engage2 Engage3 Engage4
##      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1        1       3       3       3        3        3        3       2       1       2       2
## 2        2       4       4       4        3        3        2       4       4       4       4
## 3        3       4       4       5        2        1        2       4       4       4       4
## 4        4       2       3       3        4        4        4       4       4       4       4
## 5        5       3       3       3        4        3        3       3       3       3       3
## 6        6       3       3       3        3        2        2       4       4       5       3

The data frame includes annual employee survey responses from 156 employees to three Job Satisfaction items (JobSat1, JobSat2, JobSat3), three Turnover Intentions items (TurnInt1, TurnInt2, TurnInt3), and four Engagement items (Engage1, Engage2, Engage3, Engage4). Employees responded to each item using a 5-point response format, ranging from Strongly Disagree (1) to Strongly Agree (5). Assume that higher scores on an item indicate higher levels of that variable; for example, a higher score on TurnInt1 would indicate that the respondent has higher intentions of quitting the organization.

30.2.4 Compute Cronbach’s alpha

To justify the creation of a composite variable (i.e., overall scale score variable) for one of the multi-item survey measures, we’ll first estimate internal consistency reliability using Cronbach’s alpha. To do so, we will use the alpha function from the psych package. To get started, install and access the psych package using the install.packages and library functions, respectively (if you haven’t already done so).

# Install package
install.packages("psych")

# Access package
library(psych)

Now let’s compute Cronbach’s alpha for the four-item engagement measure.

# Estimate Cronbach's alpha for the four-item Engagement measure
alpha(df[,c("Engage1","Engage2","Engage3","Engage4")])

## 
## Reliability analysis   
## Call: alpha(x = df[, c("Engage1", "Engage2", "Engage3", "Engage4")])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
##       0.84      0.84     0.8      0.56 5.1 0.021  3.5 0.66     0.56
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.79  0.84  0.87
## Duhachek  0.79  0.84  0.88
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se   var.r med.r
## Engage1      0.78      0.78    0.71      0.55 3.6    0.030 0.00048  0.54
## Engage2      0.78      0.78    0.71      0.54 3.5    0.030 0.00355  0.53
## Engage3      0.82      0.82    0.75      0.60 4.5    0.025 0.00072  0.60
## Engage4      0.79      0.79    0.72      0.55 3.7    0.030 0.00524  0.54
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean   sd
## Engage1 156  0.83  0.83  0.75   0.69  3.6 0.77
## Engage2 156  0.84  0.84  0.76   0.70  3.4 0.83
## Engage3 156  0.77  0.78  0.66   0.61  3.4 0.76
## Engage4 156  0.84  0.83  0.74   0.68  3.6 0.86
## 
## Non missing response frequency for each item
##            1    2    3    4    5 miss
## Engage1 0.01 0.06 0.35 0.49 0.10    0
## Engage2 0.02 0.10 0.42 0.39 0.06    0
## Engage3 0.00 0.10 0.44 0.40 0.06    0
## Engage4 0.01 0.08 0.30 0.47 0.13    0

Note: If you see the following message at the top or bottom of your output, you can often safely ignore it – that is, unless you know that one or more items should have been reverse-coded. If an item needs to be reverse coded, then you would need to take care of that prior to running the alpha function.

Some items ( [ITEM NAME] ) were negatively correlated with the total scale and probably should be reversed.
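If an item did need to be reverse-coded, one common approach for a 1-5 response format is to subtract each score from 6 (i.e., the minimum plus the maximum possible score). The sketch below illustrates this with made-up responses; the function name `reverse_code` and the vector `responses` are hypothetical and are not part of this chapter’s data or of any package.

```r
# Hypothetical helper: reverse-code scores given the scale's minimum and maximum
reverse_code <- function(x, min_score = 1, max_score = 5) {
  (min_score + max_score) - x
}

# Invented responses to a negatively worded item (NA = missing response)
responses <- c(1, 2, 5, 4, NA)

# Reverse-coded responses: 1 becomes 5, 2 becomes 4, and so on;
# missing responses remain missing
reverse_code(responses)
```

The reverse-coded scores would then be stored as a new variable and used in place of the original item when running the alpha function.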

The raw alpha (raw_alpha) based on all four engagement items exceeds the acceptable threshold of .70, and the Reliability if an item is dropped output table indicates that removing any one item would result in a lower Cronbach’s alpha (i.e., lower internal consistency reliability estimate). Further, let’s imagine that the conceptual definition for engagement is the extent to which a person feels enthusiastic, energized, and driven to perform their work, and the items’ content is as follows:

Engage1 - “When I’m working, I’m full of energy.”
Engage2 - “I complete my work with enthusiasm.”
Engage3 - “I find inspiration in my work.”
Engage4 - “I have no problem working for long periods of time.”

We will retain all four items when computing the composite variable for engagement because:

Cronbach’s alpha for all four items is above .70 (and thus acceptable);
Removing an item would decrease Cronbach’s alpha; and
The content of all four items seems to fit within the conceptual definition of engagement.
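To illustrate the logic behind the Reliability if an item is dropped table, here is a rough base R sketch that recomputes raw alpha after leaving out each item in turn. The helper `cronbach_alpha` and the toy data frame `items` are hypothetical, the sketch assumes complete data, and it is not how the alpha function from the psych package is implemented internally.

```r
# Hypothetical helper: raw Cronbach's alpha for a data frame of items
# (assumes no missing responses)
cronbach_alpha <- function(x) {
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(sapply(x, var)) / var(rowSums(x)))
}

# Invented scores from five respondents on three items
items <- data.frame(
  Item1 = c(1, 2, 3, 4, 5),
  Item2 = c(2, 2, 3, 4, 4),
  Item3 = c(1, 3, 3, 5, 5)
)

# Alpha based on all items
alpha_all <- cronbach_alpha(items)

# Alpha recomputed with each item left out, one at a time
alpha_if_dropped <- sapply(names(items), function(item) {
  cronbach_alpha(items[, setdiff(names(items), item), drop = FALSE])
})

# Compare the full-scale alpha to each alpha-if-dropped value
round(c(all_items = alpha_all, alpha_if_dropped), 2)
```

If every alpha-if-dropped value falls below the full-scale alpha, as in the engagement output above, the reliability estimate alone gives no reason to remove an item.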

As a reminder, for the purposes of this book, we will consider a scale with an alpha greater than or equal to .70 to demonstrate acceptable internal consistency for the particular sample, whereas an alpha that falls within the range of .60-.69 would be considered questionable, and an alpha below .60 would be deemed unacceptable. Here is a table of more nuanced qualitative descriptors for Cronbach’s alpha:

Cronbach’s alpha ($\alpha$)	Qualitative Descriptor
.95-1.00	Excellent
.90-.94	Great
.80-.89	Good
.70-.79	Acceptable
.60-.69	Questionable
.00-.59	Unacceptable

For a more in-depth review of internal consistency reliability and justifying which items (if any) to remove, be sure to check out the previous chapter.

30.2.5 Create a Composite Variable

To create a composite variable, we will use the rowMeans function from base R, as it offers a straightforward approach. The function also allows us to decide what to do with cases that have missing data on one or more of the variables (e.g., items). Given that we feel justified in creating a composite variable (i.e., overall scale score variable) based on the four engagement items (Engage1, Engage2, Engage3, Engage4), we will include all four of those items in our rowMeans function. First, let’s come up with a name for the composite variable we’re about to create; here, I decided to call the new variable Engage_Overall, as the variable will represent overall engagement. Second, we’ll append the new variable called Engage_Overall to the df data frame using the $ operator to indicate that the new variable will be added to that data frame object. Third, we’ll use the <- operator to indicate that we are assigning the results of the rowMeans function to the new variable. Fourth, we will type the name of the rowMeans function. Fifth, as the first argument, type the name of the data frame object to which the items belong (df). Sixth, following the data frame name, type brackets ([ ]), and within the brackets, type a comma (,) followed by the c (combine) function; by placing a comma in front of the c function, we are indicating that we will be referencing the names of columns (i.e., variables); within the c function, list the name of each item in quotation marks (" "), separated by commas (,). Finally, as the second argument in the rowMeans function, type the na.rm=TRUE argument, which will tell the function to compute the mean for each case that has at least one score for the specified items; in other words, this argument allows for the row means to be computed even if there are missing data. [Note: If you wish to create a composite variable based on the sum of item scores, you can use the rowSums function from base R.]

# Create composite (overall scale score) variable based on Engagement items
df$Engage_Overall <- rowMeans(df[,c("Engage1","Engage2","Engage3","Engage4")], 
                              na.rm=TRUE)
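To see what the na.rm argument does in practice, consider the following small sketch. The data frame `toy` and its values are invented for illustration (they are not rows from survey.csv); the second respondent skipped one item, and the third skipped all four.

```r
# Invented responses from three respondents, with some missing data
toy <- data.frame(
  Engage1 = c(4, 3, NA),
  Engage2 = c(4, NA, NA),
  Engage3 = c(5, 3, NA),
  Engage4 = c(3, 4, NA)
)

# na.rm=TRUE: mean of whatever items each respondent answered
mean_rm <- rowMeans(toy, na.rm = TRUE)

# na.rm=FALSE (the default): any missing item yields a missing composite
mean_strict <- rowMeans(toy, na.rm = FALSE)

# rowSums with na.rm=TRUE: missing items contribute nothing to the sum
sum_rm <- rowSums(toy, na.rm = TRUE)

# Compare the three versions side by side
data.frame(mean_rm, mean_strict, sum_rm)
```

Two things are worth noticing. A respondent with no scores at all still cannot receive a mean (the result is NaN), and with rowSums a skipped item effectively counts as zero, which lowers that respondent’s sum relative to those who answered every item. For that reason, the mean is often the safer choice when some item responses are missing.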

Let’s take a look at the variables in our df data frame object by using the names function from base R.

# Print variable names to Console
names(df)

##  [1] "SurveyID"       "JobSat1"        "JobSat2"        "JobSat3"        "TurnInt1"       "TurnInt2"       "TurnInt3"      
##  [8] "Engage1"        "Engage2"        "Engage3"        "Engage4"        "Engage_Overall"

Note that we now have a variable called Engage_Overall, which is our composite variable meant to represent overall engagement for each survey respondent.

You can take a closer look at the composite scores for the new Engage_Overall variable by using the View function from base R.

# View the data frame, including the new composite variable
View(df)
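If you would rather check the composite scores in the Console, a quick sanity check along the following lines can help. This sketch recreates only the engagement item scores for the first six respondents shown in the head output earlier in the chapter; the object name `first_six` is my own, and the same checks could be applied to the full df data frame.

```r
# Engagement item scores for the first six rows of the survey data,
# copied from the head(df) output shown earlier
first_six <- data.frame(
  Engage1 = c(2, 4, 4, 4, 3, 4),
  Engage2 = c(1, 4, 4, 4, 3, 4),
  Engage3 = c(2, 4, 4, 4, 3, 5),
  Engage4 = c(2, 4, 4, 4, 3, 3)
)

# Create the composite variable in the same way as in the tutorial
first_six$Engage_Overall <- rowMeans(
  first_six[, c("Engage1", "Engage2", "Engage3", "Engage4")],
  na.rm = TRUE
)

# Print the composite scores
first_six$Engage_Overall

# Summarize the composite and confirm that every score falls within the
# 1-5 range of the original response format
summary(first_six$Engage_Overall)
in_range <- all(first_six$Engage_Overall >= 1 & first_six$Engage_Overall <= 5)
in_range
```

Because the composite is a mean of items scored from 1 to 5, any value outside that range would signal a problem, such as a data entry error in one of the items.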

30.2.6 Summary

In this chapter, we learned how to create a composite variable (e.g., overall scale score variable) based on scores from a multi-item measure. To justify which items to include in the composite variable, we computed Cronbach’s alpha ($\alpha$) as an estimate of internal consistency reliability. The rowMeans and rowSums functions from base R are quite useful when it comes to creating composite variables.

References

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.