Chapter 20 Manipulating & Restructuring Data

In this chapter, we will learn how to manipulate and restructure data – and more specifically, convert a data frame object from wide to long format and from long to wide format.

20.1 Conceptual Overview

Data manipulation and restructuring refers to the process in which data are restructured or formatted in a different manner. When we think about structured data, we often think in terms of wide versus long formats. Wide format data are structured such that there are more columns (i.e., variables) and fewer rows than with long format data. For example, most survey platforms allow you to download the survey responses in a wide format, such that each variable (that corresponds to a survey item/question) will have its own column. Long format data are structured such that that are (sometimes) fewer columns yet more rows than with wide-format data. For example, we can restructure survey data from wide format such that one variable (i.e., column) contains the names of the different survey items/questions, and another variable contains the responses/scores for each item/question. Often our decision to manipulate data into wide versus long formats has to do with the type of analysis or visualization we plan to perform. As such, understanding how to manipulate and restructure your data is an important data-science skill.

Data can be manipulated into different structures to accomplish different goals. For example, a dataset in wide format might include employees’ scores on pre-test and post-test assessments as separate variables (i.e., columns), where one column includes the pre-test scores and one column includes the post-test scores. The same dataset could, alternatively, be manipulated or restructured into long format, such that the time of the test administration becomes a categorical variable and a separate variable contains the scores; In the wide-format dataset, each employee has a single row of data, whereas in the long-format dataset each employee has one row of data for their pre-test and one row of data for their post-test.

20.2 Tutorial

This chapter’s tutorial demonstrates how to restructure data from wide to long format and from long to wide format.

20.2.1 Video Tutorial

In the video tutorial below, I demonstrate how to use two data-manipulation functions that have since been deprecated by the tidyr package, which are called gather and spread. In this chapter, I review how to use the new data-manipulation functions from the tidyr package called pivot_longer and pivot_wider, respectively. I personally find the new functions to be more intuitive to use, but perhaps you’ll disagree. Regardless of which you choose, both will help you manipulate data from wide to long format and from long to wide format.

Link to video tutorial: https://youtu.be/5DgEAKLRjhw

20.2.2 Functions & Packages Introduced

Function Package
pivot_longer tidyr
c base R
pivot_wider tidyr

20.2.3 Initial Steps

If you haven’t already, save the file called “ManipulatingData.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from the my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data file called “ManipulatingData.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2022). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once ever 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
datam <- read_csv("ManipulatingData.csv")
## Rows: 20 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): SurveyID, JobSatisfaction, TurnoverIntentions, OrgCommitment, JobInvolvement
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(datam)
## [1] "SurveyID"           "JobSatisfaction"    "TurnoverIntentions" "OrgCommitment"      "JobInvolvement"
# Print data frame (tibble) object
print(datam)
## # A tibble: 20 × 5
##    SurveyID JobSatisfaction TurnoverIntentions OrgCommitment JobInvolvement
##       <dbl>           <dbl>              <dbl>         <dbl>          <dbl>
##  1        1            55.4               97.2          32.3           50.5
##  2        2            51.5               96            53.4           50.3
##  3        3            46.2               94.5          63.9           50.2
##  4        4            42.8               91.4          70.3           50.3
##  5        5            40.8               88.3          34.1           50.5
##  6        6            38.7               84.9          67.7           30.5
##  7        7            35.6               79.9          53.3           30.5
##  8        8            33.1               77.6          63.5           30.5
##  9        9            29                 74.5          68             30.5
## 10       10            26.2               71.4          67.4           30.5
## 11       11            23.1               66.4          15.6           30.5
## 12       12            22.3               61.8          71.8           30.5
## 13       13            22.3               57.2          70.2           30.6
## 14       14            23.3               52.9          64.9           30.5
## 15       15            25.9               51            62.2           30.5
## 16       16            29.5               51            67.3           30.5
## 17       17            32.8               51            40.6           30.5
## 18       18            35.4               51.4          74.7           30.5
## 19       19            40.3               51.4          71.8           50.1
## 20       20            56.7               56            84.3           69.5

Note in the data frame that the SurveyID variable (i.e., column, field) is a unique identifier variable, which means that each case (i.e., observation) has a unique value on this variable. Each row represents a unique employee’s composite score on four measures (i.e., JobSatisfaction, TurnoverIntentions, OrgCommitment, JobInvolvement), where higher scores indicate higher levels of the concept (i.e., construct) being measured.

# Determine number of rows in data frame
nrow(datam)
## [1] 20

Note how the data frame currently has 20 rows (cases) of data, as shown by the output to the nrow function from base R.

20.2.4 Wide-to-Long Format Data Manipulation

The data frame we read in called datam is in wide format, as each substantive variable has its own column. To restructure the data from wide to long format, we will use the pivot_longer function from the tidyr package (Wickham and Girlich 2022). Along with readr and dplyr (as well as other useful packages), the tidyr package is part of the tidyverse of packages. Let’s begin by installing and accessing the tidyr package so that we can use the pivot_longer function.

# Install tidyr package if you haven't already
install.packages("tidyr")
# Access tidyr package
library(tidyr)

If you received an error when attempting to access the tidyr package using the library function, you may need to install the following packages using the install.packages function: rlang and glue. Alternatively, you may try installing the entire tidyverse package.

Now that we’ve accessed the tidyr package, I will demonstrate two techniques for applying the pivot_longer function.

The first technique uses the pipe operator (%>%). The pipe operator comes from a package called magrittr (Bache and Wickham 2022), on which the tidyr package is partially dependent. In short, a pipe allows a person to more efficiently write code and to improve the readability of the code and overall script. Specifically, a pipe forwards the result or value of one object or expression to a subsequent function. In doing so, one can avoid writing functions in which other functions are nested parenthetically. For more information on the pipe operator, check out Wickham and Grolemund’s (2017) chapter on pipes.

The second technique for applying the pivot_longer function takes a more traditional approach in that it involves nested functions being nested parenthetically. If you don’t want to learn how to use pipes (or would like to learn how to use them at a later date), feel free to skip to the section below called Without Pipe.

20.2.4.1 With Pipe

Using the pipe (%>%) operator technique, let’s apply the pivot_longer function to manipulate the datam data frame object from wide format to long format.

We can specify the wide-to-long manipulation as follows.

  1. Create a name for a new data frame object to which we will eventually assign a long-format data frame object; here, I name the new data frame object datam_long.
  2. Use the <- operator to assign the new long-form data frame object to the object named datam_long in the step above.
  3. Type the name of the original data frame object (datam), followed by the pipe (%>%) operator.
  4. Type the name of the pivot_longer function.
  1. As the first argument in the pivot_longer function, type cols= followed by the c (combine) function. As the arguments within the c function, list the names of the variables that you wish to pivot from separate variables (wide) to levels or categories of a new variable, effectively stacking them vertically. In this example, let’s list the names of the four survey measures: JobSatisfaction, TurnoverIntentions, OrgCommitment, and JobInvolvement.
  2. As the second argument in the pivot_longer function, type names_to= followed by what you would like to name the new stacked variable (see previous) created from the four survey measure variables. Let’s call the new variable containing the names of the measures the following: "Measure".
  3. As the third argument in the pivot_longer function, type values_to= followed by what you would like to name the new variable that contains the scores for the four survey variables that are now stacked vertically for each case. Let’s call the new variable containing the scores on the four measures the following: "Score".
# Apply pivot_longer function to restructure data in long format (using pipe)
datam_long <- datam %>% 
  pivot_longer(cols=c(JobSatisfaction, TurnoverIntentions, OrgCommitment, JobInvolvement),
               names_to="Measure",
               values_to="Score")

# Print first 12 rows of new data frame
head(datam_long, n=12)
## # A tibble: 12 × 3
##    SurveyID Measure            Score
##       <dbl> <chr>              <dbl>
##  1        1 JobSatisfaction     55.4
##  2        1 TurnoverIntentions  97.2
##  3        1 OrgCommitment       32.3
##  4        1 JobInvolvement      50.5
##  5        2 JobSatisfaction     51.5
##  6        2 TurnoverIntentions  96  
##  7        2 OrgCommitment       53.4
##  8        2 JobInvolvement      50.3
##  9        3 JobSatisfaction     46.2
## 10        3 TurnoverIntentions  94.5
## 11        3 OrgCommitment       63.9
## 12        3 JobInvolvement      50.2

As you can see, in the output, each respondent now has four rows of data – one row for each measure and the associated score. The giveaway is that each respondent’s unique SurveyID value is repeated four times.

# Print variable names
names(datam_long)
## [1] "SurveyID" "Measure"  "Score"

Note how there are now just three variables: SurveyID, Measure, and Score. Further, by applying the nrow function to the new data frame (as shown below), we can see that the new data frame is now much longer in terms of the number of rows; specifically, the original datam data frame in wide format has 20 rows, and now the new datam_long data frame in long format has 80 rows.

# Print number of rows in data frame
nrow(datam_long)
## [1] 80

Now apply the View function from base R to view and scroll through the whole data frame. Note that the unique identifier variable (i.e., SurveyID) repeats multiple times to indicate that the same survey respondent has scores in multiple rows.

# View entire data frame
View(datam_long)

20.2.4.2 Without Pipe

We can also apply the pivot_longer function without using the pipe (%>%) operator. To do so, use the same code as above, except drop the pipe (%>%) operator and type data= followed by the name of the original data frame (datam) as the first argument within the pivot_longer function parentheses.

# Apply pivot_longer function to restructure data in long format (without using pipe)
datam_long <- pivot_longer(data=datam,
                           cols=c(JobSatisfaction, TurnoverIntentions, OrgCommitment, JobInvolvement),
                           names_to="Measure",
                           values_to="Score")

# Print first 12 rows of new data frame
head(datam_long, n=12)
## # A tibble: 12 × 3
##    SurveyID Measure            Score
##       <dbl> <chr>              <dbl>
##  1        1 JobSatisfaction     55.4
##  2        1 TurnoverIntentions  97.2
##  3        1 OrgCommitment       32.3
##  4        1 JobInvolvement      50.5
##  5        2 JobSatisfaction     51.5
##  6        2 TurnoverIntentions  96  
##  7        2 OrgCommitment       53.4
##  8        2 JobInvolvement      50.3
##  9        3 JobSatisfaction     46.2
## 10        3 TurnoverIntentions  94.5
## 11        3 OrgCommitment       63.9
## 12        3 JobInvolvement      50.2

Regardless of whether we use the pipe (%>%) operator, we end up with the same wide-to-long format manipulation.

20.2.5 Long-to-Wide Format Data Manipulation

Using the long-format data frame we just created called datam_long, let’s manipulate it back to wide format using the pivot_wider function from the tidyr package. As we did above, we’ll do the long-to-wide data manipulation with and without the pipe (%>%) operator.

20.2.5.1 With Pipe

Using the pipe (%>%) operator technique, let’s apply the pivot_wider function to manipulate the datam_long data frame object from long format back to wide format.

We can specify the long-to-wide manipulation as follows.

  1. Create a name for a new data frame object to which we will eventually assign a long-format data frame object; here, I name the new data frame object datam_wide.
  2. Use the <- operator to assign the new wide-form data frame object to the object named datam_wide in the step above.
  3. Type the name of the long-format data frame object (datam_long), followed by the pipe (%>%) operator.
  4. Type the name of the pivot_wider function.
  1. As the first argument in the pivot_wider function, type names_from= followed by the name of the variable that contains the names of the different survey measures: Measure. The levels or categories of this Measure variable will become the names of separate columns (i.e., variables) when the long-format data frame object is converted to a wide-format data frame object.
  2. As the second argument in the pivot_wider function, type values_from= followed by the name of the variable that contains the values (i.e., scores) for each of the survey measures: Score. The values from the Score variable will provide the data for the new survey-measure variables when the data frame object is in wide format.
# Apply pivot_wider function to restructure data in wide format (using pipe)
datam_wide <- datam_long %>% 
  pivot_wider(names_from=Measure,
              values_from=Score)

# Print first 12 rows of new data frame
head(datam_wide, n=12)
## # A tibble: 12 × 5
##    SurveyID JobSatisfaction TurnoverIntentions OrgCommitment JobInvolvement
##       <dbl>           <dbl>              <dbl>         <dbl>          <dbl>
##  1        1            55.4               97.2          32.3           50.5
##  2        2            51.5               96            53.4           50.3
##  3        3            46.2               94.5          63.9           50.2
##  4        4            42.8               91.4          70.3           50.3
##  5        5            40.8               88.3          34.1           50.5
##  6        6            38.7               84.9          67.7           30.5
##  7        7            35.6               79.9          53.3           30.5
##  8        8            33.1               77.6          63.5           30.5
##  9        9            29                 74.5          68             30.5
## 10       10            26.2               71.4          67.4           30.5
## 11       11            23.1               66.4          15.6           30.5
## 12       12            22.3               61.8          71.8           30.5

Now let’s print the variable names in the new datam_wide data frame object.

# Print variable names
names(datam_wide)
## [1] "SurveyID"           "JobSatisfaction"    "TurnoverIntentions" "OrgCommitment"      "JobInvolvement"

Note how we are now back to the same variables from our original wide-format data frame called datam: SurveyID, JobInvolvement, JobSatisfaction, OrgCommitment, and TurnoverIntentions. Further, by applying the nrow function to the new data frame (as shown below), we can see that the new wide-format data frame is now back to 20 rows. We have come full circle, as the pivot_wider function complements the pivot_longer function.

# Print number of rows in data frame
nrow(datam_wide)
## [1] 20

20.2.5.2 Without Pipe

We can also apply the pivot_wider function without the pipe (%>%) operator. Simply use the same code as above, except drop the pipe (%>%) operator and type data= followed by the name of the long-format data frame object (datam_long) as the first argument within the pivot_wider function parentheses.

# Apply pivot_wider function to restructure data in wide format (without using pipe)
datam_wide <- pivot_wider(data=datam_long, 
                          names_from=Measure,
                          values_from=Score)

# Print first 12 rows of new data frame
head(datam_wide, n=12)
## # A tibble: 12 × 5
##    SurveyID JobSatisfaction TurnoverIntentions OrgCommitment JobInvolvement
##       <dbl>           <dbl>              <dbl>         <dbl>          <dbl>
##  1        1            55.4               97.2          32.3           50.5
##  2        2            51.5               96            53.4           50.3
##  3        3            46.2               94.5          63.9           50.2
##  4        4            42.8               91.4          70.3           50.3
##  5        5            40.8               88.3          34.1           50.5
##  6        6            38.7               84.9          67.7           30.5
##  7        7            35.6               79.9          53.3           30.5
##  8        8            33.1               77.6          63.5           30.5
##  9        9            29                 74.5          68             30.5
## 10       10            26.2               71.4          67.4           30.5
## 11       11            23.1               66.4          15.6           30.5
## 12       12            22.3               61.8          71.8           30.5

Regardless of whether we use the pipe (%>%) operator, we end up with the same long-to-wide format manipulation.

20.2.6 Summary

In this chapter, we learned how to manipulate a data frame from wide format to long format using the pivot_longer, and how to manipulate a data frame from long format to wide format using the pivot_wider. Both functions are from the tidyr package. Manipulating and restructuring data into different formats or structures is often an essential data-management step prior to running certain analyses or generating certain data visualizations.

References

Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A Forward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.
Wickham, Hadley, and Maximilian Girlich. 2022. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Visualize, Model, Transform, Tidy, and Import Data. Sebastopol, California: O’Reilly Media, Inc. https://r4ds.had.co.nz/n.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.