Chapter 17 Joining (Merging) Data

In this chapter, we will learn the fundamentals of joins (merges). Specifically, we will learn how to join (merge) data horizontally and vertically.

17.1 Conceptual Overview

Joining (merging) refers to the process of matching two data frames by either one or more key variables (i.e., horizontal join) or by variable names or columns (i.e., vertical join). Sometimes a join is referred to as a merge and vice versa, and thus I will use these terms interchangeably throughout the chapter. Broadly speaking, there are two types of joins (merges): horizontal and vertical.

17.1.1 Review of Horizontal Joins (Merges)

A horizontal join (merge) refers to the process of matching cases (i.e., rows, observations) between two data frames using a key variable (matching variable), which results in distinct sets of variables (i.e., fields, columns) being combined horizontally (laterally) across two data frames. The resulting joined data frame will be wider (in terms of the number of variables) than either of the original data frames in isolation. For example, imagine that we pull data from separate information systems, each with different variables (i.e., fields) but at least some employees (i.e., cases) in common; to combine these two data frames, we can perform a horizontal join. This is often a necessary step when creating a data frame that contains all of the variables we will need in subsequent data analyses. For instance, if we wish to estimate the criterion-related validities using the selection tool scores from one data frame with criterion (e.g., job performance) scores from another data frame, then we could perform a horizontal join, assuming we have a key variable with which to match the scores from the two data frames.

In a horizontal join, cases (or observations) are matched between two data frames using one or more key variables.

We will focus on four different types of horizontal joins:

  1. Inner join: All unmatched cases (or observations) are dropped, thereby retaining only those cases that are present in both the left (x, first) and right (y, second) data frames. In other words, a case is only included in the merged data frame if it appears in both of the original data data frames.

In an inner join, all unmatched cases (or observations) are dropped, thereby retaining only those cases that are present in both the left (x, first) and right (y, second) data frames.

  1. Full join: All cases (or observations) are retained, including those cases that do not have a match in the other data data frame. In other words, a case is included in the merged data frame even if it only appears in one of the original data data frames. These type of join leads to the highest number of retained cases under conditions in which both data frames contain unique cases.

In a full join, all cases (or observations) are retained, including those cases that do not have a match in the other data data frame.

  1. Left join: All cases (or observations) that appear in the left (x, first) data frame are retained, even if they lack a match in the right (y, second) data frame. Consequently, cases from the right data frame that lack a match in the left data frame are dropped in the merged data frame.

In a left join, all cases (or observations) that appear in the left (x, first) data frame are retained, even if they lack a match in the right (y, second) data frame.

  1. Right join: All cases (or observations) that appear in the right (y, second) data frame are retained, even if they lack a match in the left (x, first) data frame. Consequently, cases from the left data frame that lack a match in the right data frame are dropped in the merged data frame.

In a right join, only cases (or observations) that appear in the right (y, second) data frame are retained, even if they lack a match in the left (x, first) data frame.

Please note that I have illustrated different types of horizontal joins using a single key variable. It is entirely possible to perform horizontal joins using two or more key variables. For example, imagine that each morning we administered a pulse survey to employees and each afternoon we afternoon we administered a different pulse survey to the same employees, and that we repeated this process for five consecutive workdays. In this instance, we would likely need to horizontally join the data frames using both a unique employee identifier variable and a unique day-of-week variable.

17.1.2 Review of Vertical Joins (Merges)

A vertical join (merge) refers to the process of matching identical variables from two data frames, which results in distinct sets of cases or observations being combined vertically. The resulting joined data frame will be longer (in terms of the number of cases) than either of the original data frames in isolation. For example, imagine an organization administered the same survey to two facilities (i.e., independent groups) each with unique employees; we could combine the two resulting data frames by performing a vertical join.

In a vertical join, identical variables are matched between two data frames, each with distinct sets of cases or observations.

17.2 Tutorial

This chapter’s tutorial demonstrates how to join (merge) cases from two data frames.

17.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below. Both versions of the tutorial will show you how to join (merge) data with or without the pipe (%>%) operator. If you’re unfamiliar with the pipe operator, no need to worry: I provide a brief explanation and demonstration regarding their purpose in both versions of the tutorial. Finally, please note that in the chapter supplement you have an opportunity to learn how to join (merge) cases from data frames using a function from base R, which you may find preferable or more intuitive.

Link to video tutorial: https://youtu.be/38zsLj-fWo0

17.2.2 Functions & Packages Introduced

Function Package
right_join dplyr
left_join dplyr
inner_join dplyr
full_join dplyr
data.frame base R
c base R
rep base R
rbind base R

17.2.3 Initial Steps

If you haven’t already, save the files called “PersData.csv” and “PerfData.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from the my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data files for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data files called “PersData.csv” and “PerfData.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2022). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once ever 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) objects
personaldata <- read_csv("PersData.csv")
## Rows: 9 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): lastname, firstname, startdate, gender
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
performancedata <- read_csv("PerfData.csv")
## Rows: 6 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): id, perf_q1, perf_q2, perf_q3, perf_q4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) objects
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
names(performancedata)
## [1] "id"      "perf_q1" "perf_q2" "perf_q3" "perf_q4"
# Print data frame (tibble) objects
print(personaldata)
## # A tibble: 9 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  male  
## 2   154 McDonald   Ronald    1/9/2016  male  
## 3   155 Smith      John      1/9/2016  male  
## 4   165 Doe        Jane      1/4/2016  female
## 5   125 Franklin   Benjamin  1/5/2016  male  
## 6   111 Newton     Isaac     1/9/2016  male  
## 7   198 Morales    Linda     1/7/2016  female
## 8   201 Providence Cindy     1/9/2016  female
## 9   282 Legend     John      1/9/2016  male
print(performancedata)
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

As you can see from the output generated in your Console, on the one hand, the personaldata data frame object contains basic employee demographic information. The variable names include: id, lastname, firstname, startdate, and gender. On the other hand, the performancedata data frame object contains the same id unique identifier variable as the personaldata data frame object, but instead of employee demographic information, this data frame object includes variables associated with quarterly employee performance: perf_q1, perf_q2, perf_q3, and perf_q4.

In order to better illustrate certain join functions later on in this chapter, we’ll begin by removing the case (i.e., employee) associated with the id variable value of 153 (i.e., Alejandro Sanchez); in terms of a rationale for doing so, let’s imagine that Alejandro no longer works for the organization, and thus we would like to remove him from the personaldata data frame. If you don’t completely understand the following process for removing this individual from the data frame, no need to worry, as you will learn more in the subsequent chapter on filtering data.

  1. Type the name of the data frame object (personaldata) followed by the <- operator to overwrite the existing data frame object.
  2. Type the name of the original data frame object (personaldata) followed by brackets ([ ]).
  3. Within the brackets ([ ]), type the name of the data frame object (personaldata) again, followed by the $ operator and the name of the variable we wish to use to select the case that will be removed, which in this instance is the id unique identifier variable. The $ operator indicates to R that the id variable belongs to the personaldata data frame.
  4. Type the “not equal to” operator, which is != (the ! means “not”), followed by the id variable value we wish to use to remove the case (i.e., 153).
  5. Type a comma (,) to indicate that we are removing a row, not a column. When referencing rows and columns in R, as we are doing in the brackets ([ ]), rows are entered first (before a comma), and columns are entered second (after a comma). In doing so, we are telling R to retain all rows of data in personaldata except for the one corresponding to id equal to 153.
# Remove case with id variable equal to 153
personaldata <- personaldata[personaldata$id != 153,]

Check out the first 6 rows of the updated data frame for personaldata, and note that the data corresponding to the case associated with id equal to 153 is gone.

# Print first 6 rows of first data frame object once more
head(personaldata)
## # A tibble: 6 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   154 McDonald Ronald    1/9/2016  male  
## 2   155 Smith    John      1/9/2016  male  
## 3   165 Doe      Jane      1/4/2016  female
## 4   125 Franklin Benjamin  1/5/2016  male  
## 5   111 Newton   Isaac     1/9/2016  male  
## 6   198 Morales  Linda     1/7/2016  female

17.2.4 Horizontal Join (Merge)

Recall that a horizontal join (merge) means that cases are matched using one more more key variables, and as a result, variables (i.e., columns, fields) are combined across two data frames. We will review two options for performing horizontal joins.

To perform horizontal joins, we will learn how to use the join functions from the the dplyr package (Wickham et al. 2022), which include: right_join, left_join, inner_join, and full_join. Please note that there are other functions we could use to perform horizontal joins, and if you’re interested, in the Joining (Merging) Data: Chapter Supplement, I demonstrate how to use the merge function from base R to carry out the same operations we will cover below.

Using the aforementioned join functions, we will match cases from the personaldata and performancedata data frames using the id unique identifier variable as a key variable. So how can we verify that id is an appropriate key variable? Well, let’s use the names function from base R to retrieve the list of variable names from the two data frames, which we already did above. Nevertheless, let’s call up those variable names once more. Simply enter the name of the data frame as a parenthetical argument in the names function.

# Retrieve variable names from first data frame
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
# Retrieve variable names from second data frame
names(performancedata)
## [1] "id"      "perf_q1" "perf_q2" "perf_q3" "perf_q4"

As you can see in the variable names listed above, the id variable is common to both data frames, and thus it will serve as our key variable.

Now we are almost ready to begin joining the two data frames using the id unique identifier as a key variable. Before doing so, however, we should make sure that we have installed and accessed the dplyr package (if we haven’t already), as the join functions come from that package.

# Install dplyr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("dplyr")
# Access dplyr package
library(dplyr)

I will demonstrate two techniques for applying the join function.

The first technique uses the pipe operator (%>%). The pipe operator comes from a package called magrittr (Bache and Wickham 2022), on which the dplyr is partially dependent. In short, a pipe allows a person to more efficiently write code and to improve the readability of the code and overall script. Specifically, a pipe forwards the result or value of one object or expression to a subsequent function. In doing so, one can avoid writing functions in which other functions are nested parenthetically. For more information on the pipe operator, check out Wickham and Grolemund’s (2017) chapter on pipes: https://r4ds.had.co.nz/pipes.html.

The second technique for applying the join function takes a more traditional approach in that it involves nested functions being nested parenthetically. If you don’t want to learn how to use pipes (or would like to learn how to use them at a later date), feel free to skip to the section below called Without Pipe.

17.2.4.1 With Pipe

Using the pipe (%>%) operator technique, let’s begin with what is referred to as an inner join by doing the following:

  1. Use the <- operator to name the joined (merged) data frame that we will create using the one of the dplyr join functions. For this example, I name the new joined data frame mergeddf, which is completely arbitrary; you could name it whatever you would like. Make sure you put the name of the new data frame object to the left of the <- operator.
  2. To the right of the <- operator, type the name of the first data frame, which we named personaldata, followed by the pipe (%>%) operator. This will “pipe” our data frame into the subsequent function.
  3. On the same line or on the next line, type the inner_join function, and within the parentheses as the first argument, type the name of the second data frame, which we called performancedata. As the second argument, use the by= argument to indicate the name of the key variable, which in this example is id; make sure the key variable is in quotation marks (" "), and remember, object and variable names in R are case and space sensitive.
# Inner join (with pipe)
mergeddf <- personaldata %>% inner_join(performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 5 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 2   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 3   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 5   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

Now, let’s revisit the original data frame objects that we read in initially.

# Print the first original data frame
personaldata
## # A tibble: 8 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   154 McDonald   Ronald    1/9/2016  male  
## 2   155 Smith      John      1/9/2016  male  
## 3   165 Doe        Jane      1/4/2016  female
## 4   125 Franklin   Benjamin  1/5/2016  male  
## 5   111 Newton     Isaac     1/9/2016  male  
## 6   198 Morales    Linda     1/7/2016  female
## 7   201 Providence Cindy     1/9/2016  female
## 8   282 Legend     John      1/9/2016  male
# Print the second original data frame
performancedata
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

In the output, first, note how all of the variables from the original data frames (i.e., personaldata, performancedata) are represented in the merged data frame (i.e., mergeddf). Second, note how the cases are matched by the id key variable. Third, note that the personaldata data frame has 8 cases, the performancedata data frame has 6 cases, and the mergeddf data frame has 6 cases. By default, the merge function performs an inner join and retains only those matched cases that have data in both data frames. Because cases whose id values were 154, 155, and 165 had data in personaldata but not performancedata and because the case with an id value equal to 153 was in performancedata but not personaldata, only the 5 cases that had available data in both data frames were retained.

To perform what is referred to as a full join in which we retain all cases and available data, we simply swap out the inner_join function from our previous code with the full_join function.

# Full join (with pipe)
mergeddf <- personaldata %>% full_join(performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 9 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 2   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 3   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 4   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 5   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 6   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 7   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 8   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5
## 9   153 <NA>       <NA>      <NA>      <NA>       3.9     4.8     4.9     5

Note how the full_join function retains all available cases that had available data in at least one of the data frames, which in this example is 9 cases. When in doubt, I recommend using the full_join function to retain all available data.

To perform what is referred to as a left join in which we retain only those cases with data available in the first (left, x) data frame (personaldata), we use the left_join function instead, while keeping the rest of the previous code the same.

# Left join (with pipe)
mergeddf <- personaldata %>% left_join(performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 8 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 2   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 3   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 4   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 5   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 6   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 7   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 8   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

Note how the left_join function retains only those cases for which the first (left, x) data frame (i.e., personaldata) has complete data, which in this case happens to be 8 cases. Notably absent is the case associated with id equal to 153 because the first (left, x) data frame (i.e., personaldata) lacked that case. An NA appears for each case from the second (right, y) data frame that contained missing values on variables from that data frame.

To perform what is referred to as a right join in which we retain only those cases with data available in the second (right, y) data frame (performancedata), we will use the right_join function instead, while keeping the rest of the previous code the same.

# Right join (with pipe)
mergeddf <- personaldata %>% right_join(performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 6 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 2   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 3   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 5   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5
## 6   153 <NA>       <NA>      <NA>      <NA>       3.9     4.8     4.9     5

Note how the right_join function retains only those cases for which the joined (second, right, y) data frame (i.e., performancedata) has complete data. Because the first (left, x) data frame lacks data for the case in which id is equal to 153, an NA appears for each case from the first data frame that contained missing values on variables from that data frame.

17.2.4.2 Without Pipe

In this section, I demonstrate the same dplyr join functions as above, except here I demonstrate how to specify the functions without the use of a pipe (%>%) operator.

Let’s begin with what is referred to as an inner join by doing the following:

  1. Use the <- operator to name the joined (merged) data frame that we will create using the one of the dplyr join functions. For this example, I name the new joined data frame mergeddf, which is completely arbitrary; you could name it whatever you would like. Make sure you put the name of the new data frame object to the left of the <- operator.
  2. To the right of the <- operator, type the name of the inner_join function. As the first argument within the parentheses, type the name of the first data frame, which we named personaldata. As the second argument, type the name of the second data frame we named performancedata. As the third argument, use the by= argument to indicate the name of the key variable, which in this example is id; make sure the key variable is in quotation marks (" "), and remember, object and variable names in R are case and space sensitive.
# Inner join (without pipe)
mergeddf <- inner_join(personaldata, performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 5 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 2   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 3   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 5   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

Now, let’s revisit the original data frame objects that we read in initially.

# Print the first original data frame
personaldata
## # A tibble: 8 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   154 McDonald   Ronald    1/9/2016  male  
## 2   155 Smith      John      1/9/2016  male  
## 3   165 Doe        Jane      1/4/2016  female
## 4   125 Franklin   Benjamin  1/5/2016  male  
## 5   111 Newton     Isaac     1/9/2016  male  
## 6   198 Morales    Linda     1/7/2016  female
## 7   201 Providence Cindy     1/9/2016  female
## 8   282 Legend     John      1/9/2016  male
# Print the second original data frame
performancedata
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

In the output, first, note how all of the variables from the original data frames (i.e., personaldata, performancedata) are represented in the merged data frame (i.e., mergeddf). Second, note how the cases are matched by the id key variable. Third, note that the personaldata data frame has 8 cases, the performancedata data frame has 6 cases, and the mergeddf data frame has 6 cases. By default, the merge function performs an inner join and retains only those matched cases that have data in both data frames. Because cases whose id values were 154, 155, and 165 had data in personaldata but not performancedata and because the case with an id value equal to 153 was in performancedata but not personaldata, only the 5 cases that had available data in both data frames were retained.

To perform what is referred to as a full join in which we retain all cases and available data, we simply swap out the inner_join function from our previous code with the full_join function.

# Full join (without pipe)
mergeddf <- full_join(personaldata, performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 9 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 2   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 3   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 4   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 5   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 6   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 7   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 8   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5
## 9   153 <NA>       <NA>      <NA>      <NA>       3.9     4.8     4.9     5

Note how the full_join function retains all available cases that had available data in at least one of the data frames, which in this example is 9 cases. When in doubt, I recommend using the full_join function to retain all available data.

To perform what is referred to as a left join in which we retain only those cases with data available in the first (left, x) data frame (personaldata), we use the left_join function instead, while keeping the rest of the previous code the same.

# Left join (without pipe)
mergeddf <- left_join(personaldata, performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 8 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 2   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 3   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 4   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 5   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 6   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 7   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 8   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

Note how the left_join function retains only those cases for which the first (left, x) data frame (i.e., personaldata) has complete data, which in this case happens to be 8 cases. Notably absent is the case associated with id equal to 153 because the first (left, x) data frame (i.e., personaldata) lacked that case. An NA appears for each case from the second (right, y) data frame that contained missing values on variables from that data frame.

To perform what is referred to as a right join in which we retain only those cases with data available in the second (right, y) data frame (performancedata), we use the right_join function instead, while keeping the rest of the previous code the same.

# Right join (without pipe)
mergeddf <- right_join(personaldata, performancedata, by="id")

# Print the joined data frame
mergeddf
## # A tibble: 6 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 2   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 3   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 5   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5
## 6   153 <NA>       <NA>      <NA>      <NA>       3.9     4.8     4.9     5

Note how the right_join function retains only those cases for which the joined (second, right, y) data frame (i.e., performancedata) has complete data. Because the first (left, x) data frame lacks data for the case in which id is equal to 153, an NA appears for each case from the first data frame that contained missing values on variables from that data frame.

17.2.5 Vertical Join (Merge)

To perform a vertical join (merge), we will use the rbind function from base R, which stands for “row bind.” As a reminder, with a horizontal join, our focus is on joining variables (i.e., columns, fields) from two data frames containing overlapping cases (i.e., rows). In contrast, with a vertical join, our focus is on joining cases from data frames with the same variables.

To illustrate how to perform a vertical join, we take a slightly different approach than what we did with horizontal joins. Instead of reading in data files, we will create two “toy” employee demographic data frames with the exact same variables but different cases. We will use the data.frame function from base R to indicate that we wish to create a data frame object; we use the c (combine) function from base R to combine values into a vector; and we use the rep (replicate) function from base R to replicate the same value a specified number of times. Also note that the : operator, when used between two numbers, creates a vector of consecutive values, beginning with the first value and ending with the second. Please note, that using and understanding the data.frame, c, and rep functions is not consequential for understanding how to do a vertical merge; rather, I merely use these functions in this tutorial to create quick toy data frames that we can use to illustrate how to do a vertical join. For more information on the data.frame function and the c function, please refer to the chapter called Basic Features and Operations of the R Language.

# Create data frames with same variables but arbitrary values
df1 <- data.frame(id=c(1:6), age=c(21:26), sex=c(rep("male", 6)))
df2 <- data.frame(id=c(7:10), age=c(27:30), sex=c(rep("female", 4)))
# Print first data frame
df1
##   id age  sex
## 1  1  21 male
## 2  2  22 male
## 3  3  23 male
## 4  4  24 male
## 5  5  25 male
## 6  6  26 male
# Print second data frame
df2
##   id age    sex
## 1  7  27 female
## 2  8  28 female
## 3  9  29 female
## 4 10  30 female

Given that these two data frames (i.e., df1, df2) have the exact same variable names (id, age, and sex), we can easily perform a vertical join using the rbind function. To do so, enter the names of the two data frames as arguments, separated by a comma. Use the <- operator to name the merged data frame something, which for this case, I arbitrarily named it mergeddf2.

# Vertical merge
mergeddf2 <- rbind(df1, df2)

# Print the merged data frame
mergeddf2
##    id age    sex
## 1   1  21   male
## 2   2  22   male
## 3   3  23   male
## 4   4  24   male
## 5   5  25   male
## 6   6  26   male
## 7   7  27 female
## 8   8  28 female
## 9   9  29 female
## 10 10  30 female

Note how the two data frames are now “stacked” on one another. This was possible because they shared the same variables names and variables types (e.g., numeric and character).

17.2.6 Summary

Joining (merging) data frames in R is a useful practice. In this chapter, we learned how to perform a horizontal join using the right_join, left_join, inner_join, and full_join functions from the dplyr package. We also learned how to perform a vertical join using the rbind function from base R.

17.3 Chapter Supplement

In addition to the join functions from the dplyr package covered above, we can use the merge function from base R to perform a horizontal join. Because this function comes from base R, we do not need to install and access an additional package like we do with the join functions, which some may find advantageous.

17.3.1 Video Tutorial

In addition to the written chapter supplement provided below, you can follow along with the following video tutorial to learn more about how to horizontally join data two data frames using the merge function from base R.

Link to video tutorial: https://youtu.be/MLihEVEpJBg

17.3.2 Functions & Packages Introduced

Function Package
names base R
merge base R

17.3.3 Initial Steps

If required, please refer to the Initial Steps section from this chapter for more information on these initial steps. Please note, however, that we are using two slightly different data files in this supplement, which will simply and clarify some of the different types of merges (joins) that we’ll go over.

# Set your working directory
setwd("H:/RWorkshop")
# Access readr package
library(readr)

# Read data and name data frame (tibble) objects
# Note that these data files are different than the
# ones we used in the main part of the chapter
personaldata <- read_csv("PersonalData.csv")
## Rows: 8 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): lastname, firstname, startdate, gender
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
performancedata <- read_csv("PerformanceData.csv")
## Rows: 6 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): id, perf_q1, perf_q2, perf_q3, perf_q4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) objects
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
names(performancedata)
## [1] "id"      "perf_q1" "perf_q2" "perf_q3" "perf_q4"
# Print data frame (tibble) objects
print(personaldata)
## # A tibble: 8 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  man   
## 2   154 McDonald   Ronald    1/9/2016  man   
## 3   155 Smith      John      1/9/2016  man   
## 4   165 Doe        Jane      1/4/2016  woman 
## 5   111 Newton     Isaac     1/9/2016  man   
## 6   198 Morales    Linda     1/7/2016  woman 
## 7   201 Providence Cindy     1/9/2016  woman 
## 8   282 Legend     John      1/9/2016  man
print(performancedata)
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

17.3.4 merge Function from Base R

We will use the merge function to horizontally match cases from the personaldata and performancedata data frames using id as a key variable. To identify what the key variable is, let’s use the names function from base R to print the list of variable names from the two data frames, which we already did above. Nevertheless, let’s call up those variable names once more. Simply enter the name of the data frame as a parenthetical argument in the names function.

# Print variable names from first data frame
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
# Print variable names from second data frame
names(performancedata)
## [1] "id"      "perf_q1" "perf_q2" "perf_q3" "perf_q4"

As you can see in the variable names listed above, the id variable is common to both data frames, and thus it will serve as our key variable.

Let’s begin with what is referred to as an inner join:

  1. Use the <- operator to name the joined data frame that we create using the merge function. For this example, I name the new joined data frame mergeddf, which is completely arbitrary; you could name it whatever you would like. Type the name of the new joined data frame to the left of the <- operator.
  2. To the right of the <- operator, type the name of the merge function. Within the merge function parentheses, we will provide the arguments needed to make this join a reality. First, enter the name of one of the data frames (e.g., personaldata), followed by a comma. Second, enter the name of of the other data frame (e.g., performancedata), followed by a comma. Third, use the by= argument to indicate the name of the key variable (e.g., id); make sure the key variable is in quotation marks (" "), and remember, object and variable names in R are case and space sensitive.
# Inner join
mergeddf <- merge(personaldata, performancedata, by="id")

# Print the joined data frame
mergeddf
##    id   lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
## 1 111     Newton     Isaac  1/9/2016    man     3.3     3.3     3.4     3.3
## 2 153    Sanchez Alejandro  1/1/2016    man     3.9     4.8     4.9     5.0
## 3 198    Morales     Linda  1/7/2016  woman     4.9     4.5     4.4     4.8
## 4 201 Providence     Cindy  1/9/2016  woman     1.2     1.1     1.0     1.0
## 5 282     Legend      John  1/9/2016    man     2.2     2.3     2.4     2.5

Now, let’s revisit the original data frame objects that we read in initially.

# Print the first original data frame
personaldata
## # A tibble: 8 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  man   
## 2   154 McDonald   Ronald    1/9/2016  man   
## 3   155 Smith      John      1/9/2016  man   
## 4   165 Doe        Jane      1/4/2016  woman 
## 5   111 Newton     Isaac     1/9/2016  man   
## 6   198 Morales    Linda     1/7/2016  woman 
## 7   201 Providence Cindy     1/9/2016  woman 
## 8   282 Legend     John      1/9/2016  man
# Print the second original data frame
performancedata
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

In the output, first, note how all of the variables from the original data frames (i.e., personaldata, performancedata) are represented in the merged data frame (i.e., mergeddf). Second, note how the cases are matched by the id key variable. Third, note that the personaldata data frame has 8 cases, the performancedata data frame has 6 cases, and the mergeddf data frame has 6 cases. By default, the merge function performs an inner join and retains only those matched cases that have data in both data frames. Because cases whose id values were 154, 155, and 165 had data in personaldata but not performancedata and because the case with an id value equal to 153 was in performancedata but not personaldata, only the 5 cases that had available data in both data frames were retained.

To perform what is referred to as a full join in which we retain all cases and available data, we can add the all= argument to our previous code and specify the logical value TRUE.

# Full join
mergeddf <- merge(personaldata, performancedata, by="id", all=TRUE)

# Print the joined data frame
mergeddf
##    id   lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
## 1 111     Newton     Isaac  1/9/2016    man     3.3     3.3     3.4     3.3
## 2 125       <NA>      <NA>      <NA>   <NA>     2.1     1.9     2.1     2.3
## 3 153    Sanchez Alejandro  1/1/2016    man     3.9     4.8     4.9     5.0
## 4 154   McDonald    Ronald  1/9/2016    man      NA      NA      NA      NA
## 5 155      Smith      John  1/9/2016    man      NA      NA      NA      NA
## 6 165        Doe      Jane  1/4/2016  woman      NA      NA      NA      NA
## 7 198    Morales     Linda  1/7/2016  woman     4.9     4.5     4.4     4.8
## 8 201 Providence     Cindy  1/9/2016  woman     1.2     1.1     1.0     1.0
## 9 282     Legend      John  1/9/2016    man     2.2     2.3     2.4     2.5

Note how the full_join function retains all available cases that had available data in at least one of the data frames, which in this example is 9 cases. When in doubt, I recommend using the full_join function to retain all available data.

To perform what is referred to as a left join in which we retain only those cases with data available in the first (left, x) data frame (personaldata), we use the all.x=TRUE argument instead.

# Left join 
mergeddf <- merge(personaldata, performancedata, by="id", all.x=TRUE)

# Print the joined data frame
mergeddf
##    id   lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
## 1 111     Newton     Isaac  1/9/2016    man     3.3     3.3     3.4     3.3
## 2 153    Sanchez Alejandro  1/1/2016    man     3.9     4.8     4.9     5.0
## 3 154   McDonald    Ronald  1/9/2016    man      NA      NA      NA      NA
## 4 155      Smith      John  1/9/2016    man      NA      NA      NA      NA
## 5 165        Doe      Jane  1/4/2016  woman      NA      NA      NA      NA
## 6 198    Morales     Linda  1/7/2016  woman     4.9     4.5     4.4     4.8
## 7 201 Providence     Cindy  1/9/2016  woman     1.2     1.1     1.0     1.0
## 8 282     Legend      John  1/9/2016    man     2.2     2.3     2.4     2.5

Note how the left join retains only those cases for which the first (left, x) data frame (i.e., personaldata) has complete data, which in this case happens to be 8 cases. Notably absent is the case associated with id equal to 153 because the first (left, x) data frame (i.e., personaldata) lacked that case. An NA appears for each case from the second (right, y) data frame that contained missing values on variables from that data frame.

To perform what is referred to as a right join in which we retain only those cases with data available in the second (right, y) data frame (performancedata), we use the all.y=TRUE argument instead.

# Right join 
mergeddf <- merge(personaldata, performancedata, by="id", all.y=TRUE)

# Print the joined data frame
mergeddf
##    id   lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
## 1 111     Newton     Isaac  1/9/2016    man     3.3     3.3     3.4     3.3
## 2 125       <NA>      <NA>      <NA>   <NA>     2.1     1.9     2.1     2.3
## 3 153    Sanchez Alejandro  1/1/2016    man     3.9     4.8     4.9     5.0
## 4 198    Morales     Linda  1/7/2016  woman     4.9     4.5     4.4     4.8
## 5 201 Providence     Cindy  1/9/2016  woman     1.2     1.1     1.0     1.0
## 6 282     Legend      John  1/9/2016    man     2.2     2.3     2.4     2.5

Note how the right join retains only those cases for which the joined (second, right, y) data frame (i.e., performancedata) has complete data. Because the first (left, x) data frame lacks data for the case in which id is equal to 153, an NA appears for each case from the first data frame that contained missing values on variables from that data frame.

References

Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A Forward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Visualize, Model, Transform, Tidy, and Import Data. Sebastopol, California: O’Reilly Media, Inc. https://r4ds.had.co.nz/n.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.