Chapter 18 Filtering (Subsetting) Data

In this chapter, we will learn how to filter (subset) cases from a data frame and how to select or remove variables from a data frame. We’ll begin with a conceptual overview of filtering (subsetting) and variable selection/removal. In this chapter, we’ll use the terms “filter” and “subset” interchangeably.

18.1 Conceptual Overview

Filtering data (i.e., subsetting data) is an important data-management process, as it allows us to:

  • Select or remove a subset of cases from a data frame based on their scores on one or more variables;
  • Select or remove a subset of variables from a data frame.

In this section, we will review logical operators, as it is through the application of logical operators that we will ultimately filter (subset) cases from a data frame.

18.1.1 Review of Logical Operators

When our goal is to select or remove a subset of cases (i.e., observations) from a data frame, we typically do so by applying logical operators. You may already be comfortable with the use of logical operators and expressions like “greater than” (\(>\)), “less than” (\(<\)), “equal to” (\(=\)), “greater than or equal to” (\(\ge\)), and “less than or equal to” (\(\le\)), but you may be less comfortable with the use of the logical “OR” and the logical “AND”. Thus, before we start working with data, let’s do a quick review.

When we wish to apply a single logical statement, our job is relatively straightforward. To keep things simple, let’s focus on a single vector, and we’ll call that vector \(X\). For example, we might choose to select only those cases with scores on \(X\) that are “greater than or equal to” (\(\ge\) or \(>\)\(=\)) 3, thereby retaining scores that are equal to or greater than 3, as highlighted in blue below. In other words, we select only those cases for which their score on \(X\) would be true given the logical statement \(X \ge 3\).

The vector X has seven possible scores, ranging from 1 to 7. If we apply the logical statement that X is “greater than or equal to” (\ge) 3, we select only those cases with scores that are equal to or greater than 3 and up to the maximum possible score of 7.
The vector \(X\) has seven possible scores, ranging from 1 to 7. If we apply the logical statement that \(X\) is “greater than or equal to” (\(\ge\)) 3, we select only those cases with scores that are equal to or greater than 3 and up to the maximum possible score of 7.

As another example, we might choose to select only those cases with scores on \(X\) that are “less than or equal to” (\(\le\) or \(<\)\(=\)) 5, thereby retaining scores that are equal to or less than 5, as highlighted in red below. In other words, we select only those cases for which their score on \(X\) would be true given the logical statement \(X \le 5\).

The vector X has seven possible scores, ranging from 1 to 7. If we apply the logical statement that X is “greater than or equal to” (\ge) 3, we select only those cases with scores that are equal to or greater than 3 and down to the minimum possible score of 1.
The vector \(X\) has seven possible scores, ranging from 1 to 7. If we apply the logical statement that \(X\) is “greater than or equal to” (\(\ge\)) 3, we select only those cases with scores that are equal to or greater than 3 and down to the minimum possible score of 1.

If we apply the logical “OR” operator, things get a bit more interesting. The logical “OR” is used when our objective is to select cases based on two logical statements, and if either logical statement is true for a given case, then that case is selected. The logical “OR” is consistent with idea of a mathematical union (\(\cup\)) from Set Theory. As an example, let’s combine the two logical statements from above with the logical “OR” operator. We retain those cases for which their score on \(X\) is “greater than or equal to” 3 or “less than or equal to” 5. Given that \(X\) has possible scores ranging from 1-7, below, we see that this logic retains all cases, as all scores (1-7) would satisfy one or both of the logical statements.

The vector X has seven possible scores, ranging from 1 to 7. If we wish to retain those cases for which X is “greater than or equal to” 3 or “less than or equal to” 5, we select those cases that satisfy either (or both) logical statements.
The vector \(X\) has seven possible scores, ranging from 1 to 7. If we wish to retain those cases for which \(X\) is “greater than or equal to” 3 or “less than or equal to” 5, we select those cases that satisfy either (or both) logical statements.

Using the same to logical statements above, let’s replace the logical “OR” with the logical “AND”. The logical “AND” is used when our objective is to select cases based on two logical statements, and if both logical statements are true for a given case, then that case is selected. The logical “AND” is consistent with idea of a mathematical intersection (\(\cap\)) from Set Theory. In this example, we retain only those cases for which their score on \(X\) is “greater than or equal to” 3 and “less than or equal to” 5. Given that \(X\) has possible scores ranging from 1-7, below, we see that this logic retain only cases with scores within the range 3-5, as only scores of 3, 4, or 5 would satisfy both of the logical statements.

The vector X has seven possible scores, ranging from 1 to 7. If we wish to retain those cases for which X is “greater than or equal to” 3 and “less than or equal to” 5, we select only those cases that satisfy both logical statements.
The vector \(X\) has seven possible scores, ranging from 1 to 7. If we wish to retain those cases for which \(X\) is “greater than or equal to” 3 and “less than or equal to” 5, we select only those cases that satisfy both logical statements.

18.2 Tutorial

This chapter’s tutorial demonstrates how to filter (subset) cases from a data frame and how to select or remove variables from a data frame.

18.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below. Both versions of the tutorial will show you how to filter (subset) data with or without the pipe (%>%) operator. If you’re unfamiliar with the pipe operator, no need to worry: I provide a brief explanation and demonstration regarding their purpose in both versions of the tutorial. Finally, please note that in the chapter supplement you have an opportunity to learn how to filter (subset) cases and select/remove variables using a function from base R, which you may find more intuitive or preferable for other reasons.

Link to video tutorial: https://youtu.be/izVcbPmu0D0

18.2.2 Functions & Packages Introduced

Function Package
str base R
filter dplyr
c base R
as.Date base R
select dplyr
subset base R

18.2.3 Initial Steps

If you haven’t already, save the files called “PersData.csv” and “PerfData.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from the my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data files called “PersData.csv” and “PerfData.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2024). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once ever 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) objects
personaldata <- read_csv("PersData.csv")
## Rows: 9 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): lastname, firstname, startdate, gender
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
performancedata <- read_csv("PerfData.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): id, perf_q1, perf_q2, perf_q3, perf_q4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) objects
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
names(performancedata)
## [1] "id"      "perf_q1" "perf_q2" "perf_q3" "perf_q4"
# Print data frame (tibble) objects
print(personaldata)
## # A tibble: 9 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  male  
## 2   154 McDonald   Ronald    1/9/2016  male  
## 3   155 Smith      John      1/9/2016  male  
## 4   165 Doe        Jane      1/4/2016  female
## 5   125 Franklin   Benjamin  1/5/2016  male  
## 6   111 Newton     Isaac     1/9/2016  male  
## 7   198 Morales    Linda     1/7/2016  female
## 8   201 Providence Cindy     1/9/2016  female
## 9   282 Legend     John      1/9/2016  male
print(performancedata)
## # A tibble: 6 × 5
##      id perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1   153     3.9     4.8     4.9     5  
## 2   125     2.1     1.9     2.1     2.3
## 3   111     3.3     3.3     3.4     3.3
## 4   198     4.9     4.5     4.4     4.8
## 5   201     1.2     1.1     1       1  
## 6   282     2.2     2.3     2.4     2.5

As you can see from the output generated in your console, on the one hand, the personaldata data frame object contains basic employee demographic information. The variable names include: id, lastname, firstname, startdate, and gender. On the other hand, the personaldata data frame object contains the same id unique identifier variable as the personaldata data frame object, but instead of employee demographic information, this data frame object includes variables associated with quarterly employee performance: perf_q1, perf_q2, perf_q3, and perf_q4.

To make this chapter more interesting (and for the sake of practice), let’s use the full_join function from dplyr (Wickham et al. 2023) to join (merge) the two data frames we just read in (personaldata, performancedata) using the id variable as the key variable. Let’s arbitrarily name the new joined (merged) data frame mergeddf using the <- operator. For more information on joining data, check out the chapter called Joining (Merging) Data.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("dplyr")
# Access package
library(dplyr)
# Full join (without pipe)
mergeddf <- full_join(personaldata, performancedata, by="id")

# Print joined (merged) data frame object
print(mergeddf)
## # A tibble: 9 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez    Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 3   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 4   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 5   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 6   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 7   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 8   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 9   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

Now we have a joined data frame called mergeddf!

18.2.4 Filter Cases from Data Frame

Sometimes we want to select only a subset of cases from a data frame or table. There are different functions that can achieve this end. For example, the subset function filter from base R will do the trick. With that said, the dplyr package (Wickham et al. 2023) offers the filter function which has some advantages (e.g., faster with larger amounts of data), and thus, we will focus on the filter function in this chapter. If you would like to learn how to use the subset function from base R, check out the chapter supplement called Joining (Merging) Data.

In order to properly filter data by cases, we need to know the respective types (classes) of the variables in the data frame. Perhaps the quickest way to find out the type (class) of each variable in the data frame is to use the str (structure) function from base R, and the function’s parentheses, just enter the name of the data frame (mergeddf).

# Determine class of variables
str(mergeddf)
## spc_tbl_ [9 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id       : num [1:9] 153 154 155 165 125 111 198 201 282
##  $ lastname : chr [1:9] "Sanchez" "McDonald" "Smith" "Doe" ...
##  $ firstname: chr [1:9] "Alejandro" "Ronald" "John" "Jane" ...
##  $ startdate: chr [1:9] "1/1/2016" "1/9/2016" "1/9/2016" "1/4/2016" ...
##  $ gender   : chr [1:9] "male" "male" "male" "female" ...
##  $ perf_q1  : num [1:9] 3.9 NA NA NA 2.1 3.3 4.9 1.2 2.2
##  $ perf_q2  : num [1:9] 4.8 NA NA NA 1.9 3.3 4.5 1.1 2.3
##  $ perf_q3  : num [1:9] 4.9 NA NA NA 2.1 3.4 4.4 1 2.4
##  $ perf_q4  : num [1:9] 5 NA NA NA 2.3 3.3 4.8 1 2.5
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   lastname = col_character(),
##   ..   firstname = col_character(),
##   ..   startdate = col_character(),
##   ..   gender = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Note that the id variable is of type integer; the lastname, firstname, startdate, and gender variables are of type character (string); and the perf_q4, perf_q4, perf_q4, and perf_q4 variables are of type numeric. The variable type will have important implications for how use use the filter function from dplyr.

In R, we can apply any one of the following logical operators when filtering our data:

Logical Operator Definition
< “less than”
> “greater than”
<= “less than or equal to”
>= “greater than or equal to”
== “equal to”
!= “not equal to”
| “or”
& “and”
! “not”

To get started, install and access the dplyr package.

# Install package
install.packages("dplyr")
# Access package
library(dplyr)

I will demonstrate two approaches for applying the filter function from dplyr. The first option uses “pipe(s),” which in R is represented by the %>% operator. The pipe operator comes from a package called magrittr (Bache and Wickham 2022), on which the dplyr is partially dependent. In short, a pipe allows one to more efficiently code/script and to improve the readability of the code/script under certain conditions. Specifically, a pipe forwards the result or value of one object or expression to a subsequent function. In doing so, one can avoid writing functions in which other functions are nested parenthetically. The second option is more traditional and lacks the efficiency and readability of pipes. You can use either approach, and if don’t you want to use pipes, skip to the section below called Without Pipes. For more information on the pipe operator, check out this link: https://r4ds.had.co.nz/pipes.html.

18.2.4.1 With Pipes

Using an approach with pipes, first, use the <- operator to name the filtered data frame that we will create. For this example, I name the new joined data frame filterdf; you could name it whatever you would like. Second, type the name of the first data frame, which we named mergeddf (see above), followed by the pipe (%>%) operator. This will “pipe” our data frame into the subsequent function. Third, either on the same line or on the next line, type the filter function. Fourth, within the function parentheses, type the name of the variable we wish to filter the data frame by, which in this example is gender. Fourth, type a logical operator, which for this example is ==. Fifth, type a value for the filter variable, which in this example is “female”; because the gender variable is of type character, we need to put quotation marks (" ") around the value of the variable that we wish to filter by. Remember, object names in R are case and space sensitive; for instance, gender is different from Gender, and “female” is different from “Female”.

# Filter in by gender with pipe
filterdf <- mergeddf %>% filter(gender=="female")

# Print filtered data frame
filterdf
## # A tibble: 3 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 2   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 3   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1

Note how the data frame above contains only those cases with “female” as their gender variable designation. The filter worked as expected.

Alternatively, we could filter out those cases in which gender is equal to “female” using the != (not equal to) logical operator.

# Filter out by gender with pipe
filterdf <- mergeddf %>% filter(gender!="female")

# Print filtered data frame
filterdf
## # A tibble: 6 × 9
##      id lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>    <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez  Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   154 McDonald Ronald    1/9/2016  male      NA      NA      NA      NA  
## 3   155 Smith    John      1/9/2016  male      NA      NA      NA      NA  
## 4   125 Franklin Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 5   111 Newton   Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 6   282 Legend   John      1/9/2016  male       2.2     2.3     2.4     2.5

Note how cases with gender equal to “female” are no longer in the data frame, while every other case is retained.

Let’s now filter by a variable of type numeric (or integer). Specifically, let’s select those cases in which the perf_q2 variable is greater than (>) 4.0. Because the perf_q2 variable is of type numeric, we don’t use quotation marks (" ") around the value we wish to filter by, which in this case is 4.0.

# Filter by perf_q2 with pipe
filterdf <- mergeddf %>% filter(perf_q2>4.0)

# Print filtered data frame
filterdf
## # A tibble: 2 × 9
##      id lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>    <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez  Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   198 Morales  Linda     1/7/2016  female     4.9     4.5     4.4     4.8

If we wish to filter by two variables, we can apply the logical “or” (|) operator or “and” (&) operator. First, let’s select those cases in which either gender is equal to “female” or perf_q2 is greater than 4.0 using the “or” (|) operator.

# Filter by gender or perf_q2 with pipe
filterdf <- mergeddf %>% filter(gender=="female" | perf_q2>4.0)

# Print filtered data frame
filterdf
## # A tibble: 4 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez    Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 3   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1

Watch what happens if we apply the logical “and” (&) operator with the same syntax as above.

# Filter by gender and perf_q2 with pipe
filterdf <- mergeddf %>% filter(gender=="female" & perf_q2>4.0)

# Print filtered data frame
filterdf
## # A tibble: 1 × 9
##      id lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>    <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   198 Morales  Linda     1/7/2016  female     4.9     4.5     4.4     4.8

We can also use the logical “or” (|) operator to select two values of the same variable.

# Filter by two values of firstname with pipe
filterdf <- mergeddf %>% filter(firstname=="John" | firstname=="Jane")

# Print filtered data frame
filterdf
## # A tibble: 3 × 9
##      id lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>    <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   155 Smith    John      1/9/2016  male      NA      NA      NA      NA  
## 2   165 Doe      Jane      1/4/2016  female    NA      NA      NA      NA  
## 3   282 Legend   John      1/9/2016  male       2.2     2.3     2.4     2.5

Or we can select two ranges of values from the same variable using the logical “or” (|) operator, assuming the variable is of type numeric, integer, or date.

# Filter by two ranges of values of perf_q1 with pipe
filterdf <- mergeddf %>% filter(perf_q1<=2.5 | perf_q1>=4.0)

# Print filtered data frame
filterdf
## # A tibble: 4 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 2   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 3   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 4   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

The filter function can also be used to remove multiple specific cases (such as from a unique identifier variable), which might be useful when you’ve identified outliers that need to be removed. As a first step, identify a vector of values that need to be removed. In this example, let’s pretend that cases with id variable values of 198 and 201 no longer work for this company, so they should be removed from the sample. To create a vector of these two values, use the c function like this: c(198,201). Next, because you are now filtering by a vector, you will need to use the %in% operator, which is an operator that instructs R to go through each value of the filter variable (id) and identify instances of 198 and 201 (c(198,201)); if the values match, then those cases are retained. However, because we entered ! in front of the filter variable, this actually reverses our logic and instructs R to remove those cases in which a value of the filter variable matches a value contained in the vector.

# Filter out id of 198 and 201 with pipe
filterdf <- mergeddf %>% filter(!id %in% c(198,201))

# Print filtered data frame
filterdf
## # A tibble: 7 × 9
##      id lastname firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>    <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez  Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   154 McDonald Ronald    1/9/2016  male      NA      NA      NA      NA  
## 3   155 Smith    John      1/9/2016  male      NA      NA      NA      NA  
## 4   165 Doe      Jane      1/4/2016  female    NA      NA      NA      NA  
## 5   125 Franklin Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 6   111 Newton   Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 7   282 Legend   John      1/9/2016  male       2.2     2.3     2.4     2.5

Note that in the output above cases with id variable values equal to 198 and 201 are no longer present.

If you remove the ! in front of the filter variable, only cases 198 and 201 are retained.

# Filter in id of 198 and 201 with pipe
filterdf <- mergeddf %>% filter(id %in% c(198,201))

# Print filtered data frame
filterdf
## # A tibble: 2 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   198 Morales    Linda     1/7/2016  female     4.9     4.5     4.4     4.8
## 2   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1

And if you wanted to remove just a single case, you could use the unique identifier variable (id) and the following script/code.

# Filter out id of 198 with pipe
filterdf <- mergeddf %>% filter(id!=198)

# Print filtered data frame
filterdf
## # A tibble: 8 × 9
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>
## 1   153 Sanchez    Alejandro 1/1/2016  male       3.9     4.8     4.9     5  
## 2   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA  
## 3   155 Smith      John      1/9/2016  male      NA      NA      NA      NA  
## 4   165 Doe        Jane      1/4/2016  female    NA      NA      NA      NA  
## 5   125 Franklin   Benjamin  1/5/2016  male       2.1     1.9     2.1     2.3
## 6   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3
## 7   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1  
## 8   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5

When working with variables of type Date, things can get a bit trickier. When we applied the str function from base R (see above), we found that the startdate variable was read in and joined as a character variable as opposed to a date variable. As such, we need to convert the startdate variable using the as.Date function from base R. First, type the name of the data frame object (mergeddf), followed by the $ operator and the name of whatever you want to call the new variable (startdate2); remember, the $ operator tells R that a variable belongs to (or will belong to) a particular data frame. Second, type the <- operator. Third, type the name of the as.Date function. Fourth, in the function parentheses, as the first argument, enter the as.character function with the name of the data frame object (mergeddf), followed by the $ operator and the name the original variable (startdate) as the sole argument. Fifth, as the second argument in the as.Date function, type format="%m/%d/%Y" to indicate the format for the data variable; note that the capital Y in %Y implies a 4-digit year, whereas a lower case would imply a 2-digit year.

# Convert character startdate variable to the Date type startdate2 variable
mergeddf$startdate2 <- as.Date(as.character(mergeddf$startdate), format="%m/%d/%Y")

To verify that the new startdate2 variable is of type date, use the str function from base R, and enter the name of the data frame object (mergeddf) as the sole argument. As you will see, the new startdate2 variable is now of type Date.

# Verify that the startdate2 variable is now a variable of type Date
str(mergeddf)
## spc_tbl_ [9 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id        : num [1:9] 153 154 155 165 125 111 198 201 282
##  $ lastname  : chr [1:9] "Sanchez" "McDonald" "Smith" "Doe" ...
##  $ firstname : chr [1:9] "Alejandro" "Ronald" "John" "Jane" ...
##  $ startdate : chr [1:9] "1/1/2016" "1/9/2016" "1/9/2016" "1/4/2016" ...
##  $ gender    : chr [1:9] "male" "male" "male" "female" ...
##  $ perf_q1   : num [1:9] 3.9 NA NA NA 2.1 3.3 4.9 1.2 2.2
##  $ perf_q2   : num [1:9] 4.8 NA NA NA 1.9 3.3 4.5 1.1 2.3
##  $ perf_q3   : num [1:9] 4.9 NA NA NA 2.1 3.4 4.4 1 2.4
##  $ perf_q4   : num [1:9] 5 NA NA NA 2.3 3.3 4.8 1 2.5
##  $ startdate2: Date[1:9], format: "2016-01-01" "2016-01-09" "2016-01-09" "2016-01-04" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   lastname = col_character(),
##   ..   firstname = col_character(),
##   ..   startdate = col_character(),
##   ..   gender = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Now we are ready to filter using the new startdate2 variable. When specify the value of the startdate2 variable by which you wish to filter by, make sure to use the as.Date function once more with the date (formatted as YYYY-MM-DD) in quotation marks (" ") as the sole argument. Here, I filter for those cases in which their startdate2 values are greater than 2016-01-07.

# Filter by startdate2 with pipe
filterdf <- mergeddf %>% filter(startdate2 > as.Date("2016-01-07"))

# Print filtered data frame
filterdf
## # A tibble: 5 × 10
##      id lastname   firstname startdate gender perf_q1 perf_q2 perf_q3 perf_q4 startdate2
##   <dbl> <chr>      <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl> <date>    
## 1   154 McDonald   Ronald    1/9/2016  male      NA      NA      NA      NA   2016-01-09
## 2   155 Smith      John      1/9/2016  male      NA      NA      NA      NA   2016-01-09
## 3   111 Newton     Isaac     1/9/2016  male       3.3     3.3     3.4     3.3 2016-01-09
## 4   201 Providence Cindy     1/9/2016  female     1.2     1.1     1       1   2016-01-09
## 5   282 Legend     John      1/9/2016  male       2.2     2.3     2.4     2.5 2016-01-09

18.2.4.2 Without Pipes

We can also filter using the filter function from the dplyr package without using the pipe (%>%) operator. Note how I simply move the name of the data frame object from before the pipe (%>%) operator to the first argument in the filter function. Everything else remains the same. For simplicity, I don’t display the output below as it is the same as the output as above using pipes. Your decision whether to use a pipe operator is completely up to you.

Let’s filter the mergeddf data frame object such that only those cases for which the gender variable is equal to “female” are retained. Note how we apply the equal to (==) logical operator. A table of logical operators is presented towards the beginning of this tutorial.

# Filter in by gender without pipe
filterdf <- filter(mergeddf, gender=="female")

# Print filtered data frame
filterdf

Now let’s filter out those cases in which gender is not equal to “female”. We apply the not equal to (!=) logical operator to do so.

# Filter in by gender without pipe
filterdf <- filter(mergeddf, gender!="female")

# Print filtered data frame
filterdf

Filter the data frame such that we retain those cases for which the perf_q2 variable is greater than (>) 4.0. Because the perf_q2 variable is numeric, we don’t put the value 4.0 in quotation marks.

# Filter by perf_q2 without pipe
filterdf <- filter(mergeddf, perf_q2>4.0)

# Print filtered data frame
filterdf

Using the logical “or” operator (|), select those cases for which gender is equal to “female” or for which perf_q2 is greater than 4.0.

# Filter by gender or perf_q2 without pipe
filterdf <- filter(mergeddf, gender=="female" | perf_q2>4.0)

# Print filtered data frame
filterdf

Using the logical “and” operator (&), select those cases for which gender is equal to “female” and for which perf_q2 is greater than 4.0. Note the difference in the resulting filtered data frame.

# Filter by gender and perf_q2 without pipe
filterdf <- filter(mergeddf, gender=="female" & perf_q2>4.0)

# Print filtered data frame
filterdf

Using the logical “or” operator (|), select those cases for which firstname is equal to “John” or for which firstname is equal to “Jane”. In other words, select those individuals whose names are either “John” or “Jane”.

# Filter by two values of firstname without pipe
filterdf <- filter(mergeddf, firstname=="John" | firstname=="Jane")

# Print filtered data frame
filterdf

Using the logical “or” operator (|), select the range of cases for which perf_q1 is less than equal to (<=) 2.5 or for which perf_q1 is greater than or equal (>=) to 4.0.

# Filter by two ranges of values of perf_q1 without pipe
filterdf <- filter(mergeddf, perf_q1<=2.5 | perf_q1>=4.0)

# Print filtered data frame
filterdf

The filter function can also be used to remove multiple specific cases (such as from a unique identifier variable), which might be useful when you’ve identified outliers that need to be removed. As a first step, identify a vector of values that need to be removed. In this example, let’s pretend that cases with id variable values of 198 and 201 no longer work for this company, so they should be removed from the sample. To create a vector of these two values, use the c function like this: c(198,201). Next, because you are now filtering by a vector, you will need to use the %in% operator, which is an operator that instructs R to go through each value of the filter variable (id) and identify instances of 198 and 201 (c(198,201)); if the values match, then those cases are retained. However, because we entered ! in front of the filter variable, this actually reverses our logic and instructs R to remove those cases in which a value of the filter variable matches a value contained in the vector.

# Filter out id of 198 and 201 without pipe
filterdf <- filter(mergeddf, !id %in% c(198,201))

# Print filtered data frame
filterdf

Or if you wish to retain only those cases for which the id variable is equal to 198 and 201, drop the not operator (!) from the previous script.

# Filter in id of 198 and 201 without pipe
filterdf <- filter(mergeddf, id %in% c(198,201))

# Print filtered data frame
filterdf

You can also drop specific cases one by one using the not equal to operator (!=) and the a unique identifier value associated with the case you wish to remove. We accomplish the same result as above but use two steps instead. Also, note that in the second step below, the new data frame object (filterdf) is used as the first argument because we want to retain the changes we made in the prior step (i.e., dropping case with id equal to 198).

# Filter in id of 198 without pipe
filterdf <- filter(mergeddf, id!=198)

# Filter in id of 201 without pipe
filterdf <- filter(filterdf, id!=201)

# Print filtered data frame
filterdf

When working with variables of type Date, things can get a bit trickier. When we applied the str function from base R (see above), we found that the startdate variable was read in and joined as a character variable as opposed to a date variable. As such, we need to convert the startdate variable using the as.Date function from base R. First, type the name of the data frame object (mergeddf), followed by the $ operator and the name of whatever you want to call the new variable (startdate2); remember, the $ operator tells R that a variable belongs to (or will belong to) a particular data frame. Second, type the <- operator. Third, type the name of the as.Date function. Fourth, in the function parentheses, as the first argument, enter the as.character function with the name of the data frame object (mergeddf), followed by the $ operator and the name the original variable (startdate) as the sole argument. Fifth, as the second argument in the as.Date function, type format="%m/%d/%Y" to indicate the format for the data variable; note that the capital Y in %Y implies a 4-digit year, whereas a lower case would imply a 2-digit year. To verify that the new startdate2 variable is of type date, on the next line, use the str function from base R, and enter the name of the data frame object (mergeddf) as the sole argument. As you will see, the new startdate2 variable is now of type Date.

# Convert character startdate variable to the date type startdate2 variable
mergeddf$startdate2 <- as.Date(as.character(mergeddf$startdate), format="%m/%d/%Y")

# Verify that the startdate2 variable is now a variable of type date
str(mergeddf)

Now we are ready to filter using the new startdate2 variable. When specify the value of the startdate2 variable by which you wish to filter by, make sure to use the as.Date function once more with the date (formatted as YYYY-MM-DD) in quotation marks (" ") as the sole argument. Here, I filter for those cases in which their startdate2 values are greater than 2016-01-07.

# Filter by startdate2 without pipe
filterdf <- filter(mergeddf, startdate2 > as.Date("2016-01-07"))

# Print filtered data frame
filterdf

18.2.5 Remove Single Variable from Data Frame

If you just need to remove a single variable from a data frame, using the NULL object in R in conjunction with the <- operator can designate which variable to drop. For example, if we wish to drop the startdate variable from the mergeddf data frame, we simply note that startdate belongs to mergeddf by joining them with $. Next, we set <- NULL adjacent to mergeddf$startdate to indicate that we wish to remove that variable from that data frame.

# Remove variable
mergeddf$startdate <- NULL

# Print updated data frame
mergeddf
## # A tibble: 9 × 9
##      id lastname   firstname gender perf_q1 perf_q2 perf_q3 perf_q4 startdate2
##   <dbl> <chr>      <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl> <date>    
## 1   153 Sanchez    Alejandro male       3.9     4.8     4.9     5   2016-01-01
## 2   154 McDonald   Ronald    male      NA      NA      NA      NA   2016-01-09
## 3   155 Smith      John      male      NA      NA      NA      NA   2016-01-09
## 4   165 Doe        Jane      female    NA      NA      NA      NA   2016-01-04
## 5   125 Franklin   Benjamin  male       2.1     1.9     2.1     2.3 2016-01-05
## 6   111 Newton     Isaac     male       3.3     3.3     3.4     3.3 2016-01-09
## 7   198 Morales    Linda     female     4.9     4.5     4.4     4.8 2016-01-07
## 8   201 Providence Cindy     female     1.2     1.1     1       1   2016-01-09
## 9   282 Legend     John      male       2.2     2.3     2.4     2.5 2016-01-09

18.2.6 Select Multiple Variables from Data Frame

If you wish to select multiple variables from a data frame (and remove all others), the select function from the dplyr package is quite useful and intuitive. Below, I demonstrate how to select multiple variables with and without pipes. If you don’t want to use pipes, feel free to skip down to the section called Without Pipes.

18.2.6.1 With Pipe

Using the pipe (%>%) operator, first, decide whether you want to override an existing data frame or create a new data frame based on our selection; here, I override the mergeddf data frame using the <- operator, which results in mergeddf <-. Second, type the name of the original data frame (mergeddf), followed by the pipe (%>%) operator. Third, type the name of the select function. Fourth, in the parentheses, list the names of the variables you wish to select as arguments; all variables that are not listed will be dropped. Here, we are selecting (to retain) the id, perf_q1, gender, lastname, and firstname variables. Note that the updated date frame includes the selected variables in the order in which you listed them.

# Select multiple variables with pipe
mergeddf <- mergeddf %>% select(id, perf_q1, gender, lastname, firstname)

# Print updated data frame
mergeddf
## # A tibble: 9 × 5
##      id perf_q1 gender lastname   firstname
##   <dbl>   <dbl> <chr>  <chr>      <chr>    
## 1   153     3.9 male   Sanchez    Alejandro
## 2   154    NA   male   McDonald   Ronald   
## 3   155    NA   male   Smith      John     
## 4   165    NA   female Doe        Jane     
## 5   125     2.1 male   Franklin   Benjamin 
## 6   111     3.3 male   Newton     Isaac    
## 7   198     4.9 female Morales    Linda    
## 8   201     1.2 female Providence Cindy    
## 9   282     2.2 male   Legend     John

18.2.6.2 Without Pipe

If you decide not to use the pipe (%>%) operator, the syntax remains almost the same except the name of the original data frame object (mergeddf) is moved from before the pipe (%>%) operator to the first argument in the select function. Everything else remains the same.

# Select multiple variables without pipe
mergeddf <- select(mergeddf, id, gender, lastname, firstname)

# Print updated data frame
mergeddf

18.2.7 Remove Multiple Variables from Data Frame

If you wish to remove multiple variables from a data frame, the select function from dplyr will work just fine. I demonstrate how to use this function with and without pipes. If you don’t want to use pipes, feel free to skip down to the section called Without Pipes.

18.2.7.1 With Pipe

Using the pipe (%>%) operator, first, decide whether you want to override an existing data frame or create a new data frame from the subset; here, I override the mergeddf data frame using the <- operator, which results in mergeddf <-. Second, type the name of the original data frame (mergeddf), followed by the pipe (%>%) operator. Third, enter the select function. Fourth, use the c (combine) function with - in front of it to note that you want to select all other variables except the ones listed in the c function.

# Remove multiple variables with pipe
mergeddf <- mergeddf %>% select(-c(lastname, firstname))

# Print updated data frame
mergeddf
## # A tibble: 9 × 3
##      id perf_q1 gender
##   <dbl>   <dbl> <chr> 
## 1   153     3.9 male  
## 2   154    NA   male  
## 3   155    NA   male  
## 4   165    NA   female
## 5   125     2.1 male  
## 6   111     3.3 male  
## 7   198     4.9 female
## 8   201     1.2 female
## 9   282     2.2 male

Removing a single variable can also be done using the select function. To do so, just list a single variable with - in front of it (as the sole argument) to indicate that you wish to drop that variable.

# Remove single variable with pipe
mergeddf <- mergeddf %>% select(-gender)

# Print updated data frame
mergeddf
## # A tibble: 9 × 2
##      id perf_q1
##   <dbl>   <dbl>
## 1   153     3.9
## 2   154    NA  
## 3   155    NA  
## 4   165    NA  
## 5   125     2.1
## 6   111     3.3
## 7   198     4.9
## 8   201     1.2
## 9   282     2.2

18.2.7.2 Without Pipe

If you decide not to use the pipe (%>%) operator, the syntax remains mostly the same except the name of the original data frame object (mergeddf) is moved from before the pipe (%>%) operator to the first argument in the select function. Everything else remains the same.

# Remove multiple variables without pipe
mergeddf <- select(mergeddf, -c(lastname, firstname))

# Print updated data frame
mergeddf

And here’s the non-pipe equivalent to removing a single variable using this approach.

# Remove single variable without pipe
mergeddf <- mergeddf %>% select(-gender)

# Print updated data frame
mergeddf

18.2.8 Summary

Applying filters and creating subsets of cases (rows) and variables (columns) from a data frame is an important part of data management. The dplyr package has two useful functions that can be used for these purposes: filter and select.

18.3 Chapter Supplement

In addition to the filter function from the dplyr package covered above, we can use the subset function from base R to subset cases from a data frame and to select cases from a data frame. Because this function comes from base R, we do not need to install and access an additional package like we do with the filter function, which some may prefer or find advantageous. Further, we can also apply the str_detect function from the stringr package with either the subset or the filter function to filter by a text pattern contained within a string (e.g., character) variable.

18.3.1 Video Tutorials

In addition to the written chapter supplement provided below, you can follow along with the following video tutorials to learn more about how to subset cases and select/remove variables using the subset function from base R.

Link to video tutorial: https://youtu.be/iM1e0wxUMrs

Link to video tutorial: https://youtu.be/kNpezEOx70g

18.3.2 Functions & Packages Introduced

Function Package
subset base R
str_detect stringr

18.3.3 Initial Steps

If required, please refer to the Initial Steps section from this chapter for more information on these initial steps.

# Set your working directory
setwd("H:/RWorkshop")
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
# Note that we will only be reading in one
# data frame object for this supplement
# ones we used in the main part of the chapter
personaldata <- read_csv("PersonalData.csv")
## Rows: 8 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): lastname, firstname, startdate, gender
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(personaldata)
## [1] "id"        "lastname"  "firstname" "startdate" "gender"
# Print data frame (tibble) object
print(personaldata)
## # A tibble: 8 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  man   
## 2   154 McDonald   Ronald    1/9/2016  man   
## 3   155 Smith      John      1/9/2016  man   
## 4   165 Doe        Jane      1/4/2016  woman 
## 5   111 Newton     Isaac     1/9/2016  man   
## 6   198 Morales    Linda     1/7/2016  woman 
## 7   201 Providence Cindy     1/9/2016  woman 
## 8   282 Legend     John      1/9/2016  man

18.3.4 subset Function from Base R

As an alternative to the filter function from the dplyr package, we will learn how to use the subset function from base R to filter cases from a data frame and to select or remove variables from a data frame.

18.3.4.1 Filter (Subset) Cases from Data Frame

We’ll begin by filtering cases from a data frame object. As a reminder, in R, we can apply any one of the following logical operators when filtering cases from a data frame or table object.

Logical Operator Definition
< “less than”
> “greater than”
<= “less than or equal to”
>= “greater than or equal to”
== “equal to”
!= “not equal to”
| “or”
& “and”
! “not”

To filter (subset) cases from a data frame object using the subset function from base R, we will take the following steps:

  1. We’ll use the <- assignment operator to name the filtered data frame that we are about to create. For this example, I chose to create a new data frame object (sub_personaldata), which I specified to the left of the <- operator; that being said, you could name the new data frame object whatever you would like.
  2. To the right of the <- operator, type the name of subset function from base R.
  • As the first argument in the function, type the name of the data frame we created above (personaldata).
  • As the second argument, type the name of the variable we wish to filter the data frame by, which in this example is gender followed by a logical (conditional) argument. For this example, we wish to to retain only those cases in which gender is equal to “woman”, and we do so using this logical argument gender == "woman". Because the gender variable is of type character, we need to put quotation marks (" ") around the variable value (i.e., text) that we wish to filter by. Remember, object names in R are case and space sensitive; for instance, gender is different from Gender, and “woman” is different from “Woman”.
# Filter (subset) by gender
sub_personaldata <- subset(personaldata, gender == "woman")

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 3 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   165 Doe        Jane      1/4/2016  woman 
## 2   198 Morales    Linda     1/7/2016  woman 
## 3   201 Providence Cindy     1/9/2016  woman

Note how the data frame above contains only those cases with “woman” as their gender variable designation. The filter worked as expected.

Alternatively, we could filter out (subset out) those cases in which gender is equal to “woman” using the != (not equal to) logical operator. Instead of overwriting the existing data frame object (personaldata), let’s assign the filtered data frame object to a new object that we’ll call sub_personaldata.

# Filter (subset) by gender
sub_personaldata <- subset(personaldata, gender != "woman")

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 5 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   153 Sanchez  Alejandro 1/1/2016  man   
## 2   154 McDonald Ronald    1/9/2016  man   
## 3   155 Smith    John      1/9/2016  man   
## 4   111 Newton   Isaac     1/9/2016  man   
## 5   282 Legend   John      1/9/2016  man

Note how cases with gender equal to “woman” are no longer in the data frame, while every other case is retained.

Let’s now filter (subset) by a variable of type numeric/integer. Specifically, let’s select those cases in which the id variable is greater than (>) 154. Because the id variable consists of type numeric/integer, we won’t use quotation marks (" ") around the value we wish to filter by, which in this case is 154. As we did above, let’s assign the filtered data frame object to an object that we’ll call sub_personaldata; this will overwrite the existing object called sub_personaldata in our R Environment.

# Filter (subset) by id
sub_personaldata <- subset(personaldata, id > 154)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 5 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   155 Smith      John      1/9/2016  man   
## 2   165 Doe        Jane      1/4/2016  woman 
## 3   198 Morales    Linda     1/7/2016  woman 
## 4   201 Providence Cindy     1/9/2016  woman 
## 5   282 Legend     John      1/9/2016  man

If we wish to filter (subset) by two variables, we can apply the logical “or” (|) operator or “and” (&) operator. First, let’s select those cases in which either gender is equal to “woman” or id is greater than 154 using the logical “or” (|) operator. This will retain those cases for whom at least one logical statement is true. Once again, let’s assign the filtered data frame object to an object that we’ll call sub_personaldata; this will overwrite the existing object called sub_personaldata in our R Environment.

# Filter (subset) by gender OR id (application of logical OR operator)
sub_personaldata <- subset(personaldata, gender == "woman" | id > 154)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 5 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   155 Smith      John      1/9/2016  man   
## 2   165 Doe        Jane      1/4/2016  woman 
## 3   198 Morales    Linda     1/7/2016  woman 
## 4   201 Providence Cindy     1/9/2016  woman 
## 5   282 Legend     John      1/9/2016  man

Now watch what happens if we apply the logical “and” (&) operator by keeping everything the same but swapping out the logical “or” (|) operator with the logical “and” (&) operator. This will retain only those cases for whom both logical statements are true.

# Filter (subset) by gender AND id (application of logical AND operator)
sub_personaldata <- subset(personaldata, gender == "woman" & id > 154)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 3 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   165 Doe        Jane      1/4/2016  woman 
## 2   198 Morales    Linda     1/7/2016  woman 
## 3   201 Providence Cindy     1/9/2016  woman

We can also use the logical “or” (|) operator to select two values of the same variable. In this example, we will select cases for whom either firstname is equal to “John” or firstname is equal to “Jane”.

# Filter (subset) by two values of firstname using logical OR
sub_personaldata <- subset(personaldata, firstname == "John" | firstname == "Jane")

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 3 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   155 Smith    John      1/9/2016  man   
## 2   165 Doe      Jane      1/4/2016  woman 
## 3   282 Legend   John      1/9/2016  man

We can select two ranges of values from the same variable using the logical “or” (|) operator, assuming the variable is of type numeric, integer, or date. In this example, we will retain the cases for whom either logical statement is true: id is less than or equal to 154 or id is greater than or equal to 198.

# Filter (subset) by two ranges of values for id using logical OR
sub_personaldata <- subset(personaldata, id <= 154 | id >= 198)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 6 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   153 Sanchez    Alejandro 1/1/2016  man   
## 2   154 McDonald   Ronald    1/9/2016  man   
## 3   111 Newton     Isaac     1/9/2016  man   
## 4   198 Morales    Linda     1/7/2016  woman 
## 5   201 Providence Cindy     1/9/2016  woman 
## 6   282 Legend     John      1/9/2016  man

Alternatively, can select a single range of values within lower and upper bounds from the same variable by using the logical “and” (&) operator, assuming the variable is of type numeric, integer, or date. In this example, we will retain the cases for whom both logical statements are true: id is greater than or equal to 154 and id is less than or equal to 198.

# Filter (subset) by single range of values for id using logical AND
sub_personaldata <- subset(personaldata, id >= 154 & id <= 198)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 4 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   154 McDonald Ronald    1/9/2016  man   
## 2   155 Smith    John      1/9/2016  man   
## 3   165 Doe      Jane      1/4/2016  woman 
## 4   198 Morales  Linda     1/7/2016  woman

The subset function can also be used to remove multiple specific cases (such as from a unique identifier variable), which might be useful when you’ve identified outliers that need to be removed. As a first step, identify a vector of values that need to be removed. In this example, let’s pretend that cases with id variable values of 198 and 201 no longer work for this company, so they should be removed from the sample. To create a vector of these two values, use the c function like this: c(198,201). Next, because you are now filtering by a vector, you will need to use the %in% operator, which is an operator that instructs R to go through each value of the filter variable (id) and identify instances of 198 and 201 (c(198,201)); if the values match, then those cases are retained. However, because we entered ! in front of the filter variable, this actually reverses our logic and instructs R to remove those cases in which a value of the filter variable matches a value contained in the vector.

# Filter out (subset out) id of 198 and 201
sub_personaldata <- subset(personaldata, !id %in% c(198,201))

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 6 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   153 Sanchez  Alejandro 1/1/2016  man   
## 2   154 McDonald Ronald    1/9/2016  man   
## 3   155 Smith    John      1/9/2016  man   
## 4   165 Doe      Jane      1/4/2016  woman 
## 5   111 Newton   Isaac     1/9/2016  man   
## 6   282 Legend   John      1/9/2016  man

Note that in the output above cases with id variable values equal to 198 and 201 are no longer present.

If we remove the ! in front of the filter (subset) variable, only cases 198 and 201 are retained.

# Filter in (subset in, select) id of 198 and 201
sub_personaldata <- subset(personaldata, id %in% c(198,201))

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 2 × 5
##      id lastname   firstname startdate gender
##   <dbl> <chr>      <chr>     <chr>     <chr> 
## 1   198 Morales    Linda     1/7/2016  woman 
## 2   201 Providence Cindy     1/9/2016  woman

We can also drop specific cases one by one using the not equal to logical operator (!=) and the a unique identifier value associated with the case you wish to remove. We accomplish the same result as above but use two steps instead. Also, note that in the second step below, the new data frame object (sub_personaldata) is used as the first argument in the subset function because we want to retain the changes we made in the prior step (i.e., dropping case with id equal to 198).

# Filter out (subset out) id of 198
sub_personaldata <- subset(personaldata, id != 198)

# Filter out (subset out) id of 201
sub_personaldata <- subset(sub_personaldata, id != 201)

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 6 × 5
##      id lastname firstname startdate gender
##   <dbl> <chr>    <chr>     <chr>     <chr> 
## 1   153 Sanchez  Alejandro 1/1/2016  man   
## 2   154 McDonald Ronald    1/9/2016  man   
## 3   155 Smith    John      1/9/2016  man   
## 4   165 Doe      Jane      1/4/2016  woman 
## 5   111 Newton   Isaac     1/9/2016  man   
## 6   282 Legend   John      1/9/2016  man

When working with variables of type Date, things can get a bit trickier. When we applied the str function from base R (see above), we found that the startdate variable was read in and joined as a character variable as opposed to a date variable. As such, we need to convert the startdate variable using the as.Date function from base R.

  1. Begin by typing the name of the data frame object (personaldata), followed by the $ operator and the name of whatever you would like to name the new variable (startdate2); remember, the $ operator tells R that a variable belongs to (or will belong to) a particular data frame.
  2. Type the <- assignment operator.
  3. Type the name of the as.Date function.
  • As the first argument in the function in the as.Date function, type the name of the as.character function with the name of the data frame object (personaldata), followed by the $ operator and the name the original variable (startdate) as the sole argument within the as.character function. Note that we are nesting the as.character function within the as.Date function, and due to order of operations, the as.character function will be run first, followed by the startdate function.
  • As the second argument in the as.Date function, type format="%m/%d/%Y" to indicate the format for the data variable; note that the capital Y in %Y implies a 4-digit year, whereas a lower case would imply a 2-digit year.
# Convert character startdate variable to the Date type startdate2 variable
personaldata$startdate2 <- as.Date(as.character(personaldata$startdate), format="%m/%d/%Y")

To verify that the new startdate2 variable is of type Date, use the str function from base R, and type the name of the data frame object (personaldata) as the sole argument. As you will see, the new startdate2 variable is now of type Date.

# Verify that the startdate2 variable is now a variable of type Date
str(personaldata)
## spc_tbl_ [8 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id        : num [1:8] 153 154 155 165 111 198 201 282
##  $ lastname  : chr [1:8] "Sanchez" "McDonald" "Smith" "Doe" ...
##  $ firstname : chr [1:8] "Alejandro" "Ronald" "John" "Jane" ...
##  $ startdate : chr [1:8] "1/1/2016" "1/9/2016" "1/9/2016" "1/4/2016" ...
##  $ gender    : chr [1:8] "man" "man" "man" "woman" ...
##  $ startdate2: Date[1:8], format: "2016-01-01" "2016-01-09" "2016-01-09" "2016-01-04" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   lastname = col_character(),
##   ..   firstname = col_character(),
##   ..   startdate = col_character(),
##   ..   gender = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Now we are ready to filter (subset) using the new startdate2 variable. When specifying the value of the startdate2 variable by which you wish to filter by, make sure to use the as.Date function once more with the date (formatted as YYYY-MM-DD) in quotation marks (" ") as the sole argument. Here, I select those cases for whom their startdate2 values are greater than 2016-01-07 – or in other words, those cases who started after January 1, 2016.

# Filter (subset) by startdate2
sub_personaldata <- subset(personaldata, startdate2 > as.Date("2016-01-07"))

# Print filtered (subsetted) data frame
print(sub_personaldata)
## # A tibble: 5 × 6
##      id lastname   firstname startdate gender startdate2
##   <dbl> <chr>      <chr>     <chr>     <chr>  <date>    
## 1   154 McDonald   Ronald    1/9/2016  man    2016-01-09
## 2   155 Smith      John      1/9/2016  man    2016-01-09
## 3   111 Newton     Isaac     1/9/2016  man    2016-01-09
## 4   201 Providence Cindy     1/9/2016  woman  2016-01-09
## 5   282 Legend     John      1/9/2016  man    2016-01-09

18.3.4.2 Select Single Variable from Data Frame

To display a single variable from a data frame in our Console, within the subset function, we can do the following:

  1. We’ll use the <- assignment operator to name the data frame that we are about to create. For this example, I chose to create a new data frame object called tempdf, which I placed to the left of the <- assignment operator; that being said, you could name the new data frame object whatever you would like – or you could overwrite the existing data frame object.
  2. To the right of the <- operator, type the name of subset function from base R.
  • As the first argument in the function, type the name of the data frame object from which we wish to select a single variable (personaldata)
  • As the second argument, type select= followed by the name of a single variable (startdate) we wish to select.
# Select only one variable from a data frame (startdate)
tempdf <- subset(personaldata, select=startdate)

# Print data frame
print(tempdf)
## # A tibble: 8 × 1
##   startdate
##   <chr>    
## 1 1/1/2016 
## 2 1/9/2016 
## 3 1/9/2016 
## 4 1/4/2016 
## 5 1/9/2016 
## 6 1/7/2016 
## 7 1/9/2016 
## 8 1/9/2016

18.3.4.3 Select Multiple Variables from Data Frame

If our goal is to select multiple variables from a data frame (and remove all others), we can use the subset function as follows. As we did above, name new data frame object using the <- operator, and we will overwrite the data frame object we created above called tempdf. As the first argument in the subset function, type the name of your original data frame object (mergeddf). As the second argument, type select= followed by a vector of variable name you wish to select/retain. The order in which you enter the variable names will correspond to the order in which they appear in the new data frame object. Use the c (combine) function from base R with each variable name you wish to select as arguments separated by commas. Here we select the lastname, firstname, and gender variables.

# Select multiple variables (lastname, firstname, gender)
tempdf <- subset(personaldata, select=c(lastname, firstname, gender))

# Print data frame
print(tempdf)
## # A tibble: 8 × 3
##   lastname   firstname gender
##   <chr>      <chr>     <chr> 
## 1 Sanchez    Alejandro man   
## 2 McDonald   Ronald    man   
## 3 Smith      John      man   
## 4 Doe        Jane      woman 
## 5 Newton     Isaac     man   
## 6 Morales    Linda     woman 
## 7 Providence Cindy     woman 
## 8 Legend     John      man

18.3.4.4 Remove Single Variable from Data Frame

If you need to remove a single variable from a data frame, you can simply type the minus (-) operator before the name of the variable you wish to remove. Here, we remove the startdate variable.

# Remove one variable from a data frame (startdate)
tempdf <- subset(personaldata, select=-startdate)

# Print data frame
print(tempdf)
## # A tibble: 8 × 5
##      id lastname   firstname gender startdate2
##   <dbl> <chr>      <chr>     <chr>  <date>    
## 1   153 Sanchez    Alejandro man    2016-01-01
## 2   154 McDonald   Ronald    man    2016-01-09
## 3   155 Smith      John      man    2016-01-09
## 4   165 Doe        Jane      woman  2016-01-04
## 5   111 Newton     Isaac     man    2016-01-09
## 6   198 Morales    Linda     woman  2016-01-07
## 7   201 Providence Cindy     woman  2016-01-09
## 8   282 Legend     John      man    2016-01-09

18.3.4.5 Remove Multiple Variables from Data Frame

If you wish to remove multiple variables from a data frame, you can apply the same syntax as you did when selecting multiple variables, except insert a minus (-) operator in from the of the c function. This tells the function to not select those variables. Here, we remove the lastname and firstname variables from the personaldata data frame object and assign the resulting data frame to an object called tempdf.

# Remove multiple variables (lastname, firstname)
tempdf <- subset(personaldata, select= -c(lastname, firstname))

# Print data frame
print(tempdf)
## # A tibble: 8 × 4
##      id startdate gender startdate2
##   <dbl> <chr>     <chr>  <date>    
## 1   153 1/1/2016  man    2016-01-01
## 2   154 1/9/2016  man    2016-01-09
## 3   155 1/9/2016  man    2016-01-09
## 4   165 1/4/2016  woman  2016-01-04
## 5   111 1/9/2016  man    2016-01-09
## 6   198 1/7/2016  woman  2016-01-07
## 7   201 1/9/2016  woman  2016-01-09
## 8   282 1/9/2016  man    2016-01-09

18.3.5 Filter by Pattern Contained within String

In some cases, we may wish to filter cases from a data frame object based on a pattern contained within a string (i.e., text, characters). For example, using the personaldata data frame object we created above, perhaps we would like to select those cases for which their firstname string (i.e., value) contains a capital (“J”). To do so, we can use the str_detect function from the stringr package (Wickham 2019) within either the subset function from base R or the filter function from the dplyr package.

To get started, make sure that you have installed and accessed the stringr package.

# Install stringr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("stringr")
# Access stringr package
library(stringr)

Given that this chapter supplement has focused thus far on the subset function from base R, let’s continue to use that function, but please note that you could just as easily use the filter function from the dplyr package.

  1. Using the <- operator, we’ll assign the resulting subset data frame to an object that I’m calling sub_personaldata.
  2. To the right of the <- operator, type the name of the subset function
  • As the first argument in the subset function, type the name of the original data frame object that we’ve been working with called personaldata.
  • As the second argument, type the name of the str_detect function. As the first argument within the str_detect function, type the name of the variable we wish to filter by, which in this example is firstname; as the second argument and within quotation marks (" "), type a pattern you would like to detect within text strings from the firstname variable. For this example, let’s detect any text string containing an uppercase “J” while noting that case sensitivity matters (e.g., “J” vs. “j”). In other words, we are filtering the data frame such that we will retain only those cases for which their firstname variable text strings (i.e., values) contain an uppercase “J”.
# Select cases for which firstname variable contains a "J"
# Note that case sensitivity matters (e.g., "j" vs. "J")
sub_personaldata <- subset(personaldata, str_detect(firstname, "J"))

Now let’s print the new data frame object we created.

# Print the data frame object
print(sub_personaldata)
## # A tibble: 3 × 6
##      id lastname firstname startdate gender startdate2
##   <dbl> <chr>    <chr>     <chr>     <chr>  <date>    
## 1   155 Smith    John      1/9/2016  man    2016-01-09
## 2   165 Doe      Jane      1/4/2016  woman  2016-01-04
## 3   282 Legend   John      1/9/2016  man    2016-01-09

References

Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A Forward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.
———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.