# Chapter 35 Evaluating a Post-Test-Only with Two Comparison Groups Design Using One-Way ANOVA

In this chapter, we learn about the post-test-only with two comparison groups training evaluation design and how a one-way analysis of variance (ANOVA) can be used to analyze the data acquired from this design. We’ll begin with conceptual overviews of this training evaluation design and of the one-way ANOVA, and then we’ll conclude with a tutorial.

## 35.1 Conceptual Overview

In this section, we will begin with a description of the post-test-only with two comparison groups training evaluation design. We then review the one-way analysis of variance (ANOVA), including discussions of statistical assumptions, the omnibus *F*-test, post-hoc pairwise mean comparisons, statistical significance, and practical significance. The section wraps up with a sample write-up of a one-way ANOVA used to evaluate data from a post-test-only with two comparison groups training evaluation design.

### 35.1.1 Review of Post-Test-Only with Two Comparison Groups Design

In a **post-test-only with two comparison groups** training evaluation design (i.e., research design), employees are assigned (randomly or non-randomly) to either a treatment group (e.g., new training program), or one of two comparison groups (e.g., old training program and control group), and every participating employee is assessed on selected training outcomes (i.e., measures) after the training has concluded. If *random assignment* to groups is used, then a post-test-only with two comparison groups design is considered *experimental*. Conversely, if *non-random assignment* to groups is used, then the design is considered *quasi-experimental*. Regardless of whether random or non-random assignment is used, a one-way analysis of variance (ANOVA) can be used to analyze the data from a post-test-only with two comparison groups design, provided key statistical assumptions are satisfied.

As with any evaluation design, there are limitations to the inferences and conclusions we can draw from a post-test-only with two comparison groups design. As a strength, this design includes two comparison groups, and if coupled with random assignment to groups, then the design qualifies as a true experimental design. With that being said, if we use *non*-random assignment to the different groups, then we are less likely to have equivalent groups of individuals who enter each group, which may bias how they engage in the demands of their respective group and how they complete the outcome measures. Further, because this design lacks a pre-test (i.e., assessment of initial performance on the outcome measures), we cannot be confident that employees in the three groups “started in the same place” with respect to the outcome(s) we might measure at post-test. Consequently, any differences we observe between the three groups on a post-test outcome measure may reflect pre-existing differences – meaning, the training may not have “caused” the differences that are apparent at post-test.

### 35.1.2 Review of One-Way ANOVA

Link to conceptual video: https://youtu.be/kwwU9N8WCfQ

**Analysis of variance (ANOVA)** is part of a family of analyses aimed at comparing means, which includes one-way ANOVA, repeated-measures ANOVA, factorial ANOVA, and mixed-factorial ANOVA. In general, an ANOVA is used to compare three or more means; however, it can be used to compare two means on a single factor – but you might as well just use an independent-samples *t*-test if this is the case. When comparing means, an omnibus *F*-test is employed to determine whether there are any differences in means across the levels of the categorical (nominal, ordinal) predictor variable. In other words, with an ANOVA, we are attempting to reject the null hypothesis that there are no mean differences across levels of the categorical predictor variable. It is important to remember, however, that the *F*-test is an omnibus test, which means that its *p*-value only indicates whether mean differences exist across two or more means and not where those specific differences in means exist. Typically, post-hoc pairwise comparison tests can be used to uncover which specific pairs of levels (categories) show differences in means.

The **one-way ANOVA** refers to one of the most basic forms of ANOVA – specifically, an ANOVA in which there is only a single categorical predictor variable and a continuous outcome variable. The term “one-way” indicates that there is just a single *factor* (i.e., predictor variable, independent variable). If we were to have two categorical predictor variables, then we could call the corresponding analysis a *factorial ANOVA* or more specifically a *two-way ANOVA*. A one-way ANOVA is employed to test the equality of two or more means on a continuous (interval, ratio) outcome variable (i.e., dependent variable) all at once by using information about the variances. For a one-way ANOVA the null hypothesis is typically that all means are equal, or rather, there are no differences between the means. More concretely, an *F*-test is used as an omnibus test for a one-way ANOVA. In essence, the *F*-test reflects the between-level (between-group, between-category) variance divided by the within-group variance. To calculate the degrees of freedom (*df*) for the numerator (between-group variance), we subtract 1 from the number of groups (*df* = *k* - 1). To calculate the *df* for the denominator (within-group variance), we subtract the number of groups from the number of people in the overall sample (*df* = *n* - *k*).

*Note:* In this chapter, we will focus exclusively on applying a one-way ANOVA with balanced groups, where *balanced groups* means that each group (i.e., category) has the same number of independent cases. The default approach to calculating sum of squares in most R ANOVA functions is to use what is often referred to as Type I sum of squares. Type II and Type III sum of squares refer to the other approaches to calculating sum of squares for an ANOVA. Results are typically similar between Type I, Type II, and Type III approaches when the data are balanced across groups designated by the factors (i.e., predictor variables). To use Type II and Type III sum of squares, I recommend that you use the `Anova` function from the `car` package, which we will not cover in this tutorial. Because we are only considering a single between-subjects factor (i.e., one-way ANOVA) and have a balanced design, we won’t concern ourselves with this distinction. If, however, you wish to extend the one-way ANOVA to a *two*-way ANOVA, I recommend checking out this link to the R-Bloggers site.

To illustrate the computational underpinnings of a one-way ANOVA, we can dive into some formulas for the grand mean, total variation, between-group variation, within-group variation, and the *F*-test. For the sake of simplicity, I am presenting formulas in which we will assume equal sample sizes for each level of the categorical predictor variable (i.e., equal sample sizes for each independent sample or group); this is often referred to as a *balanced design*, as noted above.

**Grand Mean:** The formula for the grand mean is as follows:

\(\overline{Y}_{..} = \frac{\sum Y_{ij}}{n}\)

where \(\overline{Y}_{..}\) is the grand mean, \(Y_{ij}\) represents a score on the outcome variable, and \(n\) is the total sample size.

**Sum of Squares Between Groups (Between-Group Variation):** The formula for the sum of squares between groups is as follows:

\(SS_{between} = \sum n_{j}(\overline{Y}_{.j} - \overline{Y}_{..})^{2}\)

where \(SS_{between}\) refers to the sum of squares between groups, \(n_{j}\) is the sample size for each group (i.e., level of the categorical predictor variable) assuming equal sample sizes, \(\overline{Y}_{.j}\) represents each group’s mean on the continuous outcome variable, and \(\overline{Y}_{..}\) is the grand mean for the outcome variable (i.e., sample mean). In essence, \(SS_{between}\) represents variation between the group means. We can compute the variance between groups by dividing \(SS_{between}\) by the between-groups degrees of freedom (*df* = *k* - 1), as shown below.

**Variance Between Groups:** The formula for the variance between groups is as follows:

\(s_{between}^{2} = \frac{SS_{between}}{k-1}\)

where \(s_{between}^{2}\) refers to the variance between groups, \(SS_{between}\) refers to the sum of squares between groups, \(k\) is the number of levels (categories, groups) for the categorical predictor variable, and \(k-1\) is the between-groups *df*.

**Sum of Squares Within Groups (Within-Group Variation):** The formula for the sum of squares within groups is as follows:

\(SS_{within} = \sum \sum (Y_{ij} - \overline{Y}_{.j})^{2}\)

where \(SS_{within}\) refers to the sum of squares within groups (or error variation), \(Y_{ij}\) represents a score on the continuous outcome variable, and \(\overline{Y}_{.j}\) represents each group’s mean. In essence, \(SS_{within}\) represents variation within the groups. We can compute the variance within groups by dividing \(SS_{within}\) by the within-groups degrees of freedom (*df* = *n* - *k*), as shown below.

**Variance Within Groups:** The formula for the variance within groups is as follows:

\(s_{within}^{2} = \frac{SS_{within}}{n-k}\)

where \(s_{within}^{2}\) refers to the variance within groups, \(SS_{within}\) refers to the sum of squares within groups, \(n\) is the total sample size, \(k\) is the number of groups (i.e., levels of the categorical predictor variable), and \(n-k\) is the within-groups *df*.

***F*-value:** We can compute the *F*-value associated with the omnibus test of means using the following formula:

\(F = \frac{s_{between}^{2}}{s_{within}^{2}}\)

where \(F\) is the *F*-test value, \(s_{between}^{2}\) is the variance between groups, and \(s_{within}^{2}\) is the variance within groups.
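To make the formulas above concrete, the following sketch computes each quantity by hand in R and then cross-checks the resulting *F*-value against the `aov` function from base R. Please note that the scores are simulated for illustration (they are not from this chapter’s data file), and the object names (e.g., `demo`, `ss_between`, `var_within`) are my own placeholders.

```r
# Simulated, balanced data: 3 groups of 25 (hypothetical scores,
# not the chapter's data file)
set.seed(35)
demo <- data.frame(
  Condition = rep(c("New", "No", "Old"), each = 25),
  PostTest  = c(rnorm(25, mean = 72, sd = 8),
                rnorm(25, mean = 62, sd = 8),
                rnorm(25, mean = 70, sd = 8))
)

n <- nrow(demo)                       # total sample size
k <- length(unique(demo$Condition))   # number of groups

# Grand mean, group means, and group sample sizes
grand_mean  <- mean(demo$PostTest)
group_means <- tapply(demo$PostTest, demo$Condition, mean)
group_n     <- tapply(demo$PostTest, demo$Condition, length)

# Sums of squares between and within groups
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((demo$PostTest - group_means[demo$Condition])^2)

# Variances (mean squares), F-value, and p-value
var_between <- ss_between / (k - 1)
var_within  <- ss_within / (n - k)
f_value     <- var_between / var_within
p_value     <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check the hand-computed F-value against base R's aov function
f_aov <- summary(aov(PostTest ~ Condition, data = demo))[[1]][["F value"]][1]
all.equal(f_value, f_aov)
```

If the hand computation follows the formulas correctly, the final line should return `TRUE`, which indicates that the two *F*-values agree within numerical tolerance.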

#### 35.1.2.1 Post-Hoc Pairwise Mean Comparison Tests

As a reminder, the *F*-test indicates whether differences between the group means exist, but it doesn’t indicate which specific pairs of groups differ with respect to their means. Thus, we use post-hoc pairwise mean comparison tests to evaluate which groups have (or do not have) significantly different means and the direction of those differences. Post-hoc tests like Tukey’s test and Fisher’s test help us account for what is called *family-wise error*, where family-wise error refers to the increased likelihood of making Type I errors (i.e., finding something that doesn’t really exist in the population; false positive) because we are running multiple pairwise comparisons and may capitalize on chance. Other tests like Dunnett’s *C* are useful when the assumption of equal variances cannot be met. Essentially, the post-hoc pairwise mean comparison tests are independent-samples *t*-tests that account for the fact that we are making multiple comparisons, resulting in adjustments to the associated *p*-values.
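As a rough illustration of how post-hoc comparisons adjust *p*-values, the sketch below applies two base R functions, `TukeyHSD` and `pairwise.t.test`, to simulated scores. This is not the approach used later in the tutorial (which relies on the `lessR` package); the data and the object names (`demo`, `fit`, `tukey`) are hypothetical.

```r
# Simulated, balanced data: 3 groups of 25 (hypothetical scores)
set.seed(35)
demo <- data.frame(
  Condition = factor(rep(c("New", "No", "Old"), each = 25)),
  PostTest  = c(rnorm(25, mean = 72, sd = 8),
                rnorm(25, mean = 62, sd = 8),
                rnorm(25, mean = 70, sd = 8))
)

# Fit the one-way ANOVA, then request Tukey's pairwise comparisons
fit   <- aov(PostTest ~ Condition, data = demo)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of groups: mean difference, confidence limits,
# and the p-value adjusted for multiple comparisons
tukey$Condition

# Alternative: pairwise t-tests with a Bonferroni adjustment
pairwise.t.test(demo$PostTest, demo$Condition,
                p.adjust.method = "bonferroni")
```

With three groups there are three possible pairs, so each approach reports three adjusted *p*-values.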

#### 35.1.2.2 Statistical Assumptions

The statistical assumptions that should be met prior to running and/or interpreting estimates from a one-way ANOVA include:

- The outcome (dependent, response) variable has a univariate normal distribution in each of the two or more underlying populations (e.g., samples, groups, conditions), which correspond to the two or more categories (levels, groups) of the predictor (independent, explanatory) variable;
- The variances of the outcome (dependent, response) variable are equal across the two or more populations (e.g., levels, groups, categories), which is often called the equality of variances or homogeneity of variances assumption.

#### 35.1.2.3 Statistical Significance

As noted above, for a one-way ANOVA, we use an **omnibus *F*-test** and associated *p*-value to test whether there are statistically significant differences across group means. Using null hypothesis significance testing (NHST), we interpret a *p*-value that is *less than .05* (or whatever two- or one-tailed alpha level we set) to meet the standard for statistical significance, meaning that we reject the null hypothesis that the differences between the two or more means are equal to zero. In other words, if the *p*-value is less than .05, we conclude that there are statistically significant differences across the group means. In contrast, if the *p*-value is *equal to or greater than .05*, then we fail to reject the null hypothesis that there are no differences across the two or more means.

If our omnibus *F*-test is found to be statistically significant, then we will typically move ahead by performing **post-hoc pairwise comparison tests** (e.g., Tukey’s test, Fisher’s test, Dunnett’s *C*) and examine the *p*-value associated with each pairwise comparison test. We interpret a *p*-value that is *less than .05* (or whatever two- or one-tailed alpha level we set) to meet the standard for statistical significance, meaning that we reject the null hypothesis that the difference between the two means is equal to zero. In other words, if the *p*-value is less than .05, we conclude that the two means differ from each other to a statistically significant extent. In contrast, if the *p*-value is *equal to or greater than .05*, then we fail to reject the null hypothesis that the difference between the two means is equal to zero. As noted above, typically, the pairwise comparison tests adjust the *p*-values for family-wise error to reduce the likelihood of making Type I errors (i.e., false positives).

When setting an alpha threshold, such as the conventional two-tailed .05 level, sometimes the question comes up regarding whether borderline *p*-values signify significance or nonsignificance. For our purposes, let’s be very strict in our application of the chosen alpha level. For example, if we set our alpha level at .05, *p* = .049 would be considered statistically significant, and *p* = .050 would be considered statistically nonsignificant.
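The strict decision rule described above can be written as a one-line helper in R. The function name `is_significant` is hypothetical and is used here only for illustration.

```r
# Strict decision rule: significant only if p is less than alpha
is_significant <- function(p, alpha = .05) {
  p < alpha
}

# Borderline examples from the text
is_significant(c(.049, .050))
# returns TRUE FALSE
```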

#### 35.1.2.4 Practical Significance

A significant omnibus *F*-test and associated *p*-value only tell us that the means in question differ to a statistically significant extent across groups (e.g., levels, categories). They do not, however, tell us about the magnitude of the difference across means – or in other words, the practical significance. Fortunately, there are multiple model-level effect size indicators, like \(R^{2}\), \(\eta^{2}\), \(\omega^{2}\), and Cohen’s \(f\). All of these provide an indication of the amount of variance explained by the predictor variable in the outcome variable. In the table below, I provide some qualitative descriptors that we can apply when interpreting the magnitude of one of the effect size indicators. Please note that typically we only interpret practical significance when the *F*-test indicates statistical significance.

| \(R^{2}\) | \(\eta^{2}\) | \(\omega^{2}\) | Cohen’s \(f\) | Description |
|---|---|---|---|---|
| .01 | .01 | .01 | .10 | Small |
| .09 | .09 | .09 | .25 | Medium |
| .25 | .25 | .25 | .40 | Large |
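To show where these model-level effect sizes come from, here is a minimal sketch that computes them from the sums of squares and degrees of freedom reported in the ANOVA table of this chapter’s tutorial output. The object names are my own, and I include two versions of Cohen’s \(f\): one based on \(\eta^{2}\) (a common textbook definition) and one based on \(\omega^{2}\), which appears to reproduce the value printed by the `ANOVA` function from `lessR`.

```r
# Values taken from the ANOVA table in this chapter's tutorial output
ss_between <- 1333.63   # Sum Sq for Condition
ss_within  <- 4727.52   # Sum Sq for Residuals
df_between <- 2
df_within  <- 72

ss_total  <- ss_between + ss_within
ms_within <- ss_within / df_within

# Eta squared (equal to R squared in a one-way ANOVA)
eta_sq <- ss_between / ss_total

# Omega squared (a less biased estimate of variance explained)
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Cohen's f based on eta squared and on omega squared
f_eta   <- sqrt(eta_sq / (1 - eta_sq))
f_omega <- sqrt(omega_sq / (1 - omega_sq))

round(c(eta_sq = eta_sq, omega_sq = omega_sq,
        f_eta = f_eta, f_omega = f_omega), 3)
```

The rounded values for `eta_sq`, `omega_sq`, and `f_omega` can be compared against the *R Squared*, *Omega Squared*, and *Cohen’s f* lines in the tutorial output.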

After finding a statistically significant omnibus *F*-test, it is customary to then compute post-hoc pairwise comparisons between specific means to determine which pairs of means differ to a statistically significant extent. If a pair of means is found to show a statistically significant difference, then we will proceed forward with interpreting the magnitude of that difference, typically using an effect size indicator like Cohen’s *d*, which is the standardized mean difference. In essence, Cohen’s *d* indicates the magnitude of the difference between means in standard deviation units. A *d*-value of .00 would indicate that there is no difference between the two means, while the following are some generally accepted qualitative-magnitude labels we can attach to the absolute value of *d*.

| Cohen’s *d* | Description |
|---|---|
| .20 | Small |
| .50 | Medium |
| .80 | Large |
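As a minimal sketch of how Cohen’s *d* can be computed by hand, the function below divides the difference between two means by their pooled standard deviation. The function name `cohens_d` is hypothetical; the means, standard deviations, and sample sizes are the descriptive statistics for the *New* and *No* conditions reported in this chapter’s tutorial output. A dedicated function (e.g., `cohen.d` from the `effsize` package) may apply additional options or corrections, so treat this only as an illustration of the underlying arithmetic.

```r
# Cohen's d from summary statistics, using the pooled standard deviation
cohens_d <- function(m1, m2, sd1, sd2, n1, n2) {
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / sd_pooled
}

# New vs. No training conditions (descriptives from the tutorial output)
d_new_no <- cohens_d(m1 = 72.36, m2 = 62.36,
                     sd1 = 6.98, sd2 = 8.09,
                     n1 = 25, n2 = 25)
round(d_new_no, 2)
```

Because the absolute value of the result exceeds .80, it would be labeled as large using the table above.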

#### 35.1.2.5 Sample Write-Up

As part of a post-test-only with two comparison groups design, 75 employees were randomly assigned to one (and only one) of the following three groups associated with our categorical (nominal, ordinal) predictor variable: no noise, some noise, and loud noise. A total of 25 participants were assigned to each group, resulting in a balanced design. For all employees, verbal fluency was assessed while they were exposed to one of the noise conditions, where verbal fluency serves as the continuous (interval, ratio) outcome variable. We applied a one-way ANOVA to determine whether verbal fluency differed across the levels of noise each group of employees experienced as part of our study. We found a significant omnibus *F*-test, which indicated that there were differences across the means in verbal fluency for the three noise conditions (*F* = 7.55, *p* = .03). The model \(R^{2}\) was .12, which indicates that 12% of the variance in verbal fluency can be explained by the level of noise employees were exposed to; in other words, in terms of practical significance, the model fit the data to a medium extent. Given the significant omnibus *F*-test, we computed Tukey’s tests to examine the post-hoc pairwise comparisons between each pair of verbal-fluency means. The mean verbal fluency score was 77.00 (*SD* = 3.20) for the no-noise condition, 52.00 (*SD* = 3.10) for the some-noise condition, and 50.00 (*SD* = 3.10) for the loud-noise condition. Further, the pairwise comparisons indicated, as expected, that mean verbal fluency was significantly higher for the no-noise condition compared to the some-noise condition (\(M_{diff}\) = 15.00, adjusted *p* = .03) and the loud-noise condition (\(M_{diff}\) = 17.00, adjusted *p* = .02). We found that the standardized mean differences (Cohen’s *d*) for the two significant differences in means were .96 and .91, respectively, both of which can be considered large. Finally, a significant difference in means was not found when comparing verbal-fluency scores for the some-noise and loud-noise conditions (\(M_{diff}\) = 2.00, adjusted *p* = .34).

## 35.2 Tutorial

This chapter’s tutorial demonstrates how to estimate a one-way ANOVA and post-hoc pairwise mean comparisons, test the associated statistical assumptions, and present the findings in writing and visually.

### 35.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/e6oVV1ynfWo

### 35.2.2 Functions & Packages Introduced

Function | Package |
---|---|

| `Plot` | `lessR` |
| `tapply` | base R |
| `shapiro.test` | base R |
| `leveneTest` | `car` |
| `ANOVA` | `lessR` |
| `cohen.d` | `effsize` |

### 35.2.3 Initial Steps

If you haven’t already, save the file called **“TrainingEvaluation_ThreeGroupPost.csv”** into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., `"H:/RWorkshop"`). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the `setwd` function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to *Session > Set Working Directory > Choose Directory…*. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

Next, read in the .csv data file called **“TrainingEvaluation_ThreeGroupPost.csv”** using your choice of read function. In this example, I use the `read_csv` function from the `readr` package (Wickham, Hester, and Bryan 2024). If you choose to use the `read_csv` function, be sure that you have installed and accessed the `readr` package using the `install.packages` and `library` functions. *Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months.* For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

```
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
```

```
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
td <- read_csv("TrainingEvaluation_ThreeGroupPost.csv")
```

```
## Rows: 75 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Condition
## dbl (2): EmpID, PostTest
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

`## [1] "EmpID" "Condition" "PostTest"`

```
## spc_tbl_ [75 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ EmpID : num [1:75] 1 2 3 4 5 6 7 8 9 10 ...
## $ Condition: chr [1:75] "No" "No" "No" "No" ...
## $ PostTest : num [1:75] 74 65 62 68 70 61 79 67 79 59 ...
## - attr(*, "spec")=
## .. cols(
## .. EmpID = col_double(),
## .. Condition = col_character(),
## .. PostTest = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
```

```
## # A tibble: 6 × 3
## EmpID Condition PostTest
## <dbl> <chr> <dbl>
## 1 1 No 74
## 2 2 No 65
## 3 3 No 62
## 4 4 No 68
## 5 5 No 70
## 6 6 No 61
```

There are 75 cases (i.e., employees) and 3 variables in the `td` data frame: `EmpID` (unique identifier for employees), `Condition` (training condition: *New* = new training program, *Old* = old training program, *No* = no training program), and `PostTest` (post-training scores on training assessment, ranging from 1-100). Regarding participation in the training conditions, 25 employees participated in each condition, with no employee participating in more than one condition; this means that we have a balanced design. Per the output of the `str` (structure) function above, all of the variables except for `Condition` are of type *numeric* (continuous: interval/ratio), and `Condition` is of type *character* (nominal/categorical).

### 35.2.4 Test Statistical Assumptions

Prior to estimating and interpreting the one-way ANOVA, let’s generate a VBS (violin-box-scatter) plot to visualize the statistical assumptions regarding the outcome variable having a univariate normal distribution in each of the three training conditions and the variances of the outcome variable being approximately equal across the three conditions. To do so, we’ll use the `Plot` function from the `lessR` package (Gerbing, Business, and University 2021). If you haven’t already, install and access the `lessR` package using the `install.packages` and `library` functions, respectively.

Type the name of the `Plot` function. As the first argument within the function, type the name of the outcome variable (`PostTest`). As the second argument, type `data=` followed by the name of the data frame (`td`). As the third argument, type `by1=` followed by the name of the grouping variable (`Condition`), as this will create the trellis (lattice) structure wherein three VBS plots will be created (one for each independent group).
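Putting the three arguments described above together gives the call sketched below. It assumes the `lessR` package is installed and uses the `by1=` argument exactly as named in the text; if your installed version of `lessR` does not recognize that argument, check `?Plot` for the current name. The stand-in data frame is simulated and is created only if `td` is not already in your workspace.

```r
# Access lessR package (install it first if needed)
library(lessR)

# Simulated stand-in used only if td has not been read in yet
if (!exists("td")) {
  set.seed(35)
  td <- data.frame(EmpID = 1:75,
                   Condition = rep(c("New", "No", "Old"), each = 25),
                   PostTest = round(rnorm(75, mean = 68, sd = 9)))
}

# VBS plots of the outcome variable, one panel per condition
Plot(PostTest, data=td, by1=Condition)
```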

```
## [Trellis graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(PostTest, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(PostTest, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ANOVA(PostTest ~ Condition) # Add the data parameter if not the d data frame
```

```
## PostTest
## - by levels of -
## Condition
##
## n miss mean sd min mdn max
## New 25 0 72.36 6.98 60.00 73.00 84.00
## No 25 0 62.36 8.09 47.00 61.00 79.00
## Old 25 0 69.60 9.11 51.00 70.00 89.00
##
## Max Dupli-
## Level cations Values
## ------------------------------
## New 4 73
## No 3 60
## Old 3 70
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.55 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 1.00 random vertical movement of points
## jitter_x: 0.28 random horizontal movement of points
## bw: 3.43 set bandwidth higher for smoother edges
```

Based on the output from the `Plot` function, note that (at least visually) the three distributions seem to be roughly normally distributed, and the variances appear to be approximately equal. These are by no means stringent tests of the statistical assumptions, but they provide us with a cursory understanding of the shape of the distributions and the variances. If we were to see evidence of non-normality across the conditions, then we might (a) transform the outcome variable to achieve normality (if possible), or (b) apply a nonparametric analysis like the *Kruskal-Wallis rank sum test*.
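For reference, the Kruskal-Wallis rank sum test mentioned above is available in base R through the `kruskal.test` function. The sketch below shows the general form of the call; it is not part of this chapter’s analysis, and the stand-in data frame is simulated and created only if `td` is not already in your workspace.

```r
# Simulated stand-in used only if td has not been read in yet
if (!exists("td")) {
  set.seed(35)
  td <- data.frame(EmpID = 1:75,
                   Condition = rep(c("New", "No", "Old"), each = 25),
                   PostTest = round(rnorm(75, mean = 68, sd = 9)))
}

# Nonparametric alternative to the one-way ANOVA
kw <- kruskal.test(PostTest ~ Condition, data = td)
kw
```

The test statistic is evaluated against a chi-squared distribution with degrees of freedom equal to the number of groups minus 1.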

We can also go a step further by testing the normal distribution and equal variances statistical assumptions using statistical tests.

**Assumption of Normally Distributed Outcome Variable Scores for Each Level of Predictor Variable:** As a reminder, the first statistical assumption is that the outcome variable has a univariate normal distribution in each of the underlying populations (e.g., groups, conditions), which correspond to the levels of the categorical predictor variable. The **Shapiro-Wilk normality test** can be used to test the null hypothesis that a distribution is normal; if the *p*-value associated with the test statistic (*W*) is less than the conventional alpha level of .05, then we would *reject* the null hypothesis and assume that the distribution is *not* normal. If, however, we *fail to reject* the null hypothesis, then we do not have statistical evidence that the distribution is anything other than normal. In other words, if the *p*-value is equal to or greater than our alpha level (.05), then we can assume the variable is normally distributed.

To compute the Shapiro-Wilk normality test, we will use the `shapiro.test` function from base R. Because we need to test the assumption of normality of the outcome variable (`PostTest`) for all three levels of the predictor variable (`Condition`), we also need to use the `tapply` function from base R. The `tapply` function can be quite useful, as it allows us to apply a function to a variable for each level of another categorical variable. To begin, type the name of the `tapply` function. As the first argument, type the name of the data frame (`td`), followed by the `$` symbol and the name of the outcome variable (`PostTest`). As the second argument, type the name of the data frame (`td`), followed by the `$` symbol and the name of the categorical predictor variable (`Condition`). Finally, as the third argument, type the name of the `shapiro.test` function.

```
# Compute Shapiro-Wilk normality test for normal distributions
tapply(td$PostTest, td$Condition, shapiro.test)
```

```
## $New
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.95019, p-value = 0.2533
##
##
## $No
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.97029, p-value = 0.6525
##
##
## $Old
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.98644, p-value = 0.977
```

In the output, we can see that the `PostTest` variable is normally distributed for those in the *New* training condition (*W* = .95019, *p* = .2533), the *Old* training condition (*W* = .98644, *p* = .977), and the *No* training condition (*W* = .97029, *p* = .6525). That is, because the *p*-values were each equal to or greater than .05, we failed to reject the null hypothesis that the distributions of outcome variable scores were normal. Thus, we have statistical support for having met the first assumption.

**Assumption of Equal Variances (Homogeneity of Variances):** As for the equal variances assumption, **Levene’s test** (i.e., homogeneity of variances test) is commonly used. The null hypothesis of this test is that the variances of the outcome variable are equal across levels of the categorical predictor variable. Thus, if the *p*-value is less than the conventional alpha level of .05, then we reject the null hypothesis and assume the variances are different. If, however, the *p*-value is equal to or greater than .05, then we fail to reject the null hypothesis and assume that the variances are equal (i.e., variances are homogeneous).

To test the equality (homogeneity) of variances assumption, we will use the `leveneTest` function from the `car` package. More than likely the `car` package is already installed on your computer, as many other packages are dependent on it. That being said, you may still need to install the package prior to accessing it using the `library` function.

Type the name of the `leveneTest` function. As the first argument, specify the statistical model. To do so, type the name of the outcome (dependent) variable (`PostTest`) to the left of the `~` symbol and the name of the predictor (independent) variable (`Condition`) to the right of the `~` symbol. For the second argument, use `data=` to specify the name of the data frame (`td`).
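Written out, the call described above looks like the sketch below. It assumes the `car` package is installed; the stand-in data frame is simulated and is created only if `td` is not already in your workspace, so its results will differ from the output that follows.

```r
# Access car package (install it first if needed)
library(car)

# Simulated stand-in used only if td has not been read in yet
if (!exists("td")) {
  set.seed(35)
  td <- data.frame(EmpID = 1:75,
                   Condition = rep(c("New", "No", "Old"), each = 25),
                   PostTest = round(rnorm(75, mean = 68, sd = 9)))
}

# Compute Levene's test for equal variances
lev <- leveneTest(PostTest ~ Condition, data=td)
lev
```

Because `Condition` is stored as a character variable rather than a factor, the function may print a note or warning that the grouping variable was coerced to a factor; this should not change the results.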

```
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.6499 0.5251
## 72
```

In the output, we see that the test is nonsignificant (*F* = .6499, *p* = .5251), which suggests that, based on this test, we have no reason to believe that the three variances are anything but equal. In other words, because the *p*-value for this test is equal to or greater than .05, we fail to reject the null hypothesis that the variances are equal. All in all, we found evidence to support that we met the two statistical assumptions necessary to proceed forward with estimating our one-way ANOVA.

### 35.2.5 Estimate One-Way ANOVA

There are different functions that can be used to run a one-way ANOVA in R. In this chapter, we will review how to run a one-way ANOVA using the `ANOVA` function from the `lessR` package, and if you’re interested, in the chapter supplement I demonstrate how to carry out the same processes using the `aov` function from base R.

Using the `ANOVA` function from the `lessR` package, we will evaluate whether the means on the post-test (`PostTest`) continuous outcome variable differ between levels of the `Condition` categorical predictor variable (*New*, *Old*, *No*); in other words, let’s find out if we should treat the means as being different from one another.

A big advantage of using the `ANOVA` function from the `lessR` package to estimate a one-way ANOVA is that the function automatically generates descriptive statistics, the omnibus *F*-test and associated indicators of effect size (i.e., practical significance), and post-hoc pairwise comparisons.

If you haven’t already, install and access the `lessR` package using the `install.packages` and `library` functions, respectively.

Now we’re ready to estimate a one-way ANOVA. To begin, type the name of the `ANOVA` function. As the first argument in the parentheses, specify the statistical model. To do so, type the name of the continuous outcome variable (`PostTest`) to the left of the `~` symbol and the name of the categorical predictor variable (`Condition`) to the right of the `~` symbol. For the second argument, use `data=` to specify the name of the data frame where the outcome and predictor variables are located (`td`).
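The call described above can be sketched as follows. This is a minimal sketch that assumes the `lessR` package is already installed and that the `td` data frame from the Initial Steps section is in your workspace.

```r
# Access the lessR package (assumes it has already been installed)
library(lessR)

# Estimate one-way ANOVA: PostTest (outcome) by Condition (predictor)
# Assumes td was read in during the Initial Steps section
ANOVA(PostTest ~ Condition, data=td)
```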

```
##
## BACKGROUND
##
## Response Variable: PostTest
##
## Factor Variable: Condition
## Levels: New No Old
##
## Number of cases (rows) of data: 75
## Number of cases retained for analysis: 75
##
##
## DESCRIPTIVE STATISTICS
##
## n mean sd min max
## New 25 72.36 6.98 60.00 84.00
## No 25 62.36 8.09 47.00 79.00
## Old 25 69.60 9.11 51.00 89.00
##
## Grand Mean: 68.107
##
##
## ANOVA
##
## df Sum Sq Mean Sq F-value p-value
## Condition 2 1333.63 666.81 10.16 0.0001
## Residuals 72 4727.52 65.66
##
## R Squared: 0.220
## R Sq Adjusted: 0.198
## Omega Squared: 0.196
##
##
## Cohen's f: 0.494
##
##
## TUKEY MULTIPLE COMPARISONS OF MEANS
##
## Family-wise Confidence Level: 0.95
## -----------------------------------
## diff lwr upr p adj
## No-New -10.00 -15.48 -4.52 0.00
## Old-New -2.76 -8.24 2.72 0.45
## Old-No 7.24 1.76 12.72 0.01
##
##
## RESIDUALS
##
## Fitted Values, Residuals, Standardized Residuals
## [sorted by Standardized Residuals, ignoring + or - sign]
## [res_rows = 20, out of 75 cases (rows) of data, or res_rows="all"]
## -----------------------------------------------
## Condition PostTest fitted residual z-resid
## 75 Old 89.00 69.60 19.40 2.44
## 57 Old 51.00 69.60 -18.60 -2.34
## 7 No 79.00 62.36 16.64 2.10
## 9 No 79.00 62.36 16.64 2.10
## 64 Old 54.00 69.60 -15.60 -1.96
## 25 No 47.00 62.36 -15.36 -1.93
## 73 Old 56.00 69.60 -13.60 -1.71
## 60 Old 83.00 69.60 13.40 1.69
## 32 New 60.00 72.36 -12.36 -1.56
## 29 New 84.00 72.36 11.64 1.47
## 1 No 74.00 62.36 11.64 1.47
## 56 Old 81.00 69.60 11.40 1.44
## 33 New 61.00 72.36 -11.36 -1.43
## 42 New 61.00 72.36 -11.36 -1.43
## 20 No 51.00 62.36 -11.36 -1.43
## 35 New 83.00 72.36 10.64 1.34
## 43 New 83.00 72.36 10.64 1.34
## 28 New 62.00 72.36 -10.36 -1.30
## 41 New 82.00 72.36 9.64 1.21
## 51 Old 79.00 69.60 9.40 1.18
##
## ----------------------------------------
## Plot 1: 95% family-wise confidence level
## Plot 2: Scatterplot with Cell Means
## ----------------------------------------
```

As you can see in the output, the `ANOVA` function provides background information about your variables, descriptive statistics, an omnibus statistical significance test of the mean comparison, post-hoc pairwise mean comparisons, and indicators of practical significance. In addition, the default data visualizations include a scatterplot with the cell (group) means and a chart with mean differences between conditions presented.

**Background**: The *Background* section provides information about the name of the data frame, the name of the response (i.e., outcome) variable, the factor (i.e., categorical predictor) variable and its levels (i.e., conditions, groups), and the number of cases.

**Descriptive Statistics**: The *Descriptive Statistics* section includes basic descriptive statistics about the sample. In the output, we can see that there are 25 employees in each condition (*n* = 25), and descriptively, the mean `PostTest` score for the *New* training condition is 72.36 (*SD* = 6.98), the mean `PostTest` score for the *Old* training condition is 69.60 (*SD* = 9.11), and the mean `PostTest` score for the *No* training condition is 62.36 (*SD* = 8.09). The grand mean (i.e., overall mean for the entire sample) is 68.107. Thus, descriptively we can see that the condition means are not the same, but the question remains whether these means are different from one another to a statistically significant extent.

**Basic Analysis:** In the *Basic Analysis* section of the output (labeled *ANOVA* in the output shown above), you will find the statistical test of the null hypothesis (i.e., the means are equal). This is called an omnibus test because we are testing whether or not there is evidence that we should treat *all* means as equal. First, in the summary table, take a look at the line prefaced with *Condition*; in this line, you will find the degrees of freedom (*df*), the sum of squares (`Sum Sq`), and the mean square (`Mean Sq`) between groups/conditions; in addition, you will find the omnibus *F*-value and its associated *p*-value. The *F*-value and its associated *p*-value reflect the null hypothesis significance test of the means being equal, and because the *p*-value is less than the conventional alpha level of .05, we reject the null hypothesis that the means are equal (*F* = 10.16, *p* < .001); meaning, we have evidence that at least two of the means differ from one another to a statistically significant extent. But which ones? To answer this question, we will need to look at the pairwise mean comparison tests later in the output.

Next, look at the association and effect-size estimates listed beneath the summary table. The (unadjusted) *R*-squared (*R*^{2}) value indicates the extent to which the predictor variable explains variance in the outcome variable in this sample; if you multiply the value by 100, you get a percentage. In this case, we find that 22% of the variance in `PostTest` scores is explained by the different levels of the `Condition` variable (i.e., *New*, *Old*, *No*) for this example. The *adjusted* *R*^{2} value, however, is an estimate of the magnitude of the association in the underlying *population* (as opposed to specifically for this sample), and here we see that the adjusted *R*^{2} value is .20 (or 20%). We also find information about other effect-size indicators, including omega-squared (\(\omega\)^{2}) and Cohen’s *f*. \(\omega\)^{2} is a population-level indicator of effect size like the adjusted *R*^{2} value, and like the adjusted *R*^{2} value, it will tend to be smaller than the unadjusted *R*^{2} value. Some functions compute an effect size indicator called eta-squared (\(\eta\)^{2}), which is equivalent to an unadjusted *R*^{2} value in this context. Finally, Cohen’s *f* focuses not on the variance explained (i.e., the association) but on the magnitude of the differences in means between groups/conditions.

By most standards, these effect sizes would be considered medium-to-large or large in magnitude; remember, these effect-size indicators correspond to the omnibus *F*-test. In the table below, I provide conventional rules of thumb for qualitatively interpreting the magnitude of *R*^{2} (adjusted or unadjusted) and Cohen’s *f*; I suggest picking one and using it consistently. *Finally, please note that typically we only interpret practical significance when a difference has been found to be statistically significant.*

R^{2} | \(\eta^2\) | \(\omega^2\) | Cohen’s f | Description |
---|---|---|---|---|
.01 | .01 | .01 | .10 | Small |
.09 | .09 | .09 | .25 | Medium |
.25 | .25 | .25 | .40 | Large |
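To make these effect-size indicators more concrete, the sketch below recomputes them by hand from the sums of squares and degrees of freedom in the ANOVA table of the output. The formulas are the standard textbook definitions, and the object names (e.g., `ss_between`, `ss_within`) are hypothetical; I have not checked them against the internals of the `ANOVA` function, although the results agree with the output to rounding.

```r
# Values copied from the ANOVA table in the output above
ss_between <- 1333.63   # Sum Sq for Condition
ss_within  <- 4727.52   # Sum Sq for Residuals
df_between <- 2         # df for Condition
df_within  <- 72        # df for Residuals

ss_total  <- ss_between + ss_within
df_total  <- df_between + df_within
ms_within <- ss_within / df_within   # Mean Sq for Residuals

# Unadjusted R-squared (equivalent to eta-squared in a one-way ANOVA)
r_sq <- ss_between / ss_total

# Adjusted R-squared
r_sq_adj <- 1 - ms_within / (ss_total / df_total)

# Omega-squared
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Cohen's f computed from omega-squared
cohens_f <- sqrt(omega_sq / (1 - omega_sq))

# Print the hand-computed estimates, rounded to three decimals
round(c(R2 = r_sq, R2_adj = r_sq_adj, omega2 = omega_sq, f = cohens_f), 3)
```

Swapping `r_sq` in for `omega_sq` in the last formula yields a somewhat larger value of Cohen’s *f*, which is one reason different functions can report different values of *f* for the same model.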

**Tukey Multiple Comparisons of Means**: Recall that based on the omnibus *F*-test above, we found evidence that the group means were *not* equal; in other words, the *p*-value associated with the *F*-value indicated the means differed significantly across the groups. Because the omnibus *F*-test indicated statistical significance at the model level, we should proceed with post-hoc pairwise mean comparison tests, such as *Tukey’s test*. If the omnibus test had not been statistically significant, then we would *not* proceed with interpreting the post-hoc pairwise mean comparison tests.

In the *Tukey Multiple Comparisons of Means* section, we find the pairwise mean comparisons corrected for family-wise error based on Tukey’s approach. Family-wise error refers to instances in which we run multiple statistical tests, which means that we may be more likely to capitalize on chance when searching for statistically significant findings. When tests are adjusted for family-wise error, the *p*-values (or confidence intervals) are corrected (i.e., penalized) for the fact that multiple statistical tests were run, thereby increasing the threshold for finding a statistically significant result. Each row in the pairwise-comparison table in the output shows the raw difference (i.e., `diff`) in means between the two specified groups, such that the second group mean listed is subtracted from the first group mean listed (e.g., *No-New* is the *No* mean minus the *New* mean). Next, the lower (i.e., `lwr`) and upper (i.e., `upr`) 95% confidence interval limits are presented. Finally, the adjusted *p*-value is presented (i.e., `p adj`).

In our output, we find that employees who participated in the *No* training condition scored, on average, 10.00 points lower (-10.00) on their post-test (`PostTest`) assessment than employees who participated in the *New* training condition, which is a statistically significant difference (*p*-adjusted < .01, 95% CI[-15.48, -4.52]). Similarly, employees who participated in the *Old* training condition scored, on average, 7.24 points higher on their post-test (`PostTest`) assessment than employees who participated in the *No* training condition (*p*-adjusted = .01, 95% CI[1.76, 12.72]). Note that, because the three conditions are the same size and the test uses a pooled error term, the 95% confidence intervals all have the same width (about 10.96 points); however, the interval for the difference between the *Old* and *No* training conditions lies closer to zero than the interval for the difference between the *New* and *No* training conditions. Finally, we find that the mean difference of -2.76 between the *New* and the *Old* training conditions is *not* statistically significant (*p*-adjusted = .45, 95% CI[-8.24, 2.72]).
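As a rough check on where the interval limits and adjusted *p*-values come from, the following sketch rebuilds the *No* versus *New* comparison by hand using the `qtukey` and `ptukey` functions from base R. It assumes equal group sizes and takes the residual mean square from the ANOVA table in the output; the object names are hypothetical.

```r
# Values copied from the output above
ms_within   <- 65.66           # Mean Sq for Residuals
df_within   <- 72              # residual degrees of freedom
n_group     <- 25              # employees per condition
k_groups    <- 3               # number of conditions
diff_no_new <- 62.36 - 72.36   # No mean minus New mean

# Standard error of the difference between two group means
se_diff <- sqrt(ms_within * (1 / n_group + 1 / n_group))

# Critical value from the studentized range distribution
q_crit <- qtukey(0.95, nmeans = k_groups, df = df_within)

# Half-width and limits of the 95% family-wise confidence interval
half_width <- q_crit / sqrt(2) * se_diff
ci <- diff_no_new + c(-1, 1) * half_width

# Tukey-adjusted p-value for the comparison
p_adj <- ptukey(abs(diff_no_new) / se_diff * sqrt(2),
                nmeans = k_groups, df = df_within, lower.tail = FALSE)

# Print the hand-computed row for the No-New comparison
round(c(diff = diff_no_new, lwr = ci[1], upr = ci[2], p_adj = p_adj), 4)
```

Because every condition has 25 employees, the same `half_width` applies to each of the three pairwise comparisons.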

Note that the `ANOVA` function from `lessR` does *not* provide effect size estimates for the post-hoc pairwise mean comparisons, so if you would like those, you can do the following.

**Effect Sizes of Significant Post-Hoc Pairwise Mean Comparisons:** There are various ways that we could go about computing an effect size such as Cohen’s *d* for those statistically significant post-hoc pairwise mean comparisons. In the post-hoc pairwise mean comparisons section of the output, we identified that the *New* and *Old* training conditions resulted in significantly higher post-training assessment (`PostTest`) scores compared to the *No* training condition. The question then becomes: How much better than the *No* training condition are the *New* and *Old* training conditions?

To compute Cohen’s *d* as an estimate of practical significance, we will use the `cohen.d` function from the `effsize` package. If you haven’t already, install the `effsize` package. Make sure to access the package using the `library` function.

As the first argument in the `cohen.d` function parentheses, type the name of the continuous outcome variable (`PostTest`) to the left of the `~` symbol and the name of the categorical predictor variable (`Condition`) to the right of the `~` symbol. For the second argument, we are going to apply the `subset` function from base R after `data=` to indicate that we only want to analyze a subset of our data frame. The `subset` function is a simpler version of the `filter` function from `dplyr`. Why are we doing this? The `cohen.d` function will only allow predictor variables with two levels, and our `Condition` variable has three levels: *New*, *Old*, and *No*. After `data=`, type `subset`, and within the `subset` function parentheses, enter the name of the data frame (`td`) as the first argument and a conditional statement that removes one of the three predictor variable levels (`Condition!="Old"`); in this first example, we remove the *Old* level so that we can compare just the *New* and *No* conditions. Back to the `cohen.d` function, as the third argument, type `paired=FALSE` to indicate that the data are *not* paired (i.e., the data are not dependent).

```
# Compute Cohen's d for New and No condition means
cohen.d(PostTest ~ Condition, data=subset(td, Condition!="Old"), paired=FALSE)
```

```
##
## Cohen's d
##
## d estimate: 1.324165 (large)
## 95 percent confidence interval:
## lower upper
## 0.6962342 1.9520949
```

The output indicates that Cohen’s *d* is 1.324, which would be considered large by conventional cutoff standards (see table below).

Cohen’s d | Description |
---|---|
.20 | Small |
.50 | Medium |
.80 | Large |

Let’s repeat the same process as above, except this time we will focus on the *Old* and *No* levels of the `Condition` predictor variable by removing the level called *New*.

```
# Compute Cohen's d for Old and No condition means
cohen.d(PostTest ~ Condition, data=subset(td, Condition!="New"), paired=FALSE)
```

```
##
## Cohen's d
##
## d estimate: -0.8407151 (large)
## 95 percent confidence interval:
## lower upper
## -1.4339989 -0.2474312
```

The output indicates that Cohen’s *d* is .841, which is large but not as large as the Cohen’s *d* we saw when comparing the `PostTest` means for the *New* and *No* training conditions. Note: Cohen’s *d* was actually negative (-.841), but typically we just report the absolute value in this context, as the negative or positive sign of a Cohen’s *d* simply indicates which mean was subtracted from the other mean; reversing this order would result in the opposite sign.
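For intuition about what the `cohen.d` function is estimating, the sketch below computes a pooled-standard-deviation version of Cohen’s *d* by hand from the rounded descriptive statistics reported in the `ANOVA` output. The helper name `cohens_d_from_stats` is hypothetical, and because the inputs are rounded to two decimals, the results match the `cohen.d` output only approximately.

```r
# Hypothetical helper: Cohen's d from two groups' summary statistics
cohens_d_from_stats <- function(m1, sd1, n1, m2, sd2, n2) {
  # Pooled standard deviation across the two groups
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  # Standardized mean difference (first group minus second group)
  (m1 - m2) / sd_pooled
}

# New vs. No conditions (means, SDs, and ns from the descriptive statistics)
d_new_no <- cohens_d_from_stats(72.36, 6.98, 25, 62.36, 8.09, 25)

# Old vs. No conditions
d_old_no <- cohens_d_from_stats(69.60, 9.11, 25, 62.36, 8.09, 25)

# Print both estimates, rounded to two decimals
round(c(New_vs_No = d_new_no, Old_vs_No = d_old_no), 2)
```

Here the *No* condition is always entered as the second group, so both estimates come out positive; entering the groups in the opposite order would simply flip the sign.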

**Sample Write-Up:** To evaluate the effectiveness of a new training program, we applied a post-test-only with two comparison groups training evaluation design. In total, 25 employees participated in the new training program, 25 employees participated in the old training program, and 25 employees did not participate in a training program. After completing their respective training conditions, employees were assessed on the knowledge they acquired during training, where scores could range from 1 to 100. We found that post-training assessments differed across training conditions to a statistically significant extent (*F* = 10.16, *p* < .001); together, participation in the different training conditions explained 20% of the variability in post-training assessment scores (*R*^{2} = .22; *R*^{2}_{adjusted} = .20). Results of follow-up tests indicated that employees who participated in the new training program performed, on average, 10.00 points better on their post-training assessment than those who did not participate in a training program (*p*_{adjusted} < .01, 95% CI[4.52, 15.48]), which was a large difference (*d* = 1.324). Further, employees who participated in the old training program performed, on average, 7.24 points better on their post-training assessment than those who did not participate in a training program (*p*_{adjusted} < .01, 95% CI[1.76, 12.72]), which was a large difference (*d* = .841). Average post-training assessment scores were not found to differ to a statistically significant extent for those who participated in the new versus old training programs (*M*_{difference} = 2.76, *p*_{adjusted} = .45, 95% CI[-2.72, 8.24]).

*Note: When interpreting the results, I flipped the sign (+ vs. -) of some of the findings to make the interpretation more consistent. Feel free to do the same.*

### 35.2.6 Visualize Results Using Bar Chart

When we find a statistically significant difference between two or more pairs of means based on a one-way ANOVA, we may want to present the means in a bar chart to facilitate storytelling. To do so, we will use the `BarChart` function from `lessR`. If you haven’t already, install and access the `lessR` package using the `install.packages` and `library` functions, respectively.

Type the name of the `BarChart` function. As the first argument, type `x=` followed by the name of the categorical predictor variable (`Condition`). As the second argument, type `y=` followed by the name of the continuous outcome variable (`PostTest`). As the third argument, specify `stat="mean"` to request the application of the mean function to the `PostTest` variable based on the levels of the `Condition` variable. As the fourth argument, type `data=` followed by the name of the data frame object to which our predictor and outcome variables belong (`td`). As the fifth argument, use `xlab=` to provide the x-axis label (`"Training Condition"`). As the sixth argument, use `ylab=` to provide the y-axis label (`"Post-Test Score"`).

```
# Create bar chart
BarChart(x=Condition, y=PostTest,
         stat="mean",
         data=td,
         xlab="Training Condition",
         ylab="Post-Test Score")
```

```
## PostTest
## - by levels of -
## Condition
##
## n miss mean sd min mdn max
## New 25 0 72.36 6.98 60.00 73.00 84.00
## No 25 0 62.36 8.09 47.00 61.00 79.00
## Old 25 0 69.60 9.11 51.00 70.00 89.00
```

```
## >>> Suggestions
## Plot(PostTest, Condition) # lollipop plot
##
## Plotted Values
## --------------
## New No Old
## 72.360 62.360 69.600
```

### 35.2.7 Summary

In this chapter, we learned how to estimate a one-way ANOVA using the `ANOVA` function from the `lessR` package. We also learned how to test statistical assumptions, compute post-hoc pairwise mean comparisons, estimate an effect size for the omnibus test, and estimate effect sizes for the pairwise mean comparisons.

## 35.3 Chapter Supplement

In addition to the `ANOVA` function from the `lessR` package covered above, we can use the `aov` function from base R to estimate a one-way ANOVA. Because this function comes from base R, we do not need to install and access an additional package. In this supplement, you will also have an opportunity to learn how to make an APA (American Psychological Association) style table of the one-way ANOVA results.

### 35.3.1 Functions & Packages Introduced

Function | Package |
---|---|
`aov` | base R |
`summary` | base R |
`anova_stats` | `sjstats` |
`TukeyHSD` | base R |
`plot` | base R |
`mean` | base R |
`cohen.d` | `effsize` |
`apa.aov.table` | `apaTables` |
`apa.1way.table` | `apaTables` |
`apa.d.table` | `apaTables` |

### 35.3.2 Initial Steps

If required, please refer to the Initial Steps section from this chapter for more information on these initial steps.

```
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
```

```
# Access readr package
library(readr)
# Read data and name data frame (tibble) object
td <- read_csv("TrainingEvaluation_ThreeGroupPost.csv")
```

```
## Rows: 75 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Condition
## dbl (2): EmpID, PostTest
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
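The three outputs that follow appear to come from inspecting the data frame after it has been read in. A sketch of calls that would produce output of this form, assuming `td` was created with `read_csv` as shown above:

```r
# Print variable names
names(td)

# Print structure of the data frame
str(td)

# Print first 6 rows of the data frame
head(td)
```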

`## [1] "EmpID" "Condition" "PostTest"`

```
## spc_tbl_ [75 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ EmpID : num [1:75] 1 2 3 4 5 6 7 8 9 10 ...
## $ Condition: chr [1:75] "No" "No" "No" "No" ...
## $ PostTest : num [1:75] 74 65 62 68 70 61 79 67 79 59 ...
## - attr(*, "spec")=
## .. cols(
## .. EmpID = col_double(),
## .. Condition = col_character(),
## .. PostTest = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
```

```
## # A tibble: 6 × 3
## EmpID Condition PostTest
## <dbl> <chr> <dbl>
## 1 1 No 74
## 2 2 No 65
## 3 3 No 62
## 4 4 No 68
## 5 5 No 70
## 6 6 No 61
```

### 35.3.3 `aov` Function from Base R

The `aov` function from base R offers another route for running a one-way ANOVA. Prior to using the `aov` function, it is advisable to perform statistical tests to give us a better understanding of whether we have satisfied the two statistical assumptions necessary to estimate and interpret a one-way ANOVA; rather than repeat those same diagnostic tests here, please refer to the Test Statistical Assumptions section to learn how to perform those tests.

Assuming we have satisfied the statistical assumptions, we are now ready to estimate the one-way ANOVA. To begin, come up with a name for the one-way ANOVA model that you are specifying; you can call this model object whatever you’d like, and here I refer to it as `model1`. To the right of the model name, type the `<-` operator to indicate that you are assigning the one-way ANOVA model to the object. Next, to the right of the `<-` operator, type the name of the `aov` function from base R. As the first argument, type the name of the continuous outcome variable (`PostTest`) to the left of the `~` operator and the name of the categorical predictor variable (`Condition`) to the right of the `~` operator. For the second argument, use `data=` to specify the name of the data frame (`td`). On the next line, type the name of the `summary` function from base R, and as the sole argument, enter the name of the model object you created and named on the previous line.

```
# One-way ANOVA using aov function from base R
model1 <- aov(PostTest ~ Condition, data=td)
summary(model1)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## Condition 2 1334 666.8 10.16 0.00013 ***
## Residuals 72 4728 65.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note that the `aov` output provides you with the results of the one-way ANOVA. In the output table, take a look at the line prefaced with *Condition*; in this line, you will find the degrees of freedom (*df*), the sum of squares (`Sum Sq`), and the mean square (`Mean Sq`) between groups/conditions; in addition, you will find the omnibus *F*-value and its associated *p*-value. The *F*-value and its associated *p*-value reflect the null hypothesis significance test of the means being equal, and because the *p*-value is less than the conventional alpha level of .05, we reject the null hypothesis that the means are equal (*F* = 10.16, *p* = .00013); meaning, we have evidence that at least two of the means differ from one another to a statistically significant extent.

But how big is this effect? To assess the practical significance of a statistically significant effect, we will apply the `anova_stats` function from the `sjstats` package. If you haven’t already, install and access the `sjstats` package.

```
# Install package
install.packages("sjstats")
# Note: You may also have to install the pwr package independently
# install.packages("pwr")
```

Within the `anova_stats` function parentheses, enter the name of the object for the one-way ANOVA model you created (`model1`).
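The call described above can be sketched as follows, assuming the `sjstats` package is installed and that `model1` is the `aov` model object estimated earlier in this section. If the function is not found, check the documentation for your installed version of `sjstats`.

```r
# Access the sjstats package (assumes it has already been installed)
library(sjstats)

# Effect-size indicators for the one-way ANOVA model
# Assumes model1 is the aov object estimated above
anova_stats(model1)
```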

```
## term | df | sumsq | meansq | statistic | p.value | etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | power
## --------------------------------------------------------------------------------------------------------------------------------------------
## Condition | 2 | 1333.627 | 666.813 | 10.156 | < .001 | 0.220 | 0.220 | 0.196 | 0.196 | 0.198 | 0.531 | 0.986
## Residuals | 72 | 4727.520 | 65.660 | | | | | | | | |
```

In the output, we see the same model information regarding degrees of freedom, sum of squares, mean square, *F*-value, and *p*-value, but we also see estimates for effect-size indicators like eta-squared (\(\eta^2\)), omega-squared (\(\omega^2\)), and Cohen’s *f*. As a reminder, \(\eta^2\) will be equivalent to an *un*adjusted *R*^{2} value in this context, and \(\omega^2\) will be close to (though not identical to) an adjusted *R*^{2} value; in this output, it is epsilon-squared (`epsilonsq` = .198) that matches the adjusted *R*^{2} value reported by the `ANOVA` function from `lessR`. Finally, unlike the aforementioned effect-size indicators, Cohen’s *f* focuses not on the variance explained (i.e., the association) but rather on the magnitude of the differences in means between groups. Note that the Cohen’s *f* reported here (.531) is larger than the value reported by the `ANOVA` function from `lessR` (.494); the two values are consistent with computing *f* from \(\eta^2\) and from \(\omega^2\), respectively. By most standards, these effect sizes would be considered medium-to-large or large in magnitude; remember, these effect-size indicators correspond to the omnibus *F*-test. In the table below, I provide conventional rules of thumb for qualitatively interpreting the magnitude of these effect sizes. *Please note that typically we only interpret practical significance when a difference has been found to be statistically significant.*

R^{2} | \(\eta^2\) | \(\omega^2\) | Cohen’s f | Description |
---|---|---|---|---|
.01 | .01 | .01 | .10 | Small |
.09 | .09 | .09 | .25 | Medium |
.25 | .25 | .25 | .40 | Large |

Based on the omnibus *F*-test we know that these means are not equivalent to one another, and we know that the effect is medium-to-large or large in magnitude. What we don’t yet know is which pairs of means are significantly different from one another and by how much. To answer this question, we will need to run some post-hoc pairwise comparison tests.

**Tukey Multiple Comparisons of Means**: Recall that based on the omnibus *F*-test above, we found evidence that the group means were not equal; in other words, the *p*-value associated with your *F*-value indicated a statistically significant finding. Because the omnibus test was statistically significant, we should proceed forward with post-hoc pairwise mean comparison tests, such as *Tukey’s test*. If the omnibus test had not been statistically significant, then we would *not* proceed forward to interpret the post-hoc pairwise mean comparison tests.

To compute Tukey’s test, we will use the `TukeyHSD` function from base R. First, come up with a name for the object that will contain the results of our Tukey’s test. In this example, I call this object `TukeyTest`. Second, use the `<-` operator to indicate that you’re creating a new object based on the results of the `TukeyHSD` function that you will write next. Third, type the name of the `TukeyHSD` function, and within the parentheses, enter the name of the one-way ANOVA model object that you specified above (`model1`). Finally, on the next line, type the name of the `print` function from base R, and enter the name of the `TukeyTest` object you created as the sole argument.
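These steps can be sketched as follows, assuming `model1` is the `aov` model object estimated earlier in this section.

```r
# Compute Tukey's test for the pairwise mean comparisons
# Assumes model1 is the aov object estimated above
TukeyTest <- TukeyHSD(model1)

# Print the results of Tukey's test
print(TukeyTest)
```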

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PostTest ~ Condition, data = td)
##
## $Condition
## diff lwr upr p adj
## No-New -10.00 -15.484798 -4.515202 0.0001234
## Old-New -2.76 -8.244798 2.724798 0.4545618
## Old-No 7.24 1.755202 12.724798 0.0064718
```

In the output, we find the pairwise mean comparisons corrected for family-wise error based on Tukey’s approach. Family-wise error refers to instances in which we run multiple statistical tests, which means that we may be more likely to capitalize on chance when searching for statistically significant findings. When tests are adjusted for family-wise error, the *p*-values (or confidence intervals) are corrected (i.e., penalized) for the fact that multiple statistical tests were run, thereby increasing the threshold for finding a statistically significant result. Each row in the pairwise-comparison table in the output shows the raw difference (i.e., `diff`) in means between the two specified groups, such that the second group mean listed is subtracted from the first group mean listed (e.g., *No-New* is the *No* mean minus the *New* mean). Next, the lower (i.e., `lwr`) and upper (i.e., `upr`) 95% confidence interval limits are presented. Finally, the adjusted *p*-value is presented (i.e., `p adj`).

In our output, we find that employees who participated in the *No* training condition scored, on average, 10.00 points lower (-10.00) on their post-test (`PostTest`) assessment than employees who participated in the *New* training condition, which is a statistically significant difference (*p*-adjusted < .01, 95% CI[-15.48, -4.52]). Similarly, employees who participated in the *Old* training condition scored, on average, 7.24 points higher on their post-test (`PostTest`) assessment than employees who participated in the *No* training condition (*p*-adjusted = .01, 95% CI[1.76, 12.72]). Note that, because the three conditions are the same size and the test uses a pooled error term, the 95% confidence intervals all have the same width (about 10.97 points); however, the interval for the difference between the *Old* and *No* training conditions lies closer to zero than the interval for the difference between the *New* and *No* training conditions. Finally, we find that the mean difference of -2.76 between the *New* and the *Old* training conditions is *not* statistically significant (*p*-adjusted = .45, 95% CI[-8.24, 2.72]).

We can also plot the pairwise mean comparisons by entering the name of the `TukeyTest` object we created as the sole argument in the `plot` function from base R.
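A minimal sketch of that call, assuming the `TukeyTest` object was created with the `TukeyHSD` function as described above:

```r
# Plot the pairwise mean differences and their 95% family-wise confidence intervals
# Assumes TukeyTest was created with TukeyHSD(model1) above
plot(TukeyTest)
```

In the resulting plot, an interval that crosses zero corresponds to a pairwise comparison that is not statistically significant.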

The `TukeyHSD` function from base R does not provide effect size estimates for the post-hoc pairwise mean comparisons, so if you would like those, you can do the following.

**Effect Sizes of Significant Post-Hoc Pairwise Comparisons:** There are various ways that we could go about computing an effect size such as Cohen’s *d* for those post-hoc pairwise mean comparisons that were statistically significant. In the post-hoc pairwise mean comparisons, we identified that the *New* and *Old* training conditions resulted in significantly higher post-training assessment (`PostTest`) scores compared to the *No* training condition. The question then becomes: How much better than the *No* training condition are the *New* and *Old* training conditions?

To compute Cohen’s *d* as an estimate of practical significance, we will use the `cohen.d` function from the `effsize` package (Torchiano 2020). If you haven’t already, install the `effsize` package. Make sure to access the package using the `library` function.

As the first argument in the `cohen.d` function parentheses, type the name of the continuous outcome variable (`PostTest`) to the left of the `~` operator and the name of the categorical predictor variable (`Condition`) to the right of the `~` operator. For the second argument, we are going to apply the `subset` function from base R after `data=` to indicate that we only want to analyze a subset of our data frame. The `subset` function is a simpler version of the `filter` function from `dplyr`. Why are we doing this? The `cohen.d` function will only allow predictor variables with two levels/categories, and our `Condition` variable has three levels: *New*, *Old*, and *No*. After `data=`, type `subset`, and within the `subset` function parentheses, enter the name of the data frame (`td`) as the first argument and a conditional statement that removes one of the three predictor variable levels (`Condition!="Old"`); in this first example, we remove the *Old* level so that we can compare just the *New* and *No* conditions. Back to the `cohen.d` function, as the third argument, type `paired=FALSE` to indicate that the data are *not* paired (i.e., the data are not dependent).

```
# Compute Cohen's d for New and No condition means
cohen.d(PostTest ~ Condition, data=subset(td, Condition!="Old"), paired=FALSE)
```

```
##
## Cohen's d
##
## d estimate: 1.324165 (large)
## 95 percent confidence interval:
## lower upper
## 0.6962342 1.9520949
```

The output indicates that Cohen’s *d* is 1.324, which would be considered large by conventional cutoff standards (see table below).

Cohen’s d | Description |
---|---|
.20 | Small |
.50 | Medium |
.80 | Large |

Let’s repeat the same process as above, except this time we will focus on the *Old* and *No* levels of the `Condition` predictor variable by removing the level called *New*.

```
# Compute Cohen's d for Old and No condition means
cohen.d(PostTest ~ Condition, data=subset(td, Condition!="New"), paired=FALSE)
```

```
##
## Cohen's d
##
## d estimate: -0.8407151 (large)
## 95 percent confidence interval:
## lower upper
## -1.4339989 -0.2474312
```

The output indicates that Cohen’s *d* is .841, which is large but not as large as the Cohen’s *d* we saw when comparing the `PostTest` means for the *New* and *No* training conditions. Note: Cohen’s *d* was actually negative (-.841), but typically we just report the absolute value in this context, as the negative or positive sign of a Cohen’s *d* simply indicates which mean was subtracted from the other mean; reversing this order would result in the opposite sign.

**Sample Write-Up:** To evaluate the effectiveness of a new training program, we applied a post-test-only with two comparison groups training evaluation design. In total, 25 employees participated in the new training program, 25 employees participated in the old training program, and 25 employees did not participate in a training program. After completing their respective training conditions, employees were assessed on the knowledge they acquired during training, where scores could range from 1 to 100. We found that post-training assessments differed across training conditions to a statistically significant extent (*F* = 10.16, *p* < .001); together, participation in the different training conditions explained 20% of the variability in post-training assessment scores (*R*^{2} = .22; *R*^{2}_{adjusted} = .20). Results of follow-up tests indicated that employees who participated in the new training program performed, on average, 10.00 points better on their post-training assessment than those who did not participate in a training program (*p*_{adjusted} < .01, 95% CI[4.52, 15.48]), which was a large difference (*d* = 1.324). Further, employees who participated in the old training program performed, on average, 7.24 points better on their post-training assessment than those who did not participate in a training program (*p*_{adjusted} < .01, 95% CI[1.76, 12.72]), which was a large difference (*d* = .841). Average post-training assessment scores were not found to differ to a statistically significant extent for those who participated in the new versus old training programs (*M*_{difference} = 2.76, *p*_{adjusted} = .45, 95% CI[-2.72, 8.24]).

*Note: When interpreting the results, I flipped the sign (+ vs. -) of some of the findings to make the interpretation more consistent. Feel free to do the same.*
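Flipping the sign of a pairwise comparison amounts to reversing which group mean is subtracted from which, so the confidence interval has to be flipped along with the estimate: both bounds change sign and swap places. A minimal sketch of that bookkeeping follows. The helper name `flip_comparison` is hypothetical (it does not come from any package), and the input values assume the new-versus-old comparison was originally estimated as Old minus New.

```r
# Hypothetical helper (not from any package): reverse the direction of a
# pairwise mean comparison. Negating the difference also negates the
# confidence limits and swaps which one is the lower bound.
flip_comparison <- function(diff, lower, upper) {
  c(diff = -diff, lower = -upper, upper = -lower)
}

# Assumed Old-minus-New estimate and its 95% confidence interval
old_minus_new <- c(diff = -2.76, lower = -8.24, upper = 2.72)

# Same comparison expressed as New minus Old
new_minus_old <- flip_comparison(old_minus_new[["diff"]],
                                 old_minus_new[["lower"]],
                                 old_minus_new[["upper"]])
new_minus_old
```

The point estimate still sits at the midpoint of the flipped interval, which is a quick way to check that a reported difference and its interval point in the same direction.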

### 35.3.4 APA-Style Table of Results

If you want to present the results of your one-way ANOVA to a more statistically inclined audience, particularly an audience that prefers American Psychological Association (APA) style, consider using functions from the `apaTables` package.

Using the `aov` function from base R, as we did above, let’s begin by specifying a one-way ANOVA model and naming the model object (`model1`).
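A minimal sketch of that model specification is shown below. The variable names `PostTest` and `Condition` and the data frame name `td` are taken from the code later in this section. The fallback that builds a stand-in `td` from R's built-in `PlantGrowth` data is an assumption added only so that the sketch runs in a fresh session; it is not part of the tutorial data.

```r
# Assumption: `td` was read in earlier in the tutorial. If it is missing,
# build a stand-in with the same column names from the built-in PlantGrowth data.
if (!exists("td")) {
  td <- data.frame(Condition = PlantGrowth$group, PostTest = PlantGrowth$weight)
}

# Specify one-way ANOVA model: outcome to the left of ~, grouping variable to the right
model1 <- aov(PostTest ~ Condition, data=td)

# Print the omnibus F-test for the model
summary(model1)
```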

If you haven’t already, install and access the `apaTables` package (Stanley 2021) using the `install.packages` and `library` functions, respectively.

As a precaution, consider installing the `MBESS` package (Kelley 2021), as the function we are about to use depends on that package. If you don’t have the `MBESS` package installed, you’ll get an error message when you run the `apa.aov.table` function from `apaTables`. You may need to re-access the `apaTables` package using the `library` function after installing the `MBESS` package.

To create an APA-style table that contains model summary information like the sum of squares, degrees of freedom, *F*-value, and *p*-value, we will use the `apa.aov.table` function. As the sole argument in the function, type the name of the one-way ANOVA model object you specified above (`model1`).
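The call described above presumably looks like the export version shown later in this section, just without the `filename=` argument. The sketch below assumes the `apaTables` package is installed and that `td` and `model1` exist from the earlier steps; the `PlantGrowth` fallback is a stand-in added only so the sketch can run on its own.

```r
library(apaTables)

# Assumption: `td` and `model1` exist from the earlier steps. If not, fall back
# on the built-in PlantGrowth data as a stand-in with the same column names.
if (!exists("td")) {
  td <- data.frame(Condition = PlantGrowth$group, PostTest = PlantGrowth$weight)
}
if (!exists("model1")) {
  model1 <- aov(PostTest ~ Condition, data=td)
}

# Create APA-style one-way ANOVA model summary table (print to Console)
apa.aov.table(model1)
```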

```
##
##
## ANOVA results using PostTest as the dependent variable
##
##
##    Predictor        SS df        MS       F    p partial_eta2 CI_90_partial_eta2
##  (Intercept) 130899.24  1 130899.24 1993.59 .000
##    Condition   1333.63  2    666.82   10.16 .000          .22         [.08, .34]
##        Error   4727.52 72     65.66
##
## Note: Values in square brackets indicate the bounds of the 90% confidence interval for partial eta-squared
```

If you would like to write (export) the table to a Word document (.doc), as a second argument, add `filename=` followed by whatever you would like to name the file in quotation marks (`" "`). Make sure you include the .doc file type at the end.

```
# Create APA-style one-way ANOVA model summary table (write to working directory)
apa.aov.table(model1, filename="One-Way ANOVA Summary Table.doc")
```

```
##
##
## ANOVA results using PostTest as the dependent variable
##
##
##    Predictor        SS df        MS       F    p partial_eta2 CI_90_partial_eta2
##  (Intercept) 130899.24  1 130899.24 1993.59 .000
##    Condition   1333.63  2    666.82   10.16 .000          .22         [.08, .34]
##        Error   4727.52 72     65.66
##
## Note: Values in square brackets indicate the bounds of the 90% confidence interval for partial eta-squared
```
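As a cross-check on the table above, the mean squares, *F*-value, and variance-explained estimates can be recomputed by hand from the printed sums of squares and degrees of freedom. The sketch below uses only values that appear in the table; the object names are illustrative, and small discrepancies from the printed values reflect rounding.

```r
# Values copied from the ANOVA table above
ss_condition <- 1333.63
ss_error <- 4727.52
df_condition <- 2
df_error <- 72

# Mean square = sum of squares / degrees of freedom
ms_condition <- ss_condition / df_condition
ms_error <- ss_error / df_error

# Omnibus F = between-groups mean square / within-groups mean square
f_value <- ms_condition / ms_error
p_value <- pf(f_value, df_condition, df_error, lower.tail=FALSE)

# With a single predictor, partial eta-squared equals R-squared
r_squared <- ss_condition / (ss_condition + ss_error)

# Adjusted R-squared corrects for the number of model terms and the sample size
n <- df_condition + df_error + 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_error

round(c(F=f_value, R2=r_squared, R2_adjusted=adj_r_squared), 2)
```

The rounded results (10.16, .22, and .20) match the *F*-value and *R*^{2} values reported in the sample write-up.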

To create a summary table that contains the mean and standard deviation (SD) of the outcome variable for each level of the categorical predictor variable, use the `apa.1way.table` function. As the first argument, type `iv=` followed by the name of the categorical predictor (independent) variable. As the second argument, type `dv=` followed by the name of the continuous outcome (dependent) variable. As the third argument, type `data=` followed by the name of the data frame object to which the predictor and outcome variables belong.
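Putting those three arguments together gives a call along the following lines, which mirrors the export version shown later in this section minus the `filename=` argument. The sketch assumes the `apaTables` package is installed and that `td` exists from earlier in the tutorial; the `PlantGrowth` fallback is a stand-in added only so the sketch can run on its own.

```r
library(apaTables)

# Assumption: `td` exists from earlier in the tutorial. If not, fall back on
# the built-in PlantGrowth data as a stand-in with the same column names.
if (!exists("td")) {
  td <- data.frame(Condition = PlantGrowth$group, PostTest = PlantGrowth$weight)
}

# Create APA-style means/SDs table (print to Console)
apa.1way.table(iv=Condition, dv=PostTest, data=td)
```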

```
##
##
## Descriptive statistics for PostTest as a function of Condition.
##
## Condition M SD
## New 72.36 6.98
## No 62.36 8.09
## Old 69.60 9.11
##
## Note. M and SD represent mean and standard deviation, respectively.
##
```

If you would like to write (export) the table to a Word document (.doc), as a fourth argument, add `filename=` followed by whatever you would like to name the file in quotation marks (`" "`). Make sure you include the .doc file type at the end.

```
# Create APA-style means/SDs table (write to working directory)
apa.1way.table(iv=Condition, dv=PostTest, data=td,
               filename="Means-SDs Table.doc")
```

```
##
##
## Descriptive statistics for PostTest as a function of Condition.
##
## Condition M SD
## New 72.36 6.98
## No 62.36 8.09
## Old 69.60 9.11
##
## Note. M and SD represent mean and standard deviation, respectively.
##
```

To create a summary table that contains the mean and standard deviation (SD) of the outcome variable for each level of the categorical predictor variable *and* Cohen’s *d* values for pairwise comparisons, use the `apa.d.table` function. As the first argument, type `iv=` followed by the name of the categorical predictor (independent) variable. As the second argument, type `dv=` followed by the name of the continuous outcome (dependent) variable. As the third argument, type `data=` followed by the name of the data frame object to which the predictor and outcome variables belong.
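With the same three arguments, the `apa.d.table` call would look roughly as follows, again mirroring the export version shown later in this section minus the `filename=` argument. The sketch assumes the `apaTables` package is installed and that `td` exists from earlier in the tutorial; the `PlantGrowth` fallback is a stand-in added only so the sketch can run on its own.

```r
library(apaTables)

# Assumption: `td` exists from earlier in the tutorial. If not, fall back on
# the built-in PlantGrowth data as a stand-in with the same column names.
if (!exists("td")) {
  td <- data.frame(Condition = PlantGrowth$group, PostTest = PlantGrowth$weight)
}

# Create APA-style means/SDs & Cohen's ds table (print to Console)
apa.d.table(iv=Condition, dv=PostTest, data=td)
```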

```
##
##
## Means, standard deviations, and d-values with confidence intervals
##
##
##   Variable M     SD   1             2
##   1. New   72.36 6.98
##
##   2. No    62.36 8.09 1.32
##                       [0.70, 1.93]
##
##   3. Old   69.60 9.11 0.34          0.84
##                       [-0.22, 0.90] [0.26, 1.42]
##
##
## Note. M indicates mean. SD indicates standard deviation. d-values are estimates calculated using formulas 4.18 and 4.19
## from Borenstein, Hedges, Higgins, & Rothstein (2009). d-values not calculated if unequal variances prevented pooling.
## Values in square brackets indicate the 95% confidence interval for each d-value.
## The confidence interval is a plausible range of population d-values
## that could have caused the sample d-value (Cumming, 2014).
##
```

If you would like to write (export) the table to a Word document (.doc), as a fourth argument, add `filename=` followed by whatever you would like to name the file in quotation marks (`" "`). Make sure you include the .doc file type at the end.

```
# Create APA-style means/SDs & Cohen's ds table (write to working directory)
apa.d.table(iv=Condition, dv=PostTest, data=td,
            filename="Means-SDs & Cohen's ds Table.doc")
```

```
##
##
## Means, standard deviations, and d-values with confidence intervals
##
##
##   Variable M     SD   1             2
##   1. New   72.36 6.98
##
##   2. No    62.36 8.09 1.32
##                       [0.70, 1.93]
##
##   3. Old   69.60 9.11 0.34          0.84
##                       [-0.22, 0.90] [0.26, 1.42]
##
##
## Note. M indicates mean. SD indicates standard deviation. d-values are estimates calculated using formulas 4.18 and 4.19
## from Borenstein, Hedges, Higgins, & Rothstein (2009). d-values not calculated if unequal variances prevented pooling.
## Values in square brackets indicate the 95% confidence interval for each d-value.
## The confidence interval is a plausible range of population d-values
## that could have caused the sample d-value (Cumming, 2014).
##
```
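The *d*-values in the table can be approximated by hand from the printed means and standard deviations. The sketch below assumes equal group sizes (25 employees per condition, as in the sample write-up), in which case the pooled standard deviation is the square root of the average of the two group variances. The helper name `cohens_d_equal_n` is hypothetical, and because the inputs are rounded, results may differ slightly from the package output in later decimal places.

```r
# Hypothetical helper: Cohen's d for two groups of equal size, using the
# pooled standard deviation sqrt((sd1^2 + sd2^2) / 2).
cohens_d_equal_n <- function(m1, sd1, m2, sd2) {
  sd_pooled <- sqrt((sd1^2 + sd2^2) / 2)
  (m1 - m2) / sd_pooled
}

# Means and SDs copied from the table above
d_new_no  <- cohens_d_equal_n(72.36, 6.98, 62.36, 8.09)
d_old_no  <- cohens_d_equal_n(69.60, 9.11, 62.36, 8.09)
d_new_old <- cohens_d_equal_n(72.36, 6.98, 69.60, 9.11)

round(c(new_vs_no=d_new_no, old_vs_no=d_old_no, new_vs_old=d_new_old), 2)
# new_vs_no = 1.32, old_vs_no = 0.84, new_vs_old = 0.34
```

Swapping the order of the two groups in any call reverses the sign of *d* without changing its magnitude.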

### References

*lessR: Less Code, More Results*. https://CRAN.R-project.org/package=lessR.

*MBESS: The MBESS R Package*. https://CRAN.R-project.org/package=MBESS.

*apaTables: Create American Psychological Association (APA) Style Tables*. https://CRAN.R-project.org/package=apaTables.

*effsize: Efficient Effect Size Computation*. https://doi.org/10.5281/zenodo.1480624.

*readr: Read Rectangular Text Data*. https://CRAN.R-project.org/package=readr.