Chapter 54 Investigating Processes Using Path Analysis

In this chapter, we will learn how to apply path analysis in order to investigate processes that influence employees’ performance of a particular behavior.

54.1 Conceptual Overview

In its simplest form, path analysis can represent a single linear regression model (i.e., one outcome, one predictor), but in its more complex forms, path analysis can represent a system of multiple equations. Path analysis is often useful for testing theoretical or conceptual models with multiple outcome variables and/or outcome variables that also serve as predictor variables (e.g., mediators). Both conventional multiple linear regression modeling and structural equation modeling can be used for path analysis; in this tutorial we will focus on the latter approach, as it allows for model estimation in a single step and for the assessment of overall model fit to the data across the equations specified in the model.

Path analysis can be used to evaluate models with presumed causal chains of variables, and this is why the approach is sometimes called causal modeling. We should be careful with how we use the term causal modeling, however, as showing that a chain of variables is linked together in a model does not necessarily mean that the variables are causally related.

In this chapter, we will focus on recursive models, which means the direct relations between two variables presumed to be causally related can only be unidirectional. A nonrecursive model allows for bidirectional associations between variables presumed to be causally related and for a predictor variable to be correlated with the residual error of its presumed outcome variable.

54.1.1 Path Diagram

It is customary to depict a path analysis model visually using a path diagram, and as mentioned above, path analysis can be used to test a theoretical or conceptual model of interest. Let’s use the Theory of Planned Behavior (Ajzen 1991) as an example. For a simplified version of the theory, please refer to Figure 1 below.

Figure 1: Conceptual representation of the Theory of Planned Behavior (Ajzen, 1991)

In a simplified form, the Theory of Planned Behavior posits that an intention to perform a particular behavior influences the individual’s decision to enact that behavior, and attitude toward the behavior, perception of norms pertaining to that behavior, and perception of control over performing that behavior influence the individual’s intention to perform the behavior.

We can specify the Theory of Planned Behavior as a path diagram by first drawing the (manifest, observed) variables and the directional (structural) relations (paths) between the variables as implied by the theory, where rectangles represent the variables and directional arrows represent the directional relations between variables, which is depicted in the path diagram below (see Figure 2).

Figure 2: Path diagram depicting Theory of Planned Behavior

For exogenous variables in the model, we can choose either to estimate their variances and the covariances between them or to leave those variances and covariances unestimated. For our purposes, we’ll practice specifying the variances and covariances between exogenous variables, but note that when missing data are present and full information maximum likelihood is deployed, the decision to estimate the variances of exogenous variables can have additional implications, which go beyond the scope of this tutorial. Exogenous variables serve only as predictor variables in the model, meaning that no other variables in the model predict them and that their causes are presumed to exist outside of the model. As shown in the figure above, variances are typically represented as curved double-headed arrows, where both arrowheads point to the same variable. Covariances are also represented as double-headed arrows, but their arrowheads connect two distinct variables.

(Residual) error terms (which are sometimes called disturbances) are added to endogenous variables, where endogenous variables refer to those variables that are predicted by another variable in the model, meaning that at least one presumed cause is modeled. Note that an endogenous variable may also be specified as the predictor of another endogenous variable, as is the case in our example path diagram representing the Theory of Planned Behavior. In addition, please note that an error term represents variability in an endogenous variable that remains unexplained even after the effects of the specified predictor variables are accounted for. Note, for example, that when we have multiple endogenous variables at the same stage of the model (e.g., first-stage mediators, terminal outcomes), we may choose to allow their error terms to covary.

In sum, the conventional symbols used in a path diagram are shown in Figure 3 below. With the exception of the manifest variable symbol, all of these symbols represent components of the model that can be estimated statistically using path analysis. These model components that can (and will) be estimated are commonly referred to as parameters or free parameters.

Figure 3: Conventional path diagram symbols and their meanings

Importantly, please note that in our path diagram there are no direct relations specified from Attitude to Behavior, Norms to Behavior, or Control to Behavior, which implies that those direct relations are constrained to zero. If we suspected that, in fact, those direct relations are non-zero, then we would draw directional relations (i.e., single-headed arrows) in our path diagram and estimate their values during path analysis.
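
To make the idea of constraining versus freeing direct relations concrete, here is a minimal sketch written in the lavaan model syntax that is introduced in the tutorial later in this chapter. It only builds two model strings and counts their directional relations; it does not estimate anything. The object names (spec_constrained, spec_freed) and the helper function count_paths are illustrative assumptions and are not part of the original tutorial.

```r
# Model implied by the path diagram: the direct relations from attitude,
# norms, and control to behavior are omitted (i.e., constrained to zero)
spec_constrained <- "
intention ~ attitude + norms + control
behavior ~ intention
"

# Hypothetical alternative: the three direct relations to behavior are freed
spec_freed <- "
intention ~ attitude + norms + control
behavior ~ intention + attitude + norms + control
"

# Illustrative helper: count the predictors on the right-hand side
# of each regression equation in a model string
count_paths <- function(spec) {
  eqs <- grep("~", strsplit(spec, "\n")[[1]], value = TRUE)
  rhs <- sub("^.*~", "", eqs)
  sum(lengths(strsplit(rhs, "+", fixed = TRUE)))
}

paths_constrained <- count_paths(spec_constrained)  # 4 directional relations
paths_freed <- count_paths(spec_freed)              # 7 directional relations
```

Each omitted arrow is a parameter fixed at zero rather than estimated, so the freed version has three more directional relations to estimate than the model in our path diagram.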

54.1.2 Model Identification

Model identification has to do with the number of (free) parameters specified in the model relative to the number of unique (non-redundant) sources of information available, and model identification has important implications for assessing model fit and estimating parameters.

Just-identified: In a just-identified model (i.e., saturated model), the number of parameters (e.g., structural relations, variances) is equal to the number of unique (non-redundant) sources of information, which means that the degrees of freedom (df) is equal to zero. In just-identified models, the model parameters can be estimated, but the model fit cannot be assessed in a meaningful way, aside from the R2 value. As specific applications of path analysis, simple linear and multiple linear regression models are always just-identified.

Over-identified: In an over-identified model, the number of parameters (e.g., structural relations, variances) is less than the number of unique (non-redundant) sources of information, which means that the degrees of freedom (df) is greater than zero. In over-identified models, the model parameters can be estimated, and the model fit can be assessed.

Under-identified: In an under-identified model, the number of parameters (e.g., structural relations, variances) is greater than the number of unique (non-redundant) sources of information, which means that the degrees of freedom (df) is less than zero. In under-identified models, the model parameters cannot be estimated and the model fit cannot be assessed. Such models are sometimes called overparameterized because they have more parameters to be estimated than unique (non-redundant) sources of information.

Most (if not all) statistical software packages that allow structural equation modeling (and, by extension, path analysis) automatically compute the degrees of freedom for a model or provide an error message if the model is under-identified. As such, we don’t need to count the number of unique (non-redundant) sources of information and free parameters by hand. With that said, to understand model identification and its various forms at a deeper level, it is often helpful to practice calculating the degrees of freedom by hand when first learning.

The formula for calculating the number of unique (non-redundant) sources of information available for a particular model is as follows:

\(i = \frac{p(p+1)}{2}\)

where \(p\) is the number of manifest (observed) variables to be modeled. This formula calculates the number of possible unique covariances and variances for the variables specified in the model; in other words, it counts the elements in the lower triangle of a covariance matrix, including the variances on the diagonal.

In the path diagram we specified above, there are five manifest variables: Attitude, Norms, Control, Intention, and Behavior. In the following formula, \(p\) is therefore equal to 5, and the number of unique (non-redundant) sources of information is 15.

\(i = \frac{5(5+1)}{2} = \frac{30}{2} = 15\)
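
As a quick check on this arithmetic, the sketch below computes the same quantity in R and cross-checks it by counting the lower-triangle elements (diagonal included) of a 5 x 5 matrix. The function name unique_info and the object names are illustrative assumptions, not part of the original tutorial.

```r
# Number of unique (non-redundant) sources of information
# for p manifest variables: p(p + 1) / 2
unique_info <- function(p) {
  p * (p + 1) / 2
}

p <- 5  # attitude, norms, control, intention, behavior
i <- unique_info(p)

# Cross-check: count the lower-triangle elements (diagonal included)
# of a p x p matrix, which mirrors the layout of a covariance matrix
m <- matrix(0, nrow = p, ncol = p)
i_check <- sum(lower.tri(m, diag = TRUE))

i        # 15
i_check  # 15
```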

To count the number of free parameters (\(k\)), simply add up the number of specified directional relations, variances, covariances, and error terms in the path analysis model. As shown in Figure 4 below, our example path analysis model has 12 free parameters.

\(k = 12\)

To calculate the degrees of freedom (df) for the model, subtract the number of free parameters from the number of unique (non-redundant) sources of information. In this example, the result is 3, as shown below, which means the model is over-identified.

\(df = i - k = 15 - 12 = 3\)

Figure 4: Counting the number of (free) parameters in specified path analysis model
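
The same bookkeeping can be sketched in R. The breakdown of the 12 free parameters by type is read off the model as described above (four directional relations, three exogenous variances, three exogenous covariances, and two error terms); the object names and the helper function classify_identification are illustrative assumptions.

```r
# Free parameters in the example model, tallied by type
k_parts <- c(
  directional_relations = 4,  # attitude, norms, control -> intention; intention -> behavior
  variances             = 3,  # attitude, norms, control
  covariances           = 3,  # attitude-norms, attitude-control, norms-control
  error_terms           = 2   # intention, behavior
)
k <- sum(k_parts)

# Unique sources of information for p = 5 manifest variables
p <- 5
i <- p * (p + 1) / 2

# Degrees of freedom for the model
df_model <- i - k

# Illustrative helper: label the identification status implied by df
classify_identification <- function(df) {
  if (df > 0) {
    "over-identified"
  } else if (df == 0) {
    "just-identified"
  } else {
    "under-identified"
  }
}

k                                  # 12
df_model                           # 3
classify_identification(df_model)  # "over-identified"
```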

54.1.3 Model Fit

When a model is over-identified (df > 0), the extent to which the specified model fits the data can be assessed using various model fit indices, such as the chi-square test (\(\chi^{2}\)), comparative fit index (CFI), Tucker-Lewis index (TLI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). For a commonly cited reference on cutoffs for fit indices, please refer to Hu and Bentler (1999).

Chi-square test: The chi-square test can be used to assess whether the model fits the data, where a statistically significant chi-square value (e.g., p < .05) indicates that the model does not fit the data well and a nonsignificant chi-square value (e.g., p > .05) indicates that the model fits the data reasonably well. The null hypothesis for the chi-square test is that the model fits the data perfectly, and thus failing to reject the null hypothesis provides some confidence that the model fits the data reasonably close to perfectly. The chi-square test is sensitive to sample size and non-normal variable distributions.

Comparative fit index (CFI): As the name implies, the comparative fit index (CFI) is a type of comparative (or incremental) fit index, which means that the CFI compares the focal model to a baseline model, which is commonly referred to as the null or independence model. The CFI is generally less sensitive to sample size than the chi-square test. A CFI value greater than or equal to .90 generally indicates good model fit to the data.

Tucker-Lewis index (TLI): Like the CFI, the Tucker-Lewis index (TLI) is another type of comparative (or incremental) fit index. The TLI is generally less sensitive to sample size than the chi-square test and tends to work well with smaller sample sizes. A TLI value greater than or equal to .95 generally indicates good model fit to the data, although some might relax that cutoff to .90.

Root mean square error of approximation (RMSEA): The root mean square error of approximation (RMSEA) is an absolute fit index that penalizes model complexity (e.g., models with a larger number of estimated parameters) and thus ends up effectively rewarding more parsimonious models. RMSEA values tend to be upwardly biased when the model has few degrees of freedom (i.e., when the model is closer to being just-identified). In general, an RMSEA value that is less than or equal to .08 indicates good model fit to the data, although some relax that cutoff to .10.

Standardized root mean square residual (SRMR): Like the RMSEA, the standardized root mean square residual (SRMR) is an example of an absolute fit index. An SRMR value that is less than or equal to .08 generally indicates good fit to the data.

Summary of model fit indices: The conventional cutoffs for the aforementioned model fit indices – like any rule of thumb – should be applied with caution and with good judgment and intention. Further, these indices don’t always agree with one another, which means that we often look across multiple fit indices and come up with our best judgment of whether the model adequately fits the data. Generally, it is not advisable to interpret model parameter estimates unless the model fits the data adequately. Below is a table of the conventional cutoffs for the model fit indices.

| Fit Index | Cutoff for Adequate Fit |
|---|---|
| \(\chi^{2}\) test | \(p \ge .05\) |
| CFI | \(\ge .90\) |
| TLI | \(\ge .95\) |
| RMSEA | \(\le .08\) |
| SRMR | \(\le .08\) |
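
To see how several of these indices relate to the chi-square statistics, the sketch below recomputes the chi-square p-value, CFI, TLI, and RMSEA from the test statistics that lavaan reports for the full Theory of Planned Behavior model later in this chapter. The formulas are the commonly cited textbook versions, and using N (rather than N - 1) in the RMSEA denominator is an assumption; software may differ in such details. SRMR is omitted because it requires the residual correlation matrix. All object names are illustrative.

```r
# Values taken from the lavaan output for the full model later in this chapter
chisq_model <- 2.023    # user model test statistic
df_model <- 3           # user model degrees of freedom
chisq_base <- 182.295   # baseline (independence) model test statistic
df_base <- 10           # baseline model degrees of freedom
n <- 199                # number of observations

# Chi-square test: upper-tail probability of the model test statistic
p_value <- pchisq(chisq_model, df = df_model, lower.tail = FALSE)

# Excess of each chi-square over its degrees of freedom, floored at zero
d_model <- max(chisq_model - df_model, 0)
d_base <- max(chisq_base - df_base, 0)

# Comparative fit index and Tucker-Lewis index
cfi <- 1 - d_model / max(d_model, d_base)
tli <- ((chisq_base / df_base) - (chisq_model / df_model)) /
  ((chisq_base / df_base) - 1)

# RMSEA (assumption: N rather than N - 1 in the denominator)
rmsea <- sqrt(d_model / (df_model * n))

round(c(p = p_value, CFI = cfi, TLI = tli, RMSEA = rmsea), 3)

# Compare each index against the conventional cutoffs in the table above
c(chisq = p_value >= .05, CFI = cfi >= .90, TLI = tli >= .95, RMSEA = rmsea <= .08)
```

Because the model chi-square (2.023) is smaller than its degrees of freedom (3), the floored excess is zero, which is why the CFI equals 1 and the RMSEA equals 0, and why the TLI (which is not bounded at 1) slightly exceeds 1.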

54.1.4 Parameter Estimates

In path analysis, parameter estimates (e.g., direct relations, covariances, variances) can be interpreted like those from a regression model, where the associated p-values or confidence intervals can be used as indicators of statistical significance.

54.1.5 Statistical Assumptions

The statistical assumptions that should be met prior to running and/or interpreting a path analysis model overlap with those associated with multiple linear regression, and thus for a review of those assumptions please see the chapter on incremental validity using multiple linear regression.

54.1.6 Conceptual Video

For a more in-depth review of path analysis, please check out the following conceptual video.

Link to conceptual video: https://youtu.be/UGIVPtFKwc0

54.2 Tutorial

This chapter’s tutorial demonstrates how to estimate a path analysis model using R.

54.2.1 Video Tutorial

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch the video tutorial below.

Link to video tutorial: https://youtu.be/vMStRfsUTBg

54.2.2 Functions & Packages Introduced

| Function | Package |
|---|---|
| sem | lavaan |
| summary | base R |
| lm | base R |

54.2.3 Initial Steps

If you haven’t already, save the file called “PlannedBehavior.csv” into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., "H:/RWorkshop"). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the setwd function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to Session > Set Working Directory > Choose Directory…. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and Creating & Saving an R Script.

# Set your working directory
setwd("H:/RWorkshop")

Next, read in the .csv data file called “PlannedBehavior.csv” using your choice of read function. In this example, I use the read_csv function from the readr package (Wickham, Hester, and Bryan 2023). If you choose to use the read_csv function, be sure that you have installed and accessed the readr package using the install.packages and library functions. Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months. For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

# Install readr package if you haven't already
# [Note: You don't need to install a package every 
# time you wish to access it]
install.packages("readr")
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
df <- read_csv("PlannedBehavior.csv")
## Rows: 199 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): attitude, norms, control, intention, behavior
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the names of the variables in the data frame (tibble) object
names(df)
## [1] "attitude"  "norms"     "control"   "intention" "behavior"
# Print variable type for each variable in data frame (tibble) object
str(df)
## spc_tbl_ [199 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ attitude : num [1:199] 2.31 4.66 3.85 4.24 2.91 2.99 3.96 3.01 4.77 3.67 ...
##  $ norms    : num [1:199] 2.31 4.01 3.56 2.25 3.31 2.51 4.65 2.98 3.09 3.63 ...
##  $ control  : num [1:199] 2.03 3.63 4.2 2.84 2.4 2.95 3.77 1.9 3.83 5 ...
##  $ intention: num [1:199] 2.5 3.99 4.35 1.51 1.45 2.59 4.08 2.58 4.87 3.09 ...
##  $ behavior : num [1:199] 2.62 3.64 3.83 2.25 2 2.2 4.41 4.15 4.35 3.95 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   attitude = col_double(),
##   ..   norms = col_double(),
##   ..   control = col_double(),
##   ..   intention = col_double(),
##   ..   behavior = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Print first 6 rows of data frame (tibble) object
head(df)
## # A tibble: 6 × 5
##   attitude norms control intention behavior
##      <dbl> <dbl>   <dbl>     <dbl>    <dbl>
## 1     2.31  2.31    2.03      2.5      2.62
## 2     4.66  4.01    3.63      3.99     3.64
## 3     3.85  3.56    4.2       4.35     3.83
## 4     4.24  2.25    2.84      1.51     2.25
## 5     2.91  3.31    2.4       1.45     2   
## 6     2.99  2.51    2.95      2.59     2.2
# Print number of rows in data frame (tibble) object
nrow(df)
## [1] 199

There are 5 variables and 199 cases (i.e., employees) in the df data frame: attitude, norms, control, intention, and behavior. Per the output of the str (structure) function above, all of the variables are of type numeric (continuous: interval/ratio). All of the variables are self-reports from a survey, where respondents rated their level of agreement (1 = strongly disagree, 5 = strongly agree) for the items associated with the attitude, norms, control, and intention scales, and where respondents rated the frequency with which they enact the behavior associated with the behavior variable (1 = never, 5 = all of the time). The attitude variable reflects an employee’s attitude toward the behavior in question. The norms variable reflects an employee’s perception of norms pertaining to the enactment of the behavior. The control variable reflects an employee’s feeling of control over being able to perform the behavior. The intention variable reflects an employee’s intention to enact the behavior. The behavior variable reflects an employee’s perception of the frequency with which they actually engage in the behavior.

54.2.4 Specify & Estimate Path Analysis Models

We will use functions from the lavaan package (Rosseel 2012) to specify and estimate our path analysis model. The lavaan package also allows structural equation modeling with latent variables, but we won’t cover that in this tutorial. If you haven’t already, install and access the lavaan package using the install.packages and library functions, respectively. For background information on the lavaan package, check out the package website.

# Install package
install.packages("lavaan")
# Access package
library(lavaan)

Let’s begin by showing how a multiple linear regression model can be estimated using lavaan and its functions. Specifically, let’s focus on the first part of our Theory of Planned Behavior model, where attitude, norms, and control are proposed as predictors of intention.

  1. Using the <- operator, name the specified model object something of your choosing (here, specmod). To the right of the <- operator, enter quotation marks (" "), and within them, specify your model, which in this case is a multiple linear regression equation: intention ~ attitude + norms + control. Just as we would with other regression functions in R, we use the tilde (~) to separate our outcome variable from our predictor variables, where the outcome variable goes to the left of the tilde, and the predictor variables go to the right.
  2. Using the sem function from lavaan, type the name of the specified regression model (specmod) as the first argument and data= followed by the name of the data frame to which the variables belong (df) as the second argument. Using the <- operator, name the estimated model object something of your choosing; here I name it fitmod for fitted model.
  3. Using the summary function from base R, request a summary of the model fit and parameter estimate results.
  • As the first argument in the summary function, type the name of the estimated model object from the previous step (fitmod).
  • As the second argument, type fit.measures=TRUE to request the model fit indices as part of the output.
  • As the third argument, type rsquare=TRUE to request the unadjusted R2 value for the model.
# Specify path analysis model
specmod <- "
intention ~ attitude + norms + control
"

# Estimate model
fitmod <- sem(specmod, data=df)

# Request summary of results
summary(fitmod, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         4
## 
##   Number of observations                           199
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                                91.633
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -219.244
##   Loglikelihood unrestricted model (H1)       -219.244
##                                                       
##   Akaike (AIC)                                 446.489
##   Bayesian (BIC)                               459.662
##   Sample-size adjusted Bayesian (SABIC)        446.990
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value H_0: RMSEA <= 0.050                       NA
##   P-value H_0: RMSEA >= 0.080                       NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   intention ~                                         
##     attitude          0.352    0.058    6.068    0.000
##     norms             0.153    0.059    2.577    0.010
##     control           0.275    0.058    4.740    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .intention         0.530    0.053    9.975    0.000
## 
## R-Square:
##                    Estimate
##     intention         0.369

Towards the top of the output, alongside “Number of observations”, the number 199 appears, which indicates that 199 employees’ data were included in this analysis. Further below, you will see a zero (0) next to “Degrees of freedom”, which indicates that the model is just-identified. Thus, we will ignore the model fit information and skip down to the “Regressions” section. The path coefficient (i.e., regression coefficient) between attitude and intention is statistically significant and positive (b = .352, p < .001). The path coefficient between norms and intention is statistically significant and positive (b = .153, p = .010). The path coefficient between control and intention is statistically significant and positive (b = .275, p < .001). Under the “R-Square” section, we see the unadjusted R2 value for the outcome variable intention, which is equal to .369; that is, collectively, attitude, norms, and control explain 36.9% of the variance in intention.
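
To connect the columns of the “Regressions” table, the sketch below takes the rounded values reported for the norms path and recomputes the two-sided p-value from the z-value, along with a Wald-type 95% confidence interval from the estimate and its standard error. This is a minimal illustration under a normal approximation; the object names are assumptions, and because the inputs are rounded to three decimals, results can differ slightly from what the software computes internally.

```r
# Rounded values reported above for the path from norms to intention
est <- 0.153         # path (regression) coefficient
se <- 0.059          # standard error
z_reported <- 2.577  # z-value as printed in the output

# The z-value is the estimate divided by its standard error; with inputs
# rounded to three decimals, this only approximates the printed value
z_from_rounded <- est / se

# Two-sided p-value under the standard normal distribution
p_value <- 2 * pnorm(-abs(z_reported))

# Wald-type 95% confidence interval: estimate +/- 1.96 * SE
ci_95 <- est + c(-1, 1) * qnorm(0.975) * se

round(p_value, 3)  # 0.01
round(ci_95, 3)    # 0.037 0.269
```

Because the interval excludes zero, it leads to the same conclusion as the p-value: the path from norms to intention is statistically significant at the .05 level.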

We can also explicitly model the covariances/correlations between the predictor (exogenous) variables in the model by using the double tilde (~~) operator. Note that on separate lines under the regression equation script, we can specify that we want attitude to be able to freely covary with norms and control and for norms to be able to freely covary with control. Everything else in your script remains the same as in our previously specified model. As noted above, in a real-life situation, we should think carefully about what it means to add these covariances to exogenous variables, particularly in the case of estimating the model using full information maximum likelihood and in the presence of missing data. We are specifying these covariances here for demonstration purposes.

# Specify path analysis model
specmod <- "
# Direct relations
intention ~ attitude + norms + control
# Covariances
attitude ~~ norms + control
norms ~~ control
"

# Estimate model
fitmod <- sem(specmod, data=df)

# Request summary of results
summary(fitmod, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 18 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        10
## 
##   Number of observations                           199
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                               136.306
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1011.828
##   Loglikelihood unrestricted model (H1)      -1011.828
##                                                       
##   Akaike (AIC)                                2043.656
##   Bayesian (BIC)                              2076.589
##   Sample-size adjusted Bayesian (SABIC)       2044.908
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value H_0: RMSEA <= 0.050                       NA
##   P-value H_0: RMSEA >= 0.080                       NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   intention ~                                         
##     attitude          0.352    0.058    6.068    0.000
##     norms             0.153    0.059    2.577    0.010
##     control           0.275    0.058    4.740    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   attitude ~~                                         
##     norms             0.200    0.064    3.128    0.002
##     control           0.334    0.070    4.748    0.000
##   norms ~~                                            
##     control           0.220    0.065    3.411    0.001
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .intention         0.530    0.053    9.975    0.000
##     attitude          0.928    0.093    9.975    0.000
##     norms             0.830    0.083    9.975    0.000
##     control           0.939    0.094    9.975    0.000
## 
## R-Square:
##                    Estimate
##     intention         0.369

Just as before, the degrees of freedom (df) for the model is equal to zero (0), which again indicates that this model is just-identified, which is what we would expect from a multiple linear regression model as well. In addition, note that the path coefficients remain the same (along with their associated p-values), and the R2 value remains the same. In the new output, however, the covariances are estimated, and when we estimate the covariances between variables, the variances of the variables involved in the covariances are also estimated by default. Often the covariances and variances are not of substantive interest when interpreting a path analysis model, though.
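
We can confirm by hand why this model is just-identified. The sketch below applies the counting rules from the Model Identification section to the four variables in this model and arrives at the 10 parameters that lavaan reports under “Number of model parameters”. The object names are illustrative assumptions.

```r
# Four manifest variables: attitude, norms, control, intention
p <- 4
i <- p * (p + 1) / 2  # unique sources of information

# Free parameters, tallied by type
k_parts <- c(
  directional_relations = 3,  # attitude, norms, control -> intention
  covariances           = 3,  # attitude-norms, attitude-control, norms-control
  variances             = 3,  # attitude, norms, control
  error_terms           = 1   # residual variance of intention
)
k <- sum(k_parts)

df_model <- i - k

i         # 10
k         # 10
df_model  # 0, i.e., just-identified
```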

Using the lm (linear model) function from base R, we can verify that our path analysis results using the sem function from lavaan are equivalent to multiple linear regression results from a standard linear regression function. For a review of the lm function, please refer to the chapter supplement for estimating incremental validity using multiple linear regression.

# Estimate multiple linear regression model
fitmod <- lm(intention ~ attitude + norms + control, data=df)

# Request summary of results
summary(fitmod)
## 
## Call:
## lm(formula = intention ~ attitude + norms + control, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80282 -0.52734 -0.06018  0.51228  1.85202 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  0.58579    0.23963   2.445        0.0154 *  
## attitude     0.35232    0.05866   6.006 0.00000000913 ***
## norms        0.15250    0.05979   2.550        0.0115 *  
## control      0.27502    0.05862   4.692 0.00000509027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7356 on 195 degrees of freedom
## Multiple R-squared:  0.369,  Adjusted R-squared:  0.3593 
## F-statistic: 38.01 on 3 and 195 DF,  p-value: < 0.00000000000000022

As you can see, aside from the number of digits reported after the decimal, the lm regression coefficients are equivalent to the sem path coefficients. In addition, the unadjusted R2 value is the same.
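
The standard errors and test statistics differ slightly between the two outputs (e.g., 0.058 versus 0.05866 for attitude). One plausible explanation, offered here as an assumption rather than something stated in the output, is that maximum likelihood divides the residual sum of squares by N, whereas lm divides by the residual degrees of freedom. The sketch below checks that this rescaling reproduces the lavaan values from the rounded lm values; the object names are illustrative.

```r
# Values reported in the lm output above
n <- 199                   # number of observations
df_resid <- 195            # residual degrees of freedom
sigma_lm <- 0.7356         # residual standard error
se_attitude_lm <- 0.05866  # standard error for the attitude coefficient

# Residual sum of squares implied by the lm residual standard error
rss <- sigma_lm^2 * df_resid

# Maximum likelihood residual variance: divide by N instead of df_resid
resid_var_ml <- rss / n

# Rescale the lm standard error by the same ratio (on the SD scale)
se_attitude_ml <- se_attitude_lm * sqrt(df_resid / n)

round(resid_var_ml, 3)    # 0.53, matching the .intention variance from sem
round(se_attitude_ml, 3)  # 0.058, matching the sem standard error
```

With 199 cases the two divisors are nearly identical, so the difference is negligible here, but it can become noticeable in small samples.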

Now it’s time to test a full model of the Theory of Planned Behavior by adding the equation in which intention predicts behavior. We will build on our previous path analysis model script by specifying that intention is a predictor of behavior (i.e., behavior ~ intention). Everything else in our script can remain the same.

# Specify path analysis model
specmod <- "
# Direct relations
intention ~ attitude + norms + control
behavior ~ intention
# Covariances
attitude ~~ norms + control
norms ~~ control
"

# Estimate model
fitmod <- sem(specmod, data=df)

# Request summary of results
summary(fitmod, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 18 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        12
## 
##   Number of observations                           199
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 2.023
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.568
## 
## Model Test Baseline Model:
## 
##   Test statistic                               182.295
##   Degrees of freedom                                10
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.019
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1258.517
##   Loglikelihood unrestricted model (H1)      -1257.506
##                                                       
##   Akaike (AIC)                                2541.035
##   Bayesian (BIC)                              2580.555
##   Sample-size adjusted Bayesian (SABIC)       2542.538
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.103
##   P-value H_0: RMSEA <= 0.050                    0.735
##   P-value H_0: RMSEA >= 0.080                    0.120
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.019
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   intention ~                                         
##     attitude          0.352    0.058    6.068    0.000
##     norms             0.153    0.059    2.577    0.010
##     control           0.275    0.058    4.740    0.000
##   behavior ~                                          
##     intention         0.453    0.065    7.014    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   attitude ~~                                         
##     norms             0.200    0.064    3.128    0.002
##     control           0.334    0.070    4.748    0.000
##   norms ~~                                            
##     control           0.220    0.065    3.411    0.001
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .intention         0.530    0.053    9.975    0.000
##    .behavior          0.699    0.070    9.975    0.000
##     attitude          0.928    0.093    9.975    0.000
##     norms             0.830    0.083    9.975    0.000
##     control           0.939    0.094    9.975    0.000
## 
## R-Square:
##                    Estimate
##     intention         0.369
##     behavior          0.198

Again, we have 199 observations (employees) used in this analysis, but this time the degrees of freedom is equal to 3, which indicates that our model is over-identified. Because the model is over-identified, we can interpret the model fit indices to assess how well the model we specified fits the data. First, the chi-square test value appears to the right of the text “Test statistic” under the heading “Model Test User Model”, and the associated p-value appears below it, to the right of the text “P-value (Chi-square)”. The chi-square test is nonsignificant (\(\chi^{2}(df=3)=2.023, p=.568\)), which indicates that the model does not fit significantly worse than a model that fits the data perfectly; thus, we have our first indicator that the model fits the data reasonably well. Second, the comparative fit index (CFI) is 1.000, which is greater than the conventional .90 cutoff, which also indicates that the model fits the data reasonably well. Third, the Tucker-Lewis index (TLI) is 1.019, which is greater than the conventional .95 cutoff, which also indicates that the model fits the data reasonably well; note that the TLI is not bounded at 1 and can exceed it when the chi-square value is smaller than its degrees of freedom, as is the case here. Fourth, the root mean square error of approximation (RMSEA) value of .000 is less than the conventional cutoff of .08, and thus we have further evidence that the model fits the data reasonably well. Fifth, the standardized root mean square residual (SRMR) value of .019 is less than the conventional cutoff of .08, and thus we have further evidence that the model fits the data reasonably well. This is one of the relatively rare occasions in which all of the model fit indices are in agreement with one another, leading us to conclude that the specified model fits the data well. We can now feel comfortable interpreting the parameter estimates.
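To make the link between the output and these fit indices more concrete, the sketch below recomputes several of them by hand in base R. It assumes the model is fit to the covariance matrix only (no mean structure), which is consistent with the 12 parameters reported, and it uses standard textbook formulas for the CFI and TLI; lavaan's internal computations may differ in minor details. All object names are illustrative, and the input values are copied from the output above.

```r
# Values copied from the lavaan summary output above
n_vars    <- 5        # attitude, norms, control, intention, behavior
n_params  <- 12       # "Number of model parameters"
chisq_m   <- 2.023    # user model test statistic
chisq_b   <- 182.295  # baseline model test statistic
df_b      <- 10       # baseline model degrees of freedom
loglik_h0 <- -1258.517  # loglikelihood, user model
loglik_h1 <- -1257.506  # loglikelihood, unrestricted model

# Degrees of freedom: unique variances/covariances minus free parameters
n_moments <- n_vars * (n_vars + 1) / 2
df_m      <- n_moments - n_params

# Chi-square as a likelihood ratio, and the p-value of the reported statistic
lr_stat <- 2 * (loglik_h1 - loglik_h0)
p_value <- pchisq(chisq_m, df = df_m, lower.tail = FALSE)

# CFI and TLI from the user and baseline chi-square values
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 0)
tli <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)

# Collect the hand-computed values for comparison with the output
round(c(df = df_m, lr = lr_stat, p = p_value, cfi = cfi, tli = tli), 3)
```

The 15 unique variances and covariances among the five variables minus the 12 free parameters yields the 3 degrees of freedom, and because the chi-square value (2.023) is smaller than its degrees of freedom (3), the CFI reaches its ceiling of 1 and the TLI exceeds 1.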

The path coefficient (i.e., direct relation, presumed causal relation) between attitude and intention is statistically significant and positive (b = .352, p < .001); the path coefficient between norms and intention is also statistically significant and positive (b = .153, p = .010); and the path coefficient between control and intention is also statistically significant and positive (b = .275, p < .001). Further, the path coefficient between intention and behavior is statistically significant and positive (b = .453, p < .001). The covariances and variances are not of substantive interest, so we will ignore them. The unadjusted R2 value for the outcome variable intention is .369, which indicates that, collectively, attitude, norms, and control explain 36.9% of the variance in intention, and the unadjusted R2 value for the outcome variable behavior is .198, which indicates that intention explains 19.8% of the variance in behavior. Overall, the results lend support to the propositions of the Theory of Planned Behavior, at least based on this sample of employees.
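Although we do not interpret the variances and covariances substantively, they do feed into the R2 values. As a rough sketch, the base R code below reconstructs both R2 values from the rounded estimates in the output, assuming the usual definition of R2 as explained variance divided by model-implied total variance. The object names are illustrative, and because the inputs are rounded to three decimals, the results are only approximate.

```r
# Path coefficients for intention, copied from the output above
b_intention <- c(attitude = 0.352, norms = 0.153, control = 0.275)

# Variances (diagonal) and covariances (off-diagonal) of the predictors
sigma_x <- matrix(
  c(0.928, 0.200, 0.334,
    0.200, 0.830, 0.220,
    0.334, 0.220, 0.939),
  nrow = 3, byrow = TRUE,
  dimnames = list(names(b_intention), names(b_intention))
)

# Residual variances and the path from intention to behavior
resvar_intention <- 0.530
resvar_behavior  <- 0.699
b_behavior       <- 0.453

# Intention: explained variance is b' Sigma b
explained_intention <- as.numeric(t(b_intention) %*% sigma_x %*% b_intention)
var_intention       <- explained_intention + resvar_intention
r2_intention        <- explained_intention / var_intention

# Behavior: explained variance is b^2 times the variance of intention
explained_behavior <- b_behavior^2 * var_intention
r2_behavior        <- explained_behavior / (explained_behavior + resvar_behavior)

round(c(intention = r2_intention, behavior = r2_behavior), 3)
```

Within rounding, these hand-computed values should agree with the R-Square section of the output (.369 for intention and .198 for behavior).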

54.2.5 Additional Information on Model Specification Notation

Thus far we have focused on using the tilde (~) operator to note directional relations (i.e., path coefficients) and the double tilde (~~) to specify a covariance. The plus (+) operator allows us to add variables to one side of the directional relation or covariance equation.

There are equivalent approaches for specifying directional relations. If we want to specify that attitude, norms, and control are predictors of intention we could use either approach shown below when specifying our model.

# Equivalent approaches to specifying directional relations

### Approach 1
specmod <- "
intention ~ attitude + norms + control
"

### Approach 2
specmod <- "
intention ~ attitude
intention ~ norms
intention ~ control
"
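One way to convince yourself that the two approaches are equivalent is to compare the parameter tables that lavaan builds from each specification. The sketch below does so with the lavaanify function from lavaan; the object names (spec1, spec2, pt1, pt2, paths1, paths2) are illustrative, and comparing sorted "lhs op rhs" labels is just one simple way to check equivalence.

```r
# Access lavaan package
library(lavaan)

# Approach 1: all predictors in one formula
spec1 <- "
intention ~ attitude + norms + control
"

# Approach 2: one formula per predictor
spec2 <- "
intention ~ attitude
intention ~ norms
intention ~ control
"

# Parse each specification into a parameter table
pt1 <- lavaanify(spec1)
pt2 <- lavaanify(spec2)

# Reduce each table to a sorted set of "lhs op rhs" labels
paths1 <- sort(paste(pt1$lhs, pt1$op, pt1$rhs))
paths2 <- sort(paste(pt2$lhs, pt2$op, pt2$rhs))

# TRUE if both specifications define the same set of paths
identical(paths1, paths2)
```

The same kind of check could be applied to the covariance and variance specifications discussed in this section.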

The same logic applies if we hypothetically wanted to specify attitude, norms, and control as predictors of both intention and behavior. It’s really up to you how you decide to specify your model when equivalent approaches are possible.

# Equivalent approaches to specifying directional relations

### Approach 1
specmod <- "
intention + behavior ~ attitude + norms + control
"

### Approach 2
specmod <- "
intention ~ attitude + norms + control
behavior ~ attitude + norms + control
"

### Approach 3
specmod <- "
intention ~ attitude
intention ~ norms
intention ~ control
behavior ~ attitude
behavior ~ norms
behavior ~ control
"

Further, there are equivalent approaches for specifying covariances. If we want to specify that attitude, norms, and control are permitted to covary with one another, we could use either approach below.

# Equivalent approaches to specifying covariances

### Approach 1
specmod <- "
attitude ~~ norms + control
norms ~~ control
"

### Approach 2
specmod <- "
attitude ~~ norms
attitude ~~ control
norms ~~ control
"

If you would like to explicitly specify a variance component in your model (even if the model will do this by default for variables with specified covariances), you would use the double tilde (~~) operator with the same variable’s name on either side of the double tilde.

# Specifying variances
specmod <- "
attitude ~~ attitude
norms ~~ norms
control ~~ control
"

Finally, if you have a single exogenous variable in your model (that does not share a covariance with any other variable), it is up to you whether you wish to specify the variance component for that exogenous variable as a free parameter (or not). The model fit and parameter estimates will remain the same.

54.2.6 Summary

In this chapter, we explored the building blocks of path analysis, which allowed us to simultaneously fit a model with more than one outcome variable and with a variable that acts as both a predictor and an outcome. To do so, we used the sem function from the lavaan package and evaluated model fit indices and parameter estimates.

References

Ajzen, Icek. 1991. “The Theory of Planned Behavior.” Organizational Behavior and Human Decision Processes 50: 179–211.
Hu, Li-tze, and Peter M Bentler. 1999. “Cutoff Criteria for Fit Indexes in Covariance Structure Analysis: Conventional Criteria Versus New Alternatives.” Structural Equation Modeling 6 (1): 1–55.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://www.jstatsoft.org/v48/i02/.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2023. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.