Full Information Maximum Likelihood

FIML for Missing Data in lavaan

Descriptive Statistics

Full information maximum likelihood (FIML) is a modern statistical technique for handling missing data. If you are not familiar with FIML, I would recommend the book entitled Applied Missing Data Analysis by Craig Enders. The book is both thorough and accessible, and a good place to start for those not familiar with the ins and outs of modern missing data techniques.

The purpose of this FIML in lavaan section and the related git repository is to take some of the examples of FIML estimation within a regression framework from the Applied Missing Data website and translate them into code for the R package lavaan. The code on the Applied Missing Data website is mostly for Mplus, which is quite expensive software. I hope this will give those who don't have access to Mplus the ability to work through the examples using free and open source software.

In this first subsection I start with the basics: how to get descriptive statistics using FIML. The data and Mplus code for this example can be found on the Book Examples page of the Applied Missing Data website. I also created a github repository with the data and R files with equivalent code in lavaan, which can be found here. Remember to replace the file path in the R code below with the file path to the folder in which you unzip the data files.

You will also want to read over the lavaan documentation and visit the very helpful lavaan website to take advantage of the tutorials there. With these resources at your disposal, you should be able to replicate the examples in lavaan. Here, I walk through the major sections of the R code. This is the same code found in the github repository in the R file entitled FIMLdescriptivesCorrelations.R.
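
One preliminary step: the code below assumes the lavaan package is installed and loaded.

    # Install lavaan once if needed, then load it for this session.
    # install.packages("lavaan")
    library(lavaan)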

Import and prepare data

First, import the data into R. Mplus uses .dat files, which contain only numbers. Variable names are not included in the .dat file, but instead are listed in the Mplus .inp file. I use the read.table function to read the .dat file.

    employee <- read.table("data/employee.dat")

Next, I assign names to the variables in the new data frame.

    # Assign names to variables.
    names(employee) <- c("id", "age", "tenure", "female", "wbeing", 
                         "jobsat", "jobperf", "turnover", "iq")

The final step in preparing the data is to recode the value -99, which marks missing data in the .dat file, to NA, the missing value indicator in R.

    # Replace all missing values (-99) with R missing value character 'NA'.
    employee[employee==-99] <- NA
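
As an optional sanity check (not part of the original example), you can count the missing values in each variable after recoding; based on the coverage output shown later, wbeing and jobsat should each have 160 missing cases.

    # Count the number of NA values in each variable.
    colSums(is.na(employee))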

Create Model Object

Now that the data are ready, I create a character string containing the model in lavaan syntax. For descriptives and correlations I model the means, variances, and covariances/correlations.

    # Create descriptive model object
    model <- '
    # means
    age      ~ 1
    tenure   ~ 1
    female   ~ 1
    wbeing   ~ 1
    jobsat   ~ 1
    jobperf  ~ 1
    turnover ~ 1
    iq       ~ 1
    
    # variances
    age      ~~ age
    tenure   ~~ tenure
    female   ~~ female
    wbeing   ~~ wbeing
    jobsat   ~~ jobsat
    jobperf  ~~ jobperf
    turnover ~~ turnover
    iq       ~~ iq
    
    # covariances/correlations
    age      ~~ tenure + female + wbeing + jobsat + jobperf + turnover + iq
    tenure   ~~ female + wbeing + jobsat + jobperf + turnover + iq
    female   ~~ wbeing + jobsat + jobperf + turnover + iq
    wbeing   ~~ jobsat + jobperf + turnover + iq
    jobsat   ~~ jobperf + turnover + iq
    jobperf  ~~ turnover + iq
    turnover ~~ iq
    '

Fit the Model

To fit the model, I use the lavaan sem function. Its first two arguments are the model and the data. The third argument, missing='fiml', tells lavaan to use FIML (the default is listwise deletion).

    fit <- sem(model, employee, missing='fiml')

Alternatively, you can omit the # means section of the model code and rely on the meanstructure=TRUE argument in the fitting function, which gives the same results:

    fit <- sem(model, employee, missing='fiml', meanstructure=TRUE)
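
To make that concrete, here is a sketch of the reduced model syntax (the object names model_nomeans and fit_nomeans are mine, not from the original files); fitting it with meanstructure=TRUE should reproduce the results above.

    # Same saturated model without the '# means' block; the intercepts
    # are added by the meanstructure = TRUE argument instead.
    model_nomeans <- '
    # variances
    age      ~~ age
    tenure   ~~ tenure
    female   ~~ female
    wbeing   ~~ wbeing
    jobsat   ~~ jobsat
    jobperf  ~~ jobperf
    turnover ~~ turnover
    iq       ~~ iq
    
    # covariances/correlations
    age      ~~ tenure + female + wbeing + jobsat + jobperf + turnover + iq
    tenure   ~~ female + wbeing + jobsat + jobperf + turnover + iq
    female   ~~ wbeing + jobsat + jobperf + turnover + iq
    wbeing   ~~ jobsat + jobperf + turnover + iq
    jobsat   ~~ jobperf + turnover + iq
    jobperf  ~~ turnover + iq
    turnover ~~ iq
    '
    
    fit_nomeans <- sem(model_nomeans, employee, missing='fiml',
                       meanstructure=TRUE)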

Generate Output

To print the results to the console, use the summary function.

    summary(fit, fit.measures=TRUE, standardize=TRUE)

The fit.measures=TRUE argument adds fit statistics to the output, which should look familiar to those who have used Mplus.

lavaan (0.5-16) converged normally after 141 iterations

  Number of observations                           480

  Number of missing patterns                         3

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  P-value (Chi-square)                           1.000

Model test baseline model:

  Minimum Function Test Statistic              527.884
  Degrees of freedom                                28
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -6621.805
  Loglikelihood unrestricted model (H1)      -6621.805

  Number of free parameters                         44
  Akaike (AIC)                               13331.609
  Bayesian (BIC)                             13515.256
  Sample-size adjusted Bayesian (BIC)        13375.604

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent Confidence Interval          0.000  0.000
  P-value RMSEA <= 0.05                          1.000

Standardized Root Mean Square Residual:

  SRMR                                           0.000

The standardize=TRUE argument includes columns with standardized output. The std.all column in the lavaan output is equivalent to the STDYX standardization in Mplus.

Parameter estimates:

  Information                                 Observed
  Standard Errors                             Standard

                   Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
Covariances:
  age ~~
    tenure            8.459    0.858    9.865    0.000    8.459    0.504
    female           -0.028    0.122   -0.229    0.819   -0.028   -0.010
    wbeing            1.148    0.334    3.433    0.001    1.148    0.182
    jobsat            0.861    0.340    2.531    0.011    0.861    0.136
    jobperf          -0.330    0.308   -1.072    0.284   -0.330   -0.049
    turnover         -0.377    0.116   -3.255    0.001   -0.377   -0.150
    iq                0.674    2.066    0.326    0.744    0.674    0.015
  tenure ~~
    female           -0.052    0.071   -0.736    0.462   -0.052   -0.034
    wbeing            0.569    0.195    2.916    0.004    0.569    0.155
    jobsat            0.565    0.200    2.822    0.005    0.565    0.154
    jobperf           0.061    0.178    0.344    0.731    0.061    0.016
    turnover          0.016    0.066    0.240    0.810    0.016    0.011
    iq                0.026    1.199    0.022    0.983    0.026    0.001
  female ~~
    wbeing            0.067    0.031    2.156    0.031    0.067    0.115
    jobsat            0.028    0.031    0.881    0.378    0.028    0.047
    jobperf          -0.009    0.029   -0.323    0.747   -0.009   -0.015
    turnover          0.001    0.011    0.114    0.909    0.001    0.005
    iq                0.284    0.192    1.481    0.139    0.284    0.068
  wbeing ~~
    jobsat            0.446    0.095    4.714    0.000    0.446    0.322
    jobperf           0.671    0.084    8.030    0.000    0.671    0.456
    turnover         -0.141    0.030   -4.768    0.000   -0.141   -0.257
    iq                2.876    0.530    5.430    0.000    2.876    0.291
  jobsat ~~
    jobperf           0.271    0.080    3.378    0.001    0.271    0.184
    turnover         -0.129    0.030   -4.248    0.000   -0.129   -0.234
    iq                4.074    0.566    7.195    0.000    4.074    0.411
  jobperf ~~
    turnover         -0.203    0.028   -7.168    0.000   -0.203   -0.346
    iq                4.496    0.523    8.588    0.000    4.496    0.426
  turnover ~~
    iq               -0.706    0.182   -3.872    0.000   -0.706   -0.180

Intercepts:
    age              37.948    0.245  154.633    0.000   37.948    7.058
    tenure           10.054    0.142   70.601    0.000   10.054    3.222
    female            0.542    0.023   23.817    0.000    0.542    1.087
    wbeing            6.288    0.062  100.701    0.000    6.288    5.349
    jobsat            5.950    0.063   94.052    0.000    5.950    5.053
    jobperf           6.021    0.057  105.262    0.000    6.021    4.805
    turnover          0.321    0.021   15.058    0.000    0.321    0.687
    iq              100.102    0.384  260.475    0.000  100.102   11.889

Variances:
    age              28.908    1.866                     28.908    1.000
    tenure            9.735    0.628                      9.735    1.000
    female            0.248    0.016                      0.248    1.000
    wbeing            1.382    0.107                      1.382    1.000
    jobsat            1.386    0.108                      1.386    1.000
    jobperf           1.570    0.101                      1.570    1.000
    turnover          0.218    0.014                      0.218    1.000
    iq               70.892    4.576                     70.892    1.000

Recall that correlations are standardized covariances, so the correlations are found in the std.all column of the Covariances section. Also, intercepts are means, so the values in the Intercepts section can be interpreted as the FIML means of the variables.
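
If you would rather not read the correlations off the std.all column, you can also ask lavaan for the model-implied moments directly; because this model is saturated, these equal the FIML estimates.

    # FIML (model-implied) correlation matrix of the observed variables.
    inspect(fit, 'cor.ov')
    
    # FIML (model-implied) means of the observed variables.
    inspect(fit, 'mean.ov')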

Finally, to get the missing data pattern and covariance coverage output that Mplus can include in its output, use the following code:

    # Get missing data patterns and covariance coverage similar
    # to that found in Mplus output.
    inspect(fit, 'patterns') 
    inspect(fit, 'coverage')

which leads to the following output:

Missing Data Patterns

    age tenure female wbeing jobsat jobprf turnvr iq
160   1      1      1      1      1      1      1  1
160   1      1      1      1      0      1      1  1
160   1      1      1      0      1      1      1  1

Covariance Coverage


         age   tenure female wbeing jobsat jobprf turnvr iq   
age      1.000                                                
tenure   1.000 1.000                                          
female   1.000 1.000  1.000                                   
wbeing   0.667 0.667  0.667  0.667                            
jobsat   0.667 0.667  0.667  0.333  0.667                     
jobperf  1.000 1.000  1.000  0.667  0.667  1.000              
turnover 1.000 1.000  1.000  0.667  0.667  1.000  1.000       
iq       1.000 1.000  1.000  0.667  0.667  1.000  1.000  1.000

Regression Analysis

Import Data

In this subsection I use FIML to deal with missing data in a multiple regression framework. First, I import the data from a text file named 'employee.dat'. You can download a zip file of the data from the Applied Missing Data website. I also have a github page for these examples here. Remember to replace the file path in the read.table function with the path to the text file on your computer.

    employee <- read.table("data/employee.dat")

Because the original text file does not include variable names, I name the variables in the new data frame:

    names(employee) <- c("id", "age", "tenure", "female", "wbeing", "jobsat", 
                         "jobperf", "turnover", "iq")

Then I recode all data points with the value -99, which indicates a missing value in the original text file, to NA, the missing data value recognized by R.

    employee[employee == -99] <- NA

Create Regression Model Object

Now we are ready to create a character string containing the regression model using the lavaan model syntax conventions. Note that b1 and b2 are labels that will be used later for the Wald test; they are equivalent to placing (b1) and (b2) after these variables in the Mplus code.

    model <- '
    # Regression model 
    jobperf ~ b1*wbeing + b2*jobsat
    
    # Variances
    wbeing ~~ wbeing
    jobsat ~~ jobsat
    
    # Covariance/correlation
    wbeing ~~ jobsat
    '

In addition to the regression model, I also estimated the variances and covariances of the predictors. I did this to replicate the results of the original Mplus example. In Mplus you have to estimate the variances of all of the predictors if any of them have missing data that you would like to model. In lavaan the fixed.x=FALSE argument has the same effect (see below).

Fit the Model

Next, I use the sem function to fit the model.

    fit <- sem(model, employee, missing='fiml', meanstructure=TRUE, 
               fixed.x=FALSE)

Listwise deletion is the default, so the missing='fiml' argument tells lavaan to use FIML instead. I also included the meanstructure=TRUE argument to include the means of the observed variables in the model, and the fixed.x=FALSE argument to estimate the means, variances, and covariances of the predictors. Again, I do this to replicate the results of the original Mplus example.

Generate Output

We are now ready to look at the results.

    summary(fit, fit.measures=TRUE, rsquare=TRUE, standardize=TRUE)

Compared to what we learned in the last section, the only thing new to the summary function is the rsquare=TRUE argument, which, not surprisingly, results in the model R2 being included in the summary output.

I only show the Parameter estimates section here:

Parameter estimates:

  Information                                 Observed
  Standard Errors                             Standard

                   Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
Regressions:
  jobperf ~
    wbeing   (b1)     0.476    0.055    8.665    0.000    0.476    0.447
    jobsat   (b2)     0.027    0.060    0.444    0.657    0.027    0.025

Covariances:
  wbeing ~~
    jobsat            0.467    0.098    4.780    0.000    0.467    0.336

Intercepts:
    jobperf           2.869    0.382    7.518    0.000    2.869    2.289
    wbeing            6.286    0.063   99.692    0.000    6.286    5.338
    jobsat            5.959    0.065   91.836    0.000    5.959    5.055

Variances:
    wbeing            1.387    0.108                      1.387    1.000
    jobsat            1.390    0.109                      1.390    1.000
    jobperf           1.243    0.087                      1.243    0.792

R-Square:

    jobperf           0.208
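
As a quick aside, if you only need the R-square value, you can also pull it out of the fitted object directly rather than through summary.

    # Extract the R-square for the outcome(s) in the fitted model.
    inspect(fit, 'rsquare')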

Wald Test

In lavaan the Wald test is called separately from the estimation function. The test uses the parameter labels assigned in the model syntax above.

    # Wald test is called separately.
    lavTestWald(fit, constraints='b1 == 0
                                  b2 == 0')

Results of Wald Test

$stat
[1] 95.88081

$df
[1] 2

$p.value
[1] 0

There you have it! Regression with FIML in R. But what if you have variables that you are not interested in including in your model, but that may carry information about the missingness in the variables that are in your model? I will talk about that in the next subsection.

Regression Analysis with Auxiliary Variables

Next I demonstrate two methods of using auxiliary variables in a regression model with FIML. Again, I am using data and examples from Craig Enders's website Applied Missing Data. The purpose of these sections is to make the examples on Craig's website, which use Mplus, available to those who prefer to use lavaan.

When using FIML, Mplus allows you to include auxiliary variables: variables that are not part of the analytic model but that help with the estimation of missing values. These may be variables that are correlated with the variables that have missing values, or variables that are predictive of missingness. Either way, the auxiliary variables are not part of the model you wish to estimate. See Craig's book Applied Missing Data Analysis for more information about auxiliary variables.

I attended a workshop where Craig showed us how to use the auxiliary command in Mplus to make use of auxiliary variables. However, lavaan does not have this option. He also showed us what he called a ‘brute force’ method to include auxiliary variables in Mplus. Here is how to do it in lavaan.

Brute Force Method

This model is the same as the one used in the last subsection, where job performance (jobperf) is regressed on wellbeing (wbeing) and job satisfaction (jobsat). These three variables are the only ones we want to model. However, tenure and IQ are related to missingness in these variables, so we want to use them to help us better estimate the model of interest. Including them as predictors in the regression model would let us use all of the available information in these five variables, but it would change the model substantially. Auxiliary variables let us better estimate the original model instead.

Import Data

First, we import the data, name the variables, and recode the -99 values to NA.

    # employeeAuxiliary.R ---------------------------------------------------
    
    # R packages used
    library(lavaan)
    
    # Import text file into R as a data frame.
    employee <- read.table("path/to/file/employee.dat")
    
    # Assign names to variables.
    names(employee) <- c("id", "age", "tenure", "female", "wbeing", "jobsat", 
                         "jobperf", "turnover", "iq")
    
    # Replace all missing values (-99) with R missing value character 'NA'.
    employee[employee==-99] <- NA

Create Regression Model Object (Brute Force)

Basically, the brute force method entails correlating the auxiliary variables with the other auxiliary variables, with the predictors, and with the residual of the outcome variable.

    # The b1* and b2* are labels used in the Wald test below
    model <- '
    jobperf ~ b1*wbeing + b2*jobsat
    wbeing ~~ jobsat
    wbeing ~~ turnover + iq
    jobsat ~~ turnover + iq
    jobperf ~~ turnover + iq
    turnover ~~ iq
    '

Fit and Summarize the Model

    fit <- sem(model, employee, missing='fiml', fixed.x=FALSE, 
               meanstructure=TRUE)
    summary(fit, fit.measures=TRUE, rsquare=TRUE, standardize=TRUE)

Wald Test

We call the Wald test just as we did in the previous subsection.

    lavTestWald(fit, 
                'b1 == 0
                 b2 == 0')

Using the sem.auxiliary Function in semTools

First, load the semTools package:

    library(semTools)

Create Regression Model Object

Next, create a model object containing just the model of interest:

    model2 <- '
    jobperf ~ wbeing + jobsat
    '

Then, create a vector with the names of the auxiliary variables:

    aux.vars <- c('turnover', 'iq')

Fit the Model

Then, fit this new model object to the data.

    fit2 <- sem(model2, employee, missing='fiml', meanstructure=TRUE, fixed.x=FALSE)

Using this model object, fit another model that incorporates the auxiliary variables using the sem.auxiliary function from the semTools package.

    auxfit <- sem.auxiliary(model=fit2, aux=aux.vars, data=employee)

Finally, summarize the model object that includes the auxiliary variables.

    summary(auxfit, fit.measures=TRUE, rsquare=TRUE, standardize=TRUE)
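
As an optional check (not part of the original example), you can compare the focal regression estimates from the brute-force fit and the semTools fit; assuming the semTools fit behaves like a regular lavaan object, the two sets of estimates should be very close.

    # Regression slopes from the brute-force model (object 'fit' above).
    pe_brute <- parameterEstimates(fit)
    pe_brute[pe_brute$op == "~", c("lhs", "rhs", "est", "se")]
    
    # Regression slopes from the semTools auxiliary-variable model.
    pe_aux <- parameterEstimates(auxfit)
    pe_aux[pe_aux$op == "~", c("lhs", "rhs", "est", "se")]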