This tutorial demonstrates how to use full information maximum likelihood (FIML) estimation to deal with missing data in a regression model using
In this post I use FIML to deal with missing data in a multiple regression framework. First, I import the data from a text file named ‘employee.dat’. You can download a zip file of the data from Applied Missing Data website. I also have a github page for these examples here. Remember to replace the file path in the read.table function with the path to the text file location on your computer.
employee <- read.table("data/employee.dat")
Because the original text file does not include variable names, I name the variables in the new data frame:
names(employee) <- c("id", "age", "tenure", "female", "wbeing", "jobsat", "jobperf", "turnover", "iq")
then I recode all data points with the value of -99 in the original text file, which indicates a missing value, to
NA, the missing data value recognized by R.
employee[employee == -99] <- NA
Create Regression Model Object
Now we are ready to create a character string containing the regression model using the
lavaan model conventions. Note that b1 and b2 are labels that will be used later for the Wald test. These labels are equivalent to (b1) and (b2) after these variables in the Mplus code.
model <- ' # Regression model jobperf ~ b1*wbeing + b2*jobsat # Variances wbeing ~~ wbeing jobsat ~~ jobsat # Covariance/correlation wbeing ~~ jobsat '
In addition to the regression model, I also estimated the variances and covariances of the predictors. I did this to replicate the results of the original Mplus example. In Mplus you have to estimate the variances of all of the predictors if any of them have missing data that you would like to model. In
lavaan the fixed.x=FALSE argument has the same effect (see below).
Fit the Model
Next, I use the sem function to fit the model.
fit <- sem(model, employee, missing='fiml', meanstructure=TRUE, fixed.x=FALSE)
Listwise deletion is the default, so the missing=‘fiml’ argument tell
lavaan to use the FIML instead. I also included the meanstructure=TRUE argument to include the means of the observed variables in the model, and the fixed.x=FALSE argument to estimate the means, variances, and covariances. Again, I do this to replicate the results of the original Mplus example.
We are now ready to look at the results.
summary(fit, fit.measures=TRUE, rsquare=TRUE, standardize=TRUE)
Compared to what we learned in the last post, the only thing new to the summary function is the rsquare=TRUE argument, which, not surprisingly, results in the model R2 being included in the summary output.
I only show the Parameter estimates section here:
Parameter estimates: Information Observed Standard Errors Standard Estimate Std.err Z-value P(>|z|) Std.lv Std.all Regressions: jobperf ~ wbeing (b1) 0.476 0.055 8.665 0.000 0.476 0.447 jobsat (b2) 0.027 0.060 0.444 0.657 0.027 0.025 Covariances: wbeing ~~ jobsat 0.467 0.098 4.780 0.000 0.467 0.336 Intercepts: jobperf 2.869 0.382 7.518 0.000 2.869 2.289 wbeing 6.286 0.063 99.692 0.000 6.286 5.338 jobsat 5.959 0.065 91.836 0.000 5.959 5.055 Variances: wbeing 1.387 0.108 1.387 1.000 jobsat 1.390 0.109 1.390 1.000 jobperf 1.243 0.087 1.243 0.792 R-Square: jobperf 0.208
lavaan the Wald test is called separately from the estimation function. This function will use the labels assigned in the model object above.
# Wald test is called seperately. lavTestWald(fit, constraints='b1 == 0 b2 == 0')
Results of Wald Test
$stat  95.88081 $df  2 $p.value  0
There you have it! Regression with FIML in R. But, what if you have variables that you are not interested in incorporating in your model, but may have information about the missingness in the variables that are in your model? I will talk about that in the next post.