Random Forests for Multiple Imputation Based on 'ranger'.
literanger: A fast implementation of random forests for multiple imputation
by stephematician
literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001) or recursive partitioning, particularly suited for high dimensional data (Wright and Ziegler, 2017). literanger redesigned the ranger interface to achieve faster prediction, and is now available as a backend for random forests within 'Multiple Imputation via Chained Equations' (Van Buuren, 2007) in the R package mice.
Efficient serialization, i.e. reading and writing, of a trained random forest is provided via the cereal library (Grant and Voorhies, 2017).
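As a rough sketch of the mice backend mentioned above, the call below asks mice to impute with random forests and to use literanger to fit them. It assumes an installed version of mice whose mice.impute.rf accepts rfPackage="literanger"; check ?mice.impute.rf before relying on it. The nhanes data set ships with mice, and the values of m, maxit and seed are arbitrary choices for illustration.

```r
library(mice)

# Assumption: this mice version forwards rfPackage to mice.impute.rf and
# recognises "literanger" as a backend (see ?mice.impute.rf).
imp <- mice(nhanes, method = "rf", rfPackage = "literanger",
            m = 2, maxit = 2, seed = 1, printFlag = FALSE)

# First completed data set
head(complete(imp, 1))
```

Both mice and literanger need to be installed for this to run.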
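library(literanger)

train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test <- iris[-train_idx, ]
rf_iris <- train(data=iris_train, response_name="Species")
# bagged prediction: aggregated over all trees in the forest
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
# in-bag prediction: a draw from the in-bag samples of one random tree
pred_iris_inbag <- predict(rf_iris, newdata=iris_test,
                           prediction_type="inbag")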
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)
literanger supports reading and writing random forests (serialization). We can save rf_iris from the example above using the function call:
write_literanger(rf_iris, "rf_iris.literanger")
In a new R session, we can read the random forest object in and predict for a new test set:
library(literanger)

test_idx <- sample(nrow(iris), 1/3 * nrow(iris))
iris_test <- iris[test_idx, ]
rf_iris_copy <- read_literanger("rf_iris.literanger")
table(iris_test$Species, predict(rf_iris_copy, newdata=iris_test)$values)
The release can be installed via:
install.packages("literanger")
The development version can be installed using remotes:
remotes::install_github("stephematician/literanger")
Technical details
A minor variation on mice's use of random forests is available: each prediction is drawn from the in-bag samples of a single random tree. The computational effort is therefore constant with respect to the size of the forest (number of trees), in contrast to the original implementation in mice (Doove et al., 2014).
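The in-bag draw described above can be illustrated with a toy sketch in base R. It is not literanger's implementation: the one-split trees and the helper names grow_stump, terminal_node, draw_inbag and predict_bagged are hypothetical, chosen only to show that a draw visits a single tree while a bagged prediction visits every tree.

```r
set.seed(42)

# Toy data: one predictor and a numeric response.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.1, 0.9, 1.0, 1.2, 3.1, 2.9, 3.0, 3.2)

# Hypothetical helper: grow a one-split tree on a bootstrap sample and store
# the in-bag responses that land in each terminal node.
grow_stump <- function(x, y) {
    in_bag <- sample.int(length(y), replace = TRUE)
    x_bag <- x[in_bag]
    y_bag <- y[in_bag]
    split <- mean(x_bag)
    list(split = split,
         left = y_bag[x_bag <= split],
         right = y_bag[x_bag > split])
}

# Hypothetical helper: in-bag responses in the terminal node that x_new reaches.
terminal_node <- function(tree, x_new) {
    node <- if (x_new <= tree$split) tree$left else tree$right
    # fall back to all in-bag responses if the bootstrap left this node empty
    if (length(node) == 0) c(tree$left, tree$right) else node
}

# In-bag draw: pick one random tree, then one random in-bag response from its
# terminal node; the work does not depend on the number of trees.
draw_inbag <- function(forest, x_new) {
    tree <- forest[[sample.int(length(forest), 1)]]
    node <- terminal_node(tree, x_new)
    node[sample.int(length(node), 1)]
}

# Bagged prediction: average the terminal-node means over every tree.
predict_bagged <- function(forest, x_new) {
    mean(vapply(forest, function(tree) mean(terminal_node(tree, x_new)),
                numeric(1)))
}

forest <- replicate(50, grow_stump(x, y), simplify = FALSE)
draw <- draw_inbag(forest, 2)
bagged <- predict_bagged(forest, 2)
```

An imputation routine would call draw_inbag once per missing value, so repeated imputations differ from one another, whereas predict_bagged returns a single averaged value.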
The interface of ranger was redesigned such that the trained forest object can be recycled, and the data for training and prediction are passed without (unnecessary) copies; see ranger issue #304.
- implement variable importance measures;
- probability and survival forests.
Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.
Grant, W. S., and Voorhies, R., 2017. cereal - A C++11 library for serialization. https://uscilab.github.io/cereal/.
Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.
Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(i01), pp. 1-17. doi:10.18637/jss.v077.i01.