Description

Random Forests for Multiple Imputation Based on 'ranger'.

An updated implementation of the R package 'ranger' by Wright and Ziegler (2017) <doi:10.18637/jss.v077.i01> for training and predicting from random forests, particularly suited to high-dimensional data, and for embedding in 'Multiple Imputation by Chained Equations' (MICE) by van Buuren (2007) <doi:10.1177/0962280206074463>. Ensembles of classification and regression trees are currently supported. Sparse data of class 'dgCMatrix' (R package 'Matrix') can be analyzed directly. Conventional bagged predictions are available alongside an efficient prediction for MICE via the algorithm proposed by Doove et al. (2014) <doi:10.1016/j.csda.2013.10.025>. Survival and probability forests are not supported in the update, nor is data of class 'gwaa.data' (R package 'GenABEL'); use the original 'ranger' package for these analyses.

literanger: A fast implementation of random forests for multiple imputation

by stephematician

literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001) or recursive partitioning, particularly suited for high-dimensional data (Wright and Ziegler, 2017). literanger redesigns the ranger interface to achieve faster prediction, and is now available as a backend for random forests within 'Multiple Imputation by Chained Equations' (Van Buuren, 2007) in the R package mice.

Efficient serialization, i.e. reading and writing, of a trained random forest is provided via the cereal library.

Example

library(literanger)

train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test  <- iris[-train_idx, ]
rf_iris <- train(data=iris_train, response_name="Species")
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
pred_iris_inbag  <- predict(rf_iris, newdata=iris_test,
                            prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)

literanger supports serialization, i.e. writing a trained random forest to disk and reading it back. We can save rf_iris from the example above using the function call:

write_literanger(rf_iris, "rf_iris.literanger")

In a new R session, we can read the random forest object in and predict for a new test set:

library(literanger)  # a new session needs the package loaded again

test_idx <- sample(nrow(iris), 1/3 * nrow(iris))
iris_test  <- iris[test_idx, ]
rf_iris_copy <- read_literanger("rf_iris.literanger")
table(iris_test$Species, predict(rf_iris_copy, newdata=iris_test)$values)

Installation

The release can be installed via:

install.packages('literanger')

The development version can be installed using remotes:

remotes::install_gitlab('stephematician/literanger')

Technical details

A minor variation on mice's use of random forests is available: each prediction is drawn from the in-bag samples of a single randomly selected tree. The computational effort per prediction is therefore constant with respect to the size of the forest (number of trees), unlike the original implementation in mice.
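The contrast between a bagged prediction and an in-bag draw can be illustrated with a toy sketch in base R. This is not literanger's implementation or internal data structure: the one-split "stumps", the simulated data, and the helper names grow_stump, leaf_values, predict_bagged and predict_inbag are all hypothetical and exist only for illustration. It also assumes that the in-bag draw is taken from the terminal node that the new observation falls into.

```r
set.seed(42)

# Simulated regression data (illustrative only).
n <- 200
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)

# Grow one stump on a bootstrap (in-bag) sample: a single split on x, with
# the in-bag responses stored in each of the two terminal nodes.
grow_stump <- function(x, y) {
    inbag <- sample(length(y), replace = TRUE)
    split <- median(x[inbag])
    list(split = split,
         left  = y[inbag][x[inbag] <= split],
         right = y[inbag][x[inbag] >  split])
}

# In-bag responses in the terminal node that x_new falls into.
leaf_values <- function(tree, x_new)
    if (x_new <= tree$split) tree$left else tree$right

forest <- replicate(50, grow_stump(x, y), simplify = FALSE)

# Conventional bagged prediction: visit every tree and average the leaf means.
predict_bagged <- function(forest, x_new)
    mean(vapply(forest, function(tree) mean(leaf_values(tree, x_new)),
                numeric(1)))

# In-bag draw: visit ONE randomly chosen tree and sample a single in-bag
# response from its terminal node, so the work per prediction does not grow
# with length(forest).
predict_inbag <- function(forest, x_new) {
    tree <- forest[[sample.int(length(forest), 1)]]
    values <- leaf_values(tree, x_new)
    values[sample.int(length(values), 1)]
}

pred_bagged <- predict_bagged(forest, 0.25)
pred_inbag  <- predict_inbag(forest, 0.25)
```

In this sketch predict_bagged touches all 50 stumps while predict_inbag touches one, and the in-bag draw always returns an observed response value rather than an average, which is the property that makes it usable as an imputation draw.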

The interface of ranger was redesigned such that the trained forest object can be recycled, and the data for training and prediction are passed without (unnecessary) copies; see ranger issue #304.

To-do

Non-exhaustive:

implement variable importance measures;
implement probability and survival forests.

References

Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.

Grant, W. S., and Voorhies, R., 2017. cereal - A C++11 library for serialization. https://uscilab.github.io/cereal/.

Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.

Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01.
