Description

Automatically runs 18 individual models and 14 ensembles of models.

Automatically runs 18 individual models and 14 ensembles on numeric data, for a total of 32 models. The package automatically returns complete results for all 32 models, 30 charts and six tables. The user simply provides the tidy data and answers a few questions (for example, how many times to resample the data). From there the package randomly splits the data into train, test and validation sets, fits each model on the training data, makes predictions on the test and validation sets, measures root mean squared error (RMSE), and removes features whose Variance Inflation Factor exceeds a user-set level. Optional features include scaling all numeric data and four different ways to handle strings in the data. Perhaps the most significant feature is the package's ability to make predictions with the 32 pre-trained models on totally new (untrained) data, if the user selects that feature. This feature alone is a very effective answer to the problem of model reproducibility in data science. The package can also randomly resample the data as many times as the user chooses, giving more reliable results than a single run. The charts include results that are not typically reported: for example, the package automatically calculates the Kolmogorov-Smirnov test for each of the 32 models and plots the results as a bar chart, plots a bias bar chart for each of the 32 models, and produces several plots for exploratory data analysis (such as automatic histograms and boxplots of each numeric column). The package also automatically creates a summary report, both sortable and searchable, for each of the 32 models, including RMSE, bias, train RMSE, test RMSE, validation RMSE, overfitting and duration. The best results on the holdout data often compare favorably with the best published and competition results for the same data sets.
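
As a rough, hedged sketch of the split / fit / measure workflow the package automates (a minimal base-R illustration using MASS::Boston, not the package's exact implementation):

# Minimal sketch of the workflow NumericEnsembles automates.
# Assumes MASS::Boston; medv (column 14) is the target.
library(MASS)

set.seed(42)
n <- nrow(Boston)
split <- sample(c("train", "test", "validation"), n,
                replace = TRUE, prob = c(0.60, 0.20, 0.20))
train <- Boston[split == "train", ]
test  <- Boston[split == "test", ]

fit  <- lm(medv ~ ., data = train)   # linear regression, one of the 18 individual models
pred <- predict(fit, newdata = test)
sqrt(mean((test$medv - pred)^2))     # RMSE on the test (holdout) data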

NumericEnsembles

The goal of NumericEnsembles is to automatically conduct a thorough analysis of numeric data. The user only needs to provide the data and answer a few questions (such as which column to analyze). NumericEnsembles fits 18 individual models to the training data, then makes predictions and tracks accuracy for each individual model. It also builds 14 ensembles from the individual models' predictions, fits each ensemble model, then makes predictions and tracks accuracy for each ensemble. The package automatically returns 30 plots (such as train vs holdout for the best model) and six tables (such as the head of the data), including a grand summary table sorted by accuracy with the best model at the top of the report.
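
As an illustration of the ensemble idea (a hedged sketch only, not the package's exact ensembling method), the individual models' holdout predictions can themselves become the features for a second-stage model:

# Hedged sketch: building an ensemble from individual model predictions.
library(MASS)
library(rpart)

set.seed(1)
idx     <- sample(nrow(Boston), floor(0.60 * nrow(Boston)))
train   <- Boston[idx, ]
holdout <- Boston[-idx, ]

fit_lm <- lm(medv ~ ., data = train)      # individual model 1
fit_rp <- rpart(medv ~ ., data = train)   # individual model 2

# The individual predictions become the columns of the ensemble data
ensemble <- data.frame(lm_pred = predict(fit_lm, holdout),
                       rp_pred = predict(fit_rp, holdout),
                       medv    = holdout$medv)
fit_ens <- lm(medv ~ ., data = ensemble)           # ensemble model fit on those predictions
sqrt(mean((ensemble$medv - predict(fit_ens))^2))   # ensemble RMSE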

Installation

You can install the development version of NumericEnsembles like so:

# install.packages("devtools")   # uncomment if devtools is not already installed
devtools::install_github("InfiniteCuriosity/NumericEnsembles")

Example

NumericEnsembles will automatically build 32 models to predict the median home value (medv, column 14) in the Boston housing data set from the MASS package.

library(NumericEnsembles)
Numeric(data = MASS::Boston,
        colnum = 14,
        numresamples = 2,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "N",
        save_all_trained_models = "N",
        set_seed = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)
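
The remove_VIF_above = 5.00 setting drops predictors whose Variance Inflation Factor exceeds 5. As a hedged sketch of what that check computes (a manual base-R version, not the package's own code):

# VIF for predictor j is 1 / (1 - R^2), where R^2 comes from regressing
# predictor j on all of the other predictors.
vif_one <- function(j, data) {
  r2 <- summary(lm(reformulate(names(data)[-j], names(data)[j]), data = data))$r.squared
  1 / (1 - r2)
}
predictors <- MASS::Boston[, -14]   # drop the target column (medv)
vifs <- sapply(seq_along(predictors), vif_one, data = predictors)
names(vifs) <- names(predictors)
vifs[vifs > 5.00]                   # predictors a cutoff of 5 would remove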

The 32 models, which are all built automatically and without error, are:

  1. Bagging
  2. BayesGLM
  3. BayesRNN
  4. Cubist
  5. Earth
  6. Elastic (optimized by cross-validation)
  7. Ensemble Bagging
  8. Ensemble BayesGLM
  9. Ensemble BayesRNN
  10. Ensemble Cubist
  11. Ensemble Earth
  12. Ensemble Elastic (optimized by cross-validation)
  13. Ensemble Gradient Boosted
  14. Ensemble Lasso (optimized by cross-validation)
  15. Ensemble Linear (tuned)
  16. Ensemble Ridge (optimized by cross-validation)
  17. Ensemble RPart
  18. Ensemble SVM (tuned)
  19. Ensemble Trees
  20. Ensemble XGBoost
  21. GAM (Generalized Additive Models, with smoothing splines)
  22. Gradient Boosted (optimized)
  23. Lasso
  24. Linear (tuned)
  25. Neuralnet
  26. PCR (Principal Components Regression)
  27. PLS (Partial Least Squares)
  28. Ridge (optimized by cross-validation)
  29. RPart
  30. SVM (Support Vector Machines, tuned)
  31. Tree
  32. XGBoost

The 30 plots created automatically are:

  1. Correlation plot of the numeric data (as numbers and colors)
  2. Correlation plot of the numeric data (as circles with colors)
  3. Cook's D Bar Plot
  4. Four plots in one for the most accurate model: Predicted vs actual, Residuals, Histogram of residuals, Q-Q plot
  5. Most accurate model: Predicted vs actual
  6. Most accurate model: Residuals
  7. Most accurate model: Histogram of residuals
  8. Most accurate model: Q-Q plot
  9. Accuracy by resample and model, fixed scales
  10. Accuracy by resample and model, free scales
  11. Holdout RMSE/train RMSE, fixed scales
  12. Holdout RMSE/train RMSE, free scales
  13. Histograms of each numeric column
  14. Boxplots of each numeric column
  15. Predictor vs target variable
  16. Model accuracy bar chart (RMSE)
  17. t-test p-value bar chart
  18. Train vs holdout by resample and model, free scales
  19. Train vs holdout by resample and model, fixed scales
  20. Duration bar chart
  21. Holdout RMSE / train RMSE bar chart
  22. Mean bias bar chart
  23. Mean MSE bar chart
  24. Mean MAE bar chart
  25. Mean SSE bar chart
  26. Kolmogorov-Smirnov test bar chart (see the sketch after this list)
  27. Bias plot by model and resample
  28. MSE plot by model and resample
  29. MAE plot by model and resample
  30. SSE plot by model and resample
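
Chart 26 is based on the Kolmogorov-Smirnov test. As a hedged sketch, ks.test from base R's stats package can compare the distribution of a model's predictions with the distribution of the actual values:

# Two-sample Kolmogorov-Smirnov test: do the predictions follow the same
# distribution as the actual target values?
fit <- lm(medv ~ ., data = MASS::Boston)
ks.test(predict(fit), MASS::Boston$medv)   # returns the D statistic and a p-value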

The tables created automatically (which are both searchable and sortable) are:

  1. Variance Inflation Factor
  2. Correlation of the ensemble
  3. Head of the ensemble
  4. Data summary
  5. Correlation of the data
  6. Grand summary table, which includes (a sketch of these error metrics follows this list):
     • Mean holdout RMSE
     • Standard deviation of mean holdout RMSE
     • t-test value
     • t-test p-value
     • t-test p-value standard deviation
     • Kolmogorov-Smirnov stat mean
     • Kolmogorov-Smirnov stat p-value
     • Kolmogorov-Smirnov stat standard deviation
     • Mean bias
     • Mean bias standard deviation
     • Mean MAE
     • Mean MAE standard deviation
     • Mean MSE
     • Mean MSE standard deviation
     • Mean SSE
     • Mean SSE standard deviation
     • Mean data (the mean of the target column in the original data set)
     • Standard deviation of mean data (the standard deviation of the target column in the original data set)
     • Mean train RMSE
     • Mean test RMSE
     • Mean validation RMSE
     • Holdout vs train mean
     • Holdout vs train standard deviation
     • Duration
     • Duration standard deviation
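
As a hedged sketch of the error metrics summarized in the grand summary table (using one common convention for bias; the package's exact formulas may differ):

# RMSE, bias, MAE, MSE and SSE for a vector of predictions
metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(RMSE = sqrt(mean(err^2)),
    bias = mean(predicted - actual),   # one common convention for bias
    MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    SSE  = sum(err^2))
}
fit <- lm(medv ~ ., data = MASS::Boston)
metrics(MASS::Boston$medv, predict(fit))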

Example using pre-trained models on totally new data in the NumericEnsembles package

The NumericEnsembles package can also train models once and then use those same pre-trained models to make predictions on totally unseen data.

The package contains two example data sets to demonstrate this. Boston_housing is the Boston Housing data set with the first five rows removed; we build our models on that data set. NewBoston is the totally new data: the first five rows that were removed from the original Boston Housing data set.

library(NumericEnsembles)
Numeric(data = Boston_housing,
        colnum = 14,
        numresamples = 25,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "Y",
        set_seed = "N",
        save_all_trained_models = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)

When asked "What is the URL of the new data?", supply the URL for the NewBoston data set: https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv

Any external data set with the same columns as the training data may be used the same way.
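
For example, to inspect such new data before running the analysis, it can be read directly from a URL (here, the NewBoston file above):

# Read the new (unseen) data straight from its URL
new_boston <- read.csv("https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv")
head(new_boston)   # should have the same columns as the training data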

Metadata

Version

0.10.3

License

Unknown

Platforms (76)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows