Description

Automatically runs 18 individual models and 14 ensembles of models.

Automatically runs 18 individual models and 14 ensembles on numeric data, for a total of 32 models. The package automatically returns complete results for all 32 models, 30 charts and six tables. The user simply provides the tidy data and answers a few questions (for example, how many times to resample the data). From there the package randomly splits the data into train, test and validation sets, fits each model on the training data, makes predictions on the test and validation sets, measures root mean squared error (RMSE), and removes features whose Variance Inflation Factor exceeds a user-set level. Optional features include scaling all numeric data and four different ways to handle strings in the data. Perhaps the most significant feature is the package's ability to make predictions with the 32 pre-trained models on totally new (untrained) data, if the user selects that feature. This feature alone is a very effective answer to the problem of model reproducibility in data science. The package can also randomly resample the data as many times as the user chooses, giving more reliable results than a single run. The charts include results that are not typically reported: for example, the package automatically calculates the Kolmogorov-Smirnov test for each of the 32 models and plots the results as a bar chart, plots a bias bar chart for each of the 32 models, and produces several plots for exploratory data analysis (such as automatic histograms and boxplots of each numeric column). The package also automatically creates a summary report, both sortable and searchable, for each of the 32 models, including RMSE, bias, train RMSE, test RMSE, validation RMSE, overfitting and duration. The best results on the holdout data often compare favorably with the best published and competition results for the same data sets.
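
As a rough, hedged sketch of the split / fit / measure workflow the package automates (a minimal base-R illustration using MASS::Boston, not the package's exact implementation):

# Minimal sketch of the workflow NumericEnsembles automates.
# Assumes MASS::Boston; medv (column 14) is the target.
library(MASS)

set.seed(42)
n <- nrow(Boston)
split <- sample(c("train", "test", "validation"), n,
                replace = TRUE, prob = c(0.60, 0.20, 0.20))
train <- Boston[split == "train", ]
test  <- Boston[split == "test", ]

fit  <- lm(medv ~ ., data = train)   # linear regression, one of the 18 individual models
pred <- predict(fit, newdata = test)
sqrt(mean((test$medv - pred)^2))     # RMSE on the test (holdout) data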

NumericEnsembles

The goal of NumericEnsembles is to automatically conduct a thorough analysis of numeric data. The user only needs to provide the data and answer a few questions (such as which column to analyze). NumericEnsembles fits 18 individual models to the training data, then makes predictions and tracks accuracy for each individual model. It also builds 14 ensembles from the individual models' predictions, fits each ensemble model, then makes predictions and tracks accuracy for each ensemble. The package automatically returns 30 plots (such as train vs holdout for the best model) and six tables (such as the head of the data), including a grand summary table sorted by accuracy with the best model at the top of the report.
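
As an illustration of the ensemble idea (a hedged sketch only, not the package's exact ensembling method), the individual models' holdout predictions can themselves become the features for a second-stage model:

# Hedged sketch: building an ensemble from individual model predictions.
library(MASS)
library(rpart)

set.seed(1)
idx     <- sample(nrow(Boston), floor(0.60 * nrow(Boston)))
train   <- Boston[idx, ]
holdout <- Boston[-idx, ]

fit_lm <- lm(medv ~ ., data = train)      # individual model 1
fit_rp <- rpart(medv ~ ., data = train)   # individual model 2

# The individual predictions become the columns of the ensemble data
ensemble <- data.frame(lm_pred = predict(fit_lm, holdout),
                       rp_pred = predict(fit_rp, holdout),
                       medv    = holdout$medv)
fit_ens <- lm(medv ~ ., data = ensemble)           # ensemble model fit on those predictions
sqrt(mean((ensemble$medv - predict(fit_ens))^2))   # ensemble RMSE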

Installation

You can install the development version of NumericEnsembles like so:

# install.packages("devtools")   # uncomment if devtools is not already installed
devtools::install_github("InfiniteCuriosity/NumericEnsembles")

Example

NumericEnsembles will automatically build 32 models to predict the median home value (medv, column 14) in the Boston housing data set from the MASS package.

library(NumericEnsembles)
Numeric(data = MASS::Boston,
        colnum = 14,
        numresamples = 2,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "N",
        save_all_trained_models = "N",
        set_seed = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)
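
The remove_VIF_above = 5.00 setting drops predictors whose Variance Inflation Factor exceeds 5. As a hedged sketch of what that check computes (a manual base-R version, not the package's own code):

# VIF for predictor j is 1 / (1 - R^2), where R^2 comes from regressing
# predictor j on all of the other predictors.
vif_one <- function(j, data) {
  r2 <- summary(lm(reformulate(names(data)[-j], names(data)[j]), data = data))$r.squared
  1 / (1 - r2)
}
predictors <- MASS::Boston[, -14]   # drop the target column (medv)
vifs <- sapply(seq_along(predictors), vif_one, data = predictors)
names(vifs) <- names(predictors)
vifs[vifs > 5.00]                   # predictors a cutoff of 5 would remove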

The 32 models, which are all built automatically and without error, are:

  1. Bagging
  2. BayesGLM
  3. BayesRNN
  4. Cubist
  5. Earth
  6. Elastic (optimized by cross-validation)
  7. Ensemble Bagging
  8. Ensemble BayesGLM
  9. Ensemble BayesRNN
  10. Ensemble Cubist
  11. Ensemble Earth
  12. Ensemble Elastic (optimized by cross-validation)
  13. Ensemble Gradient Boosted
  14. Ensemble Lasso (optimized by cross-validation)
  15. Ensemble Linear (tuned)
  16. Ensemble Ridge (optimized by cross-validation)
  17. Ensemble RPart
  18. Ensemble SVM (tuned)
  19. Ensemble Trees
  20. Ensemble XGBoost
  21. GAM (Generalized Additive Models, with smoothing splines)
  22. Gradient Boosted (optimized)
  23. Lasso
  24. Linear (tuned)
  25. Neuralnet
  26. PCR (Principal Components Regression)
  27. PLS (Partial Least Squares)
  28. Ridge (optimized by cross-validation)
  29. RPart
  30. SVM (Support Vector Machines, tuned)
  31. Tree
  32. XGBoost

The 30 plots created automatically are:

  1. Correlation plot of the numeric data (as numbers and colors)
  2. Correlation plot of the numeric data (as circles with colors)
  3. Cook's D Bar Plot
  4. Four plots in one for the most accurate model: Predicted vs actual, Residuals, Histogram of residuals, Q-Q plot
  5. Most accurate model: Predicted vs actual
  6. Most accurate model: Residuals
  7. Most accurate model: Histogram of residuals
  8. Most accurate model: Q-Q plot
  9. Accuracy by resample and model, fixed scales
  10. Accuracy by resample and model, free scales
  11. Holdout RMSE/train RMSE, fixed scales
  12. Holdout RMSE/train RMSE, free scales
  13. Histograms of each numeric column
  14. Boxplots of each numeric column
  15. Predictor vs target variable
  16. Model accuracy bar chart (RMSE)
  17. t-test p-value bar chart
  18. Train vs holdout by resample and model, free scales
  19. Train vs holdout by resample and model, fixed scales
  20. Duration bar chart
  21. Holdout RMSE / train RMSE bar chart
  22. Mean bias bar chart
  23. Mean MSE bar chart
  24. Mean MAE bar chart
  25. Mean SSE bar chart
  26. Kolmogorov-Smirnov test bar chart (see the sketch after this list)
  27. Bias plot by model and resample
  28. MSE plot by model and resample
  29. MAE plot by model and resample
  30. SSE plot by model and resample
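
Chart 26 is based on the Kolmogorov-Smirnov test. As a hedged sketch, ks.test from base R's stats package can compare the distribution of a model's predictions with the distribution of the actual values:

# Two-sample Kolmogorov-Smirnov test: do the predictions follow the same
# distribution as the actual target values?
fit <- lm(medv ~ ., data = MASS::Boston)
ks.test(predict(fit), MASS::Boston$medv)   # returns the D statistic and a p-value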

The tables created automatically (which are both searchable and sortable) are:

  1. Variance Inflation Factor
  2. Correlation of the ensemble
  3. Head of the ensemble
  4. Data summary
  5. Correlation of the data
  6. Grand summary table, which includes (a sketch of these error metrics follows this list):
     • Mean holdout RMSE
     • Standard deviation of mean holdout RMSE
     • t-test value
     • t-test p-value
     • t-test p-value standard deviation
     • Kolmogorov-Smirnov stat mean
     • Kolmogorov-Smirnov stat p-value
     • Kolmogorov-Smirnov stat standard deviation
     • Mean bias
     • Mean bias standard deviation
     • Mean MAE
     • Mean MAE standard deviation
     • Mean MSE
     • Mean MSE standard deviation
     • Mean SSE
     • Mean SSE standard deviation
     • Mean data (the mean of the target column in the original data set)
     • Standard deviation of mean data (the standard deviation of the target column in the original data set)
     • Mean train RMSE
     • Mean test RMSE
     • Mean validation RMSE
     • Holdout vs train mean
     • Holdout vs train standard deviation
     • Duration
     • Duration standard deviation
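
As a hedged sketch of the error metrics summarized in the grand summary table (using one common convention for bias; the package's exact formulas may differ):

# RMSE, bias, MAE, MSE and SSE for a vector of predictions
metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(RMSE = sqrt(mean(err^2)),
    bias = mean(predicted - actual),   # one common convention for bias
    MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    SSE  = sum(err^2))
}
fit <- lm(medv ~ ., data = MASS::Boston)
metrics(MASS::Boston$medv, predict(fit))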

Example using pre-trained models on totally new data in the NumericEnsembles package

The NumericEnsembles package can also train models once and then use those same pre-trained models to make predictions on totally unseen data.

The package contains two example data sets to demonstrate this. Boston_housing is the Boston Housing data set with the first five rows removed; we build our models on that data set. NewBoston is the totally new data: the first five rows that were removed from the original Boston Housing data set.

library(NumericEnsembles)
Numeric(data = Boston_housing,
        colnum = 14,
        numresamples = 25,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "Y",
        set_seed = "N",
        save_all_trained_models = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)

When asked "What is the URL of the new data?", supply the URL for the NewBoston data set: https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv

Any external data set with the same columns as the training data may be used the same way.
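
For example, to inspect such new data before running the analysis, it can be read directly from a URL (here, the NewBoston file above):

# Read the new (unseen) data straight from its URL
new_boston <- read.csv("https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv")
head(new_boston)   # should have the same columns as the training data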

Metadata

Version

0.10.3

License

Unknown

Platforms (76)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows