Description

Stabilising Variable Selection.

Description

A stable approach to variable selection through stability selection and the use of a permutation-based objective stability threshold. Lima et al (2021) <doi:10.1038/s41598-020-79317-8>, Meinshausen and Buhlmann (2010) <doi:10.1111/j.1467-9868.2010.00740.x>.

README.md

cran.r-project.org

stabiliser

The goal of the package is to provide a flexible method of applying stability selection (Meinshausen and Buhlmann, 2010) with various model types, and the framework for triangulating the results for multiple models (Lima et al., 2021).

stabilise() performs stability selection on a range of models to identify variables truly associated with an outcome.
triangulate() identifies which variables are most likely to be associated with an outcome across all models.
stab_plot() allows visualisation of either stabilise() or triangulate() outputs.
simulate_data() allows the simulation of datasets to facilitate user the comparison of various statistical methods.

Installation

You can install this package from CRAN as follows:

install.packages("stabiliser")

Or using the developmental version as follows:

devtools::install_github("roberthyde/stabiliser")

Usage

The stabiliser_example is a simulated example dataset with 50 observations of the following variables:

1 simulated outcome variable: y
4 variables simulated to be associated with y: causal1, causal2…
95 variables simulated to have no association with y: junk1, junk2…

library(stabiliser)
data("stabiliser_example")

`stabilise()`

To attempt to identify which variables are truly “causal” in this dataset using a selection stability approach, use the stabilise() function as follows:

stable_enet <- stabilise(data = stabiliser_example,
                         outcome = "y")

Access the stability (percentage of bootstrap resamples where a given variable was selected by a given model) results for elastic net as follows:

stable_enet$enet$stability
#> # A tibble: 100 x 7
#>    variable mean_coefficient ci_lower ci_upper bootstrap_p stability stable
#>    <chr>               <dbl>    <dbl>    <dbl>       <dbl>     <dbl> <chr> 
#>  1 causal1              2.58   0.207    5.45             0      90   *     
#>  2 causal3              2.59   0.222    6.49             0      78.5 *     
#>  3 causal2              2.12   0.0803   5.01             0      78   *     
#>  4 junk14               2.58   0.0695   7.31             0      71   <NA>  
#>  5 junk45               2.54   0.0526   5.76             0      63.5 <NA>  
#>  6 junk13               2.55   0.136    6.07             0      60.5 <NA>  
#>  7 junk88               1.88   0.0300   4.87             0      58   <NA>  
#>  8 junk86               1.59   0.110    4.98             0      57   <NA>  
#>  9 junk90               1.98   0.161    5.20             0      56   <NA>  
#> 10 junk65              -1.25  -3.65    -0.0244           0      46.5 <NA>  
#> # ... with 90 more rows

This ranks the variables by stability, and displays the mean coefficients, 95% confidence interval and bootstrap p-value. It also displays whether the variable is deemed “stable”, in this case 3 out of the 4 truly causal variables are identified as “stable”, with no false positives.

By default, this implements an elastic net algorithm over a number of bootstrap resamples of the dataset (200 resamples for small datasets). The stability of each variable is then calculated as the proportion of bootstrap repeats where that variable is selected in the model.

stabilise() also permutes the outcome several times (10 by default for small datasets) and performs the same process on each permuted dataset (20 bootstrap resamples for each by default).

This allows a permutation threshold to be calculated. Variables with a non-permuted stability % above this threshold are deemed “stable” as they were selected in a higher proportion of bootstrap resamples than in the permuted datasets, where we know there is no association between variables and the outcome.

The permutation threshold is available as follows:

stable_enet$enet$perm_thresh
#> [1] 74

`triangulate()`

Our confidence that a given variable is truly associated with a given outcome might be increased if it is identified in multiple model types.

Specify multiple model types (elastic net, mbic and mcp) for comparison using the stabilise() function as follows:

stable_combi <- stabilise(data = stabiliser_example,
                         outcome = "y",
                         models = c("enet",
                                    "mbic",
                                    "mcp"))

The stability of variables is available for all model types as before.

The stability results from these three models stored within stable_combi can be combined using triangulate as follows:

triangulated <- triangulate(stable_combi)
triangulated
#> $combi
#> $combi$stability
#> # A tibble: 100 x 4
#>    variable stability bootstrap_p stable
#>    <chr>        <dbl>       <dbl> <chr> 
#>  1 causal1       62.3     0.00292 *     
#>  2 causal2       46.5     0       *     
#>  3 causal3       45.3     0       *     
#>  4 junk13        38       0       <NA>  
#>  5 junk14        37.8     0       <NA>  
#>  6 junk88        34.5     0       <NA>  
#>  7 junk45        29.5     0       <NA>  
#>  8 junk90        28.8     0       <NA>  
#>  9 junk86        28       0.0204  <NA>  
#> 10 junk65        19.8     0       <NA>  
#> # ... with 90 more rows
#> 
#> $combi$perm_thresh
#> [1] 43

This shows variables consistently being identified as being stable across multiple model types, and consequently increasing our confidence that they are truly associated with the outcome.

Triangulating the results for stability selection across multiple model types generally provides better performance at identifying truly causal variables than single model approaches (Lima et al., 2021). As shown in this example, the triangulated approach identifies 3 out of the 4 truly causal variables as being “stable”, and being highly likely to be associated with the outcome y.

`stab_plot()`

Both stabilise() and triangulate() outputs can be plotted using stab_plot() as follows, with causal and junk variables highlighted for this example:

stab_plot(stabiliser_object = triangulated)

`simulate_data()`

To determine the optimum statistical approach, it can be useful to simulate datasets that are simular to those being explored within a research project, but with known outcomes.

The package includes a simulate_data() function, which allows the user to simulate a dataset as follows:

nrows: the number of rows to simulate.
ncols: The number of columns to simulate.
n_true: The number of variables truly associated with the outcome.
amplitude: The strength of association between true variables and the outcome.

For example, simulating a 5 row dataset, with 5 explanatory variables, where 2 variables are truly associated with the outcome with a signal strength of 8 would be conducted using the code below. Note that the variables simulated to be truly associated with the outcome are labeled as “true_”, and variables that are randomly generated labeled as “junk_”.

simulate_data(nrows=5, 
              ncols=5,
              n_true=2,
              amplitude=8)
#>     outcome    junk_V1    true_V2    true_V3    junk_V4     junk_V5
#> 1  0.221365 -1.4714379  0.1265166 -0.2050986 -1.9603030  0.74586529
#> 2 -4.524082  0.8873447 -0.6569296 -0.4972537  1.0603469  1.03974907
#> 3 -3.948917  0.4930688 -0.4016229 -0.7191238 -0.7291698 -0.79151647
#> 4 -2.617451 -0.2217698 -0.6222874 -0.1586967  0.9645390 -0.03753633
#> 5  4.613613  0.1590784 -0.3425437  1.6222826  1.4840578  0.15464136

It can also be interesting to simulate datasets with no variables associated with the outcome (other than by chance). This can be done by setting n_true to zero (or leaving to the default of zero).

simulate_data(nrows=5, 
              ncols=5,
              n_true=0)
#>      outcome    junk_V1     junk_V2    junk_V3     junk_V4     junk_V5
#> 1  2.2007843 -0.8401319 -1.52055028 -1.1568187  1.94737237 -0.12843586
#> 2 -0.1496611  0.7128174  0.50553131  0.8950869  0.02989857 -0.51574352
#> 3 -0.9676750  1.2115235  0.88489231  0.6895640 -0.39529438 -0.16517106
#> 4  0.3468111  1.3880437 -0.31587319  1.2727221  0.29985638  0.06233299
#> 5 -0.9533052 -1.0614021  0.09315892 -0.4079483 -0.17326474 -0.98466363

By using simulated datasets with no signal other than by chance, it is possible to explore various modeling approaches to determine how many false positive variables might be selected with a given approach.

References

Lima, E., Hyde, R., Green, M., 2021. Model selection for inferential models with high dimensional data: synthesis and graphical representation of multiple techniques. Sci. Rep. 11, 412. https://doi.org/10.1038/s41598-020-79317-8

Meinshausen, N., Buhlmann, P., 2010. Stability selection. J. R. Stat. Soc. Ser. B (Statistical Methodol. 72, 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x.

r-stabiliser

stabiliser

Installation

Usage

`stabilise()`

`triangulate()`

`stab_plot()`

`simulate_data()`

References

Version

License

Status

Source

Homepage

Platforms (75)

stabiliser

Installation

Usage

stabilise()

triangulate()

stab_plot()

simulate_data()

References

Version

License

Status

Source

Homepage

Platforms75 (75)

`stabilise()`

`triangulate()`

`stab_plot()`

`simulate_data()`

Platforms (75)