fake: Flexible Data Simulation Using The Multivariate Normal Distribution
Description
This R package generates artificial data conditionally on pre-specified (simulated or user-defined) relationships between the variables and/or observations. Each observation is drawn from a multivariate Normal distribution whose mean vector and covariance matrix reflect the desired relationships. The outputs can be used to evaluate the performance of variable selection, graphical modelling, or clustering approaches by comparing the true and estimated structures.
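To make the underlying idea concrete, here is a minimal base R sketch of drawing observations from a multivariate Normal distribution with a chosen mean vector and covariance matrix. It is not the package's internal code: the values of `n`, `mu` and `sigma` are hypothetical, and the Cholesky-based sampler is only one of several possible approaches.

```r
# Illustrative only: base R sketch of multivariate Normal sampling,
# not the implementation used inside the package
set.seed(1)

n <- 1000 # number of observations (arbitrary)
mu <- c(0, 1, -1) # hypothetical mean vector

# Hypothetical covariance matrix: variables 1 and 2 are correlated,
# variable 3 is independent of both
sigma <- matrix(c(
  1.0, 0.8, 0.0,
  0.8, 1.0, 0.0,
  0.0, 0.0, 1.0
), nrow = 3, byrow = TRUE)

# chol() returns an upper-triangular R with t(R) %*% R equal to sigma.
# If the rows of z are independent standard Normal vectors,
# the rows of z %*% R have covariance sigma.
R <- chol(sigma)
z <- matrix(rnorm(n * length(mu)), nrow = n)
x <- sweep(z %*% R, 2, mu, "+") # add the mean to every row

# Sample estimates should be close to the specified parameters
round(colMeans(x), 2)
round(cov(x), 2)
```

The zero entries in `sigma` encode the absence of a relationship between variable 3 and the others, which is the kind of structure a variable selection or graphical modelling method would then try to recover.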
Installation
The released version of the package can be installed from CRAN with:
install.packages("fake")
The development version can be installed from GitHub:
remotes::install_github("barbarabodinier/fake")
Main functions
Linear model
library(fake)
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 20)
head(simul$xdata)
head(simul$ydata)
Logistic model
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 20, family = "binomial")
head(simul$ydata)
Structural causal model
set.seed(1)
simul <- SimulateStructural(n = 100, pk = c(3, 2, 3))
head(simul$data)
Gaussian graphical model
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20)
head(simul$data)
Gaussian mixture model
set.seed(1)
simul <- SimulateClustering(n = c(10, 10, 10), pk = 20)
head(simul$data)
Extraction and visualisation of the results
The true model structure is returned in the output of any of the main functions in:
simul$theta
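As an illustration of comparing true and estimated structures, the sketch below simulates data from a Gaussian graphical model and recovers edges with a deliberately naive estimator: partial correlations obtained from the inverse sample covariance matrix, thresholded at an arbitrary cut-off of 0.2. The estimator and the threshold are assumptions made for this example and are not provided by the package. The code also assumes that simul$theta is a symmetric binary adjacency matrix for this model.

```r
library(fake)

set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20)

# Naive estimator (assumption for illustration, not provided by fake):
# partial correlations derived from the inverse sample covariance matrix
omega_hat <- solve(cov(simul$data))
pcor <- -cov2cor(omega_hat)
diag(pcor) <- 0

# Arbitrary threshold: keep pairs with a large absolute partial correlation
theta_hat <- abs(pcor) > 0.2

# Cross-tabulate true and estimated edges over distinct pairs of variables
# (assumes simul$theta is a symmetric 0/1 adjacency matrix)
upper <- upper.tri(pcor)
confusion <- table(
  true = as.numeric(simul$theta[upper]),
  estimated = as.numeric(theta_hat[upper])
)
confusion
```

In practice, the naive estimator would be replaced by the method under evaluation, with the cross-tabulation against simul$theta kept unchanged.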
The functions print(), summary() and plot() can be used on the outputs from the main functions.
Reference
- Barbara Bodinier, Sarah Filippi, Therese Haugdahl Nost, Julien Chiquet and Marc Chadeau-Hyam. Automated calibration for stability selection in penalised regression and graphical models: a multi-OMICs network application exploring the molecular response to tobacco smoking. arXiv (2021).