Description

Missingness Alleviation for Network Analysis.

Provides functionality for estimating cross-sectional network structures representing partial correlations in R, while accounting for missing values in the data. Networks are estimated via neighborhood selection, i.e., node-wise multiple regression, with model selection guided by information criteria. Missing data can be handled primarily via multiple imputation or a maximum likelihood-based approach; deletion techniques are available but secondary <doi:10.31234/osf.io/qpj35>.

mantar - Missingness Alleviation for NeTwork Analysis in R

mantar provides users with several methods for handling missing data in the context of network analysis. Currently, these methods are specifically implemented for network estimation via neighborhood selection using the Bayesian Information Criterion (BIC).

Installation

You can install the development version of mantar from GitHub with:

# install.packages("pak")
pak::pak("kai-nehler/mantar")

After installation, the easiest way to get an overview of the functions and capabilities is to run ?mantar, which opens the package help file. You can also read the rest of this README for an introduction and some examples.

Features

As already described, the package currently focuses on network estimation using neighborhood selection with information criteria for model selection in node-wise regressions. This functionality is available for both complete and incomplete data.

For datasets with missing values, two modern missing-data approaches are implemented:

Two-step Expectation-Maximization (EM): A fast method that estimates the correlation matrix via an EM algorithm using the lavaan package. It performs well when the sample size is very large relative to the amount of missingness and the complexity of the network.
Stacked Multiple Imputation (MI): A more robust approach across a wider range of sample sizes. Multiple imputation is performed using predictive mean matching (PMM) with the mice package. The imputed data sets are stacked into a single data set, and a correlation matrix is estimated from this combined data.

Both methods produce a correlation matrix that is then used to estimate the network via node-wise regressions. It is also possible to compute the correlation matrix using pairwise or listwise deletion. However, these methods are generally not recommended, except in specific cases, such as when data are missing completely at random and the proportion of missingness is very small.
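
To make the step from a correlation matrix to a node-wise regression more concrete, here is a minimal base R sketch. It is not mantar's internal code; the toy data and all object names are made up for illustration. It relies only on the standard result that standardized regression weights can be computed from correlations alone.

```r
# Sketch: regression weights for one node, computed from a correlation matrix.
# Toy data and object names are illustrative and not part of mantar.
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.4 * x2 + rnorm(n)
dat <- data.frame(x1, x2, x3)

R      <- cor(dat)   # with missing data, this matrix would come from EM or stacked MI
target <- 1          # node to be predicted (x1)
preds  <- c(2, 3)    # predictors in the regression (x2, x3)

# Standardized weights: solve R_pp %*% beta = r_pt
beta <- solve(R[preds, preds], R[preds, target])

# Explained variance and residual variance of the standardized target
r2        <- sum(beta * R[preds, target])
resid_var <- 1 - r2

# Cross-check against lm() on standardized data
z       <- as.data.frame(scale(dat))
beta_lm <- coef(lm(x1 ~ x2 + x3, data = z))[-1]
```

Because only R enters the computation, any method that yields a good correlation matrix under missingness can feed the same regression step.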

In addition to network estimation, the package also supports stepwise regression search based on information criteria for a single dependent variable. This regression search is available for both complete and incomplete data and relies on the same two-step EM or stacked MI procedures to handle missing values as the network analysis. While both methods to handle missingness are expected to perform well in this context, no specific simulation study has been conducted to compare their effectiveness for single regression modeling, and thus their relative strengths remain an open question.
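
As a rough illustration of a stepwise search guided by an information criterion, the following sketch uses base R's step() on simulated complete data. It is a stand-in, not mantar's regression search function, and the variable names are hypothetical.

```r
# Sketch: stepwise predictor search for one dependent variable with a BIC penalty.
# Uses stats::step() as a stand-in; this is not mantar's own implementation.
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                      # unrelated to y by construction
y  <- 0.8 * x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

null_fit <- lm(y ~ 1, data = dat)   # start from the intercept-only model
sel <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "both",     # predictors may be added and removed
            k = log(n),             # BIC-type penalty per parameter
            trace = 0)

selected <- names(coef(sel))[-1]    # predictors kept in the final model
```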

Example

The package includes two dummy datasets that resemble a typical psychological dataset, where the number of observations is considerably larger than the number of variables. Although the variables have descriptive names, these are included solely to make the examples more engaging; the data themselves are fully synthetic.

mantar_dummy_full: Fully observed data (no missing values)
mantar_dummy_mis: Data with missing values

These data sets are intended for examples and testing only.

library(mantar)

# Load example data
data(mantar_dummy_full)
data(mantar_dummy_mis)

# Preview the first few rows
head(mantar_dummy_full)
#>   EmoReactivity  TendWorry StressSens  SelfAware  Moodiness    Cautious
#> 1   -0.08824641 -0.2659269 -1.2036137 -2.3499259  0.6693700  0.04102854
#> 2   -0.44657803 -0.4588384 -0.2431794 -0.1656722 -0.3361568  0.88919849
#> 3   -1.06934325 -1.5050242 -0.8986388 -1.0857552  0.2249633  0.77060142
#> 4    0.58282029 -0.5036316 -1.6020000  1.0820676 -0.1858346 -0.03462852
#> 5    0.58791759  0.5972580 -0.5882332  1.7461103  0.7160714  1.58280444
#> 6    0.10224725  0.1494428 -1.0877812 -1.7886107  1.3522197 -0.25494638
#>   ThoughtFuture RespCriticism
#> 1     0.6484939   -0.77992262
#> 2     0.2949630   -0.91747608
#> 3    -1.3519007    0.56000763
#> 4    -0.4702988    0.34653985
#> 5     0.9503597    0.82981174
#> 6    -0.8938618   -0.01593388
head(mantar_dummy_mis)
#>   EmoReactivity  TendWorry StressSens   SelfAware  Moodiness    Cautious
#> 1    -1.7551632 -0.4376210 -0.5774722  0.10562820  0.6614044          NA
#> 2    -1.7551688 -0.7039623  0.9070330  0.03418623  0.6140406  0.83879818
#> 3     2.0493638         NA         NA          NA -0.8872971  0.04830719
#> 4     0.1056282         NA         NA -1.24779117 -0.7298623 -0.62263184
#> 5    -0.6338512  0.4361078 -0.5564631 -0.01032403         NA -0.09690612
#> 6     0.1054382  0.6935808  2.6557231          NA         NA -0.04358574
#>   ThoughtFuture RespCriticism
#> 1     0.7710993    0.37233355
#> 2    -1.5588119   -0.55079199
#> 3            NA   -0.90103222
#> 4    -0.7100126    0.80773402
#> 5     1.0583312    0.20820252
#> 6            NA   -0.03915726

The main function for estimating a network is neighborhood_net(). In the case of fully observed data, the function takes the dataset as input and estimates a network structure using neighborhood selection guided by information criteria. With default arguments, only the dataset needs to be provided.

Information Criteria

The k argument controls the penalty used in model selection for node-wise regressions. It specifies the penalty applied per parameter, where the number of parameters in a regression is the number of predictors + 1:

k = "log(n)" (default): corresponds to the Bayesian Information Criterion (BIC)
k = "2": corresponds to the Akaike Information Criterion (AIC)
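
The role of a per-parameter penalty can be illustrated with base R's AIC() function, which takes the penalty as a numeric k. This is only an analogy on simulated data: mantar's k is passed as a string, and its internal computation may differ.

```r
# Sketch: one information criterion, two per-parameter penalties.
# Simulated data; stats::AIC() is used for illustration, not mantar itself.
set.seed(2)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.3 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_large <- lm(y ~ x1 + x2)

aic_small <- AIC(fit_small, k = 2)        # AIC: penalty of 2 per parameter
bic_small <- AIC(fit_small, k = log(n))   # BIC: penalty of log(n) per parameter

# Extra cost that BIC charges for the one additional predictor, relative to AIC
diff_bic    <- AIC(fit_large, k = log(n)) - AIC(fit_small, k = log(n))
diff_aic    <- AIC(fit_large, k = 2)      - AIC(fit_small, k = 2)
penalty_gap <- diff_bic - diff_aic        # equals log(n) - 2
```

Since log(n) exceeds 2 for any n of 8 or more, the BIC-type penalty favors sparser regressions, and therefore sparser networks, in realistic sample sizes.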

Estimation of Partial Correlation

The pcor_merge_rule argument determines how partial correlations are estimated based on the regression results between two nodes:

"and" (default): a partial correlation is estimated only if both regression weights (from node A to B and from B to A) are non-zero.
"or": a partial correlation is estimated if at least one of the two regression weights is non-zero.

Although both options are available, current simulation evidence suggests that the "and" rule yields more accurate partial correlation estimates than the "or" rule. Therefore, changing this default is not recommended unless you have a specific reason.
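
The two rules can be sketched on a small, made-up matrix of regression weights. The weights and node names below are hypothetical. The conversion to a partial correlation uses the standard identity for a pair of regression weights (sign times the square root of their product); whether mantar computes it in exactly this way is an assumption. How the magnitude is obtained under the "or" rule is not described here, so only the edge sets are compared.

```r
# Sketch: "and" versus "or" rule on hypothetical node-wise regression weights.
# B[i, j] is the weight of predictor j in the regression for node i.
nodes <- c("A", "B", "C")
B <- matrix(c(0.00, 0.30, 0.00,
              0.25, 0.00, 0.20,
              0.10, 0.00, 0.00),
            nrow = 3, byrow = TRUE, dimnames = list(nodes, nodes))

nonzero <- B != 0
adj_and <- nonzero & t(nonzero)   # edge only if both weights are non-zero
adj_or  <- nonzero | t(nonzero)   # edge if at least one weight is non-zero

# Partial correlation under the "and" rule (assumed formula, see text above)
pcor_and <- sign(B) * sqrt(abs(B * t(B)))
pcor_and[!adj_and] <- 0

n_edges_and <- sum(adj_and) / 2   # each edge is counted twice in the matrix
n_edges_or  <- sum(adj_or) / 2
```

In this toy case, the "and" rule keeps only the A-B edge, while the "or" rule additionally includes A-C and B-C, each of which is supported by a single regression.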

Example of Network Estimation without Missing Data

# Estimate network from full data set using BIC and the "and" rule
result <- neighborhood_net(data = mantar_dummy_full, 
                           k = "log(n)", 
                           pcor_merge_rule = "and")
#> No missing values in data. Sample size for each variable is equal to the number of rows in the data.
# View estimated partial correlations
result
#>               EmoReactivity TendWorry StressSens SelfAware Moodiness  Cautious
#> EmoReactivity     0.0000000 0.2617524   0.130019 0.0000000 0.0000000 0.0000000
#> TendWorry         0.2617524 0.0000000   0.000000 0.2431947 0.0000000 0.0000000
#> StressSens        0.1300190 0.0000000   0.000000 0.0000000 0.0000000 0.0000000
#> SelfAware         0.0000000 0.2431947   0.000000 0.0000000 0.0000000 0.0000000
#> Moodiness         0.0000000 0.0000000   0.000000 0.0000000 0.0000000 0.4377322
#> Cautious          0.0000000 0.0000000   0.000000 0.0000000 0.4377322 0.0000000
#> ThoughtFuture     0.0000000 0.2595917   0.000000 0.0000000 0.0000000 0.0000000
#> RespCriticism     0.0000000 0.0000000   0.000000 0.0000000 0.2762595 0.2523658
#>               ThoughtFuture RespCriticism
#> EmoReactivity     0.0000000     0.0000000
#> TendWorry         0.2595917     0.0000000
#> StressSens        0.0000000     0.0000000
#> SelfAware         0.0000000     0.0000000
#> Moodiness         0.0000000     0.2762595
#> Cautious          0.0000000     0.2523658
#> ThoughtFuture     0.0000000     0.0000000
#> RespCriticism     0.0000000     0.0000000

# Create and view a summary of the network estimation
sum_result <- summary(result)
sum_result
#> The density of the estimated network is 0.250
#> 
#> Network was estimated using neighborhood selection with a penalty term of log(n)
#> and the 'and' rule for the inclusion of edges based on a full data set.
#> 
#> The sample sizes used for the nodewise regressions were as follows:
#> EmoReactivity     TendWorry    StressSens     SelfAware     Moodiness 
#>           400           400           400           400           400 
#>      Cautious ThoughtFuture RespCriticism 
#>           400           400           400
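
The reported density is consistent with the proportion of possible edges that are non-zero. As a check, the sketch below rebuilds the partial correlation matrix by hand from the output printed above (it does not access the result object, whose internal structure is not documented here) and counts the edges.

```r
# Sketch: network density recomputed from the partial correlations printed above.
nodes <- c("EmoReactivity", "TendWorry", "StressSens", "SelfAware",
           "Moodiness", "Cautious", "ThoughtFuture", "RespCriticism")

edges <- data.frame(
  from   = c("EmoReactivity", "EmoReactivity", "TendWorry", "TendWorry",
             "Moodiness", "Moodiness", "Cautious"),
  to     = c("TendWorry", "StressSens", "SelfAware", "ThoughtFuture",
             "Cautious", "RespCriticism", "RespCriticism"),
  weight = c(0.2617524, 0.1300190, 0.2431947, 0.2595917,
             0.4377322, 0.2762595, 0.2523658),
  stringsAsFactors = FALSE
)

# Fill a symmetric matrix with the non-zero partial correlations
pcor <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
for (i in seq_len(nrow(edges))) {
  pcor[edges$from[i], edges$to[i]] <- edges$weight[i]
  pcor[edges$to[i], edges$from[i]] <- edges$weight[i]
}

n_edges     <- sum(pcor[upper.tri(pcor)] != 0)   # 7 estimated edges
n_possible  <- choose(length(nodes), 2)          # 28 possible edges
net_density <- n_edges / n_possible              # 7 / 28 = 0.25
```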

In the case of missing data, the neighborhood_net() function offers several additional arguments that control how sample size and missingness are handled.

Calculation of Sample Size

The n_calc argument specifies how the sample size is calculated for each node-wise regression. This affects the penalty term used in model selection.

The available options are:

"individual" (default): Uses the number of non-missing observations for each individual variable. This is the recommended approach.
"average": Uses the average number of non-missing observations across all variables.
"max": Uses the maximum number of non-missing observations across all variables.
"total": Uses the total number of observations in the dataset (i.e., the number of rows).

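The four n_calc options can be read as different summaries of the observed-value counts per variable. The following base R sketch computes each of them for a small hypothetical data frame; whether mantar rounds the average is not specified here, so the sketch leaves it unrounded.

```r
# Sketch: the four sample size definitions on a toy data set with missing values.
# The data frame is hypothetical; this mirrors the descriptions above, not mantar's code.
toy <- data.frame(
  a = c(1, 2, NA, 4, 5, 6),
  b = c(NA, 2, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, 5, NA)
)

n_obs <- colSums(!is.na(toy))   # non-missing observations per variable

n_individual <- n_obs           # "individual": one sample size per variable
n_average    <- mean(n_obs)     # "average": mean across variables
n_max        <- max(n_obs)      # "max": largest per-variable count
n_total      <- nrow(toy)       # "total": number of rows

# A BIC-type penalty per parameter under the "individual" option
penalty_individual <- log(n_individual)
```

Variables with more missing values get a smaller n under "individual", and therefore a slightly smaller log(n) penalty in their own regression.
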
Handling Missing Data

The missing_handling argument specifies how the correlation matrix is estimated when the input data contains missing values. Two approaches are supported:

"two-step-em": Applies a classic Expectation-Maximization (EM) algorithm to estimate the covariance matrix.
"stacked-mi": Applies multiple imputation to create several completed datasets, which are then stacked into a single dataset. A correlation matrix is computed from this stacked data.

If "stacked-mi" is used, the nimp argument controls the number of imputations (default: 20).
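
The stacking idea can be sketched in base R. Note the loud assumption: the imputation step below simply draws each missing value from the observed values of the same variable. It is a placeholder, not predictive mean matching and not what the mice package does, and it serves only to show how imputed data sets are stacked and turned into one correlation matrix.

```r
# Sketch: stack several imputed data sets and compute one correlation matrix.
# impute_once() is a hypothetical placeholder for a real imputation method such as PMM.
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
dat <- data.frame(x1, x2)
dat$x2[sample(n, 30)] <- NA   # 15% missing values on x2

impute_once <- function(d) {
  for (v in names(d)) {
    miss <- is.na(d[[v]])
    if (any(miss)) {
      # Placeholder: random draw from observed values (ignores other variables)
      d[[v]][miss] <- sample(d[[v]][!miss], sum(miss), replace = TRUE)
    }
  }
  d
}

nimp      <- 20
imputed   <- lapply(seq_len(nimp), function(i) impute_once(dat))
stacked   <- do.call(rbind, imputed)   # nimp * n rows
R_stacked <- cor(stacked)              # single correlation matrix from stacked data
```

The stacked data set has nimp times as many rows as the original data, so its row count is not a meaningful sample size; the sample size used in the penalty is determined separately through n_calc.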

Example of Network Estimation with Missing Data

# Estimate network for data set with missing values
result_mis <- neighborhood_net(data = mantar_dummy_mis,
                               n_calc = "individual",
                               missing_handling = "two-step-em",
                               pcor_merge_rule = "and")
# View estimated partial correlations
result_mis
#>               EmoReactivity TendWorry StressSens SelfAware Moodiness  Cautious
#> EmoReactivity     0.0000000 0.1295824   0.230612 0.0000000 0.0000000 0.0000000
#> TendWorry         0.1295824 0.0000000   0.000000 0.2515697 0.0000000 0.0000000
#> StressSens        0.2306120 0.0000000   0.000000 0.0000000 0.0000000 0.0000000
#> SelfAware         0.0000000 0.2515697   0.000000 0.0000000 0.0000000 0.0000000
#> Moodiness         0.0000000 0.0000000   0.000000 0.0000000 0.0000000 0.4768098
#> Cautious          0.0000000 0.0000000   0.000000 0.0000000 0.4768098 0.0000000
#> ThoughtFuture     0.1446363 0.2991518   0.000000 0.0000000 0.0000000 0.0000000
#> RespCriticism     0.0000000 0.0000000   0.000000 0.3008107 0.1930326 0.2210164
#>               ThoughtFuture RespCriticism
#> EmoReactivity     0.1446363     0.0000000
#> TendWorry         0.2991518     0.0000000
#> StressSens        0.0000000     0.0000000
#> SelfAware         0.0000000     0.3008107
#> Moodiness         0.0000000     0.1930326
#> Cautious          0.0000000     0.2210164
#> ThoughtFuture     0.0000000     0.0000000
#> RespCriticism     0.0000000     0.0000000

# Create and view a summary of the network estimation
sum_result_mis <- summary(result_mis)
sum_result_mis
#> The density of the estimated network is 0.321
#> 
#> Network was estimated using neighborhood selection on data with missing values.
#> Missing data were handled using 'two-step-em'.
#> The penalty term was log(n) and the 'and' rule was used for edge inclusion.
#> 
#> The sample sizes used for the nodewise regressions were as follows:
#> EmoReactivity     TendWorry    StressSens     SelfAware     Moodiness 
#>           427           426           425           428           424 
#>      Cautious ThoughtFuture RespCriticism 
#>           423           422           420
