Numeric Matrices K-NN and PCA Imputation.
slideimp
{slideimp} is a lightweight R package for fast K-NN and PCA imputation of missing values in high-dimensional numeric matrices.
Core functions
- knn_imp(): Full-matrix K-NN imputation with multi-core parallelization, a {mlpack} KD/Ball-Tree nearest-neighbor implementation (for data with very low missing rates and extremely high dimensions), and optional subset imputation (ideal for epigenetic clock calculations).
- pca_imp(): Optimized version of missMDA::imputePCA() for high-dimensional numeric matrices.
- slide_imp(): Sliding window K-NN or PCA imputation for extremely high-dimensional numeric matrices with ordered features (e.g., by genomic position).
- group_imp(): Parallelizable group-wise (e.g., by chromosomes or column clusters) K-NN or PCA imputation with optional auxiliary features and group-wise parameters.
- group_features(): Helper function for group_imp() that creates groups from a mapping data.frame (e.g., Illumina manifests). See {slideimp.extra} on GitHub for tools to process common Illumina manifests.
- tune_imp(): Parallelizable hyperparameter tuning with repeated cross-validation; works with built-in or custom imputation functions.
Installation
The stable version of {slideimp} can be installed from CRAN using:
install.packages("slideimp")
You can install the development version of {slideimp} with:
pak::pkg_install("hhp94/slideimp")
Workflow
Let’s simulate some DNA methylation (DNAm) microarray data from 2 chromosomes. All {slideimp} functions expect the input to be a numeric matrix where variables are stored in the columns.
library(slideimp)
# Simulate data from 2 chromosomes
set.seed(1234)
sim_obj <- sim_mat(m = 20, n = 50, perc_NA = 0.3, perc_col_NA = 1, nchr = 2)
# Here we see that variables are stored in rows
sim_obj$input[1:5, 1:5]
#> s1 s2 s3 s4 s5
#> feat1 0.2391314 0.0000000 0.5897476 0.4201222 NA
#> feat2 NA 0.2810446 0.3677927 NA 0.6387734
#> feat3 0.7203854 0.1600776 0.5027545 NA 0.5556735
#> feat4 0.0000000 0.1816453 0.3608640 0.3356484 0.6394179
#> feat5 0.5827582 0.3774313 0.2801131 0.5047049 0.5761809
# So we t() to put the variables in columns
obj <- t(sim_obj$input)
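Before imputing, it can help to confirm the orientation and the per-feature missing rate. The following is a minimal base R sketch on a small hand-made matrix; `raw` and `obj_toy` are hypothetical names used only for illustration and are not part of {slideimp}.

```r
# Toy matrix stored with features in rows, like `sim_obj$input` above
raw <- matrix(
  c(0.2, NA, 0.7, 0.0, 0.3, 0.2, 0.6, 0.4, NA),
  nrow = 3,
  dimnames = list(paste0("feat", 1:3), paste0("s", 1:3))
)

# Transpose so that samples are rows and features are columns
obj_toy <- t(raw)

# Basic checks: a numeric matrix with features as column names
stopifnot(is.matrix(obj_toy), is.numeric(obj_toy))
colnames(obj_toy)

# Proportion of missing values per feature (column)
col_na_rate <- colMeans(is.na(obj_toy))
col_na_rate

# Features that are entirely missing cannot be informed by their own values
names(col_na_rate)[col_na_rate == 1]
```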
We can optionally estimate the prediction accuracy of different methods and tune hyperparameters prior to imputation with tune_imp().
For custom functions (.f argument), the parameters data.frame must include the columns corresponding to the arguments passed to the custom function. The custom function must accept obj as the first argument and return a matrix with the same dimensions as obj.
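As an illustration of this contract, here is a minimal base R sketch of a custom imputer and of the masking idea behind estimating accuracy: hide some known values, impute, and compare against the hidden truth. The function `trimmed_mean_imp()` and its `trim` argument are hypothetical, and the evaluation loop is a simplified stand-in, not the procedure tune_imp() runs internally.

```r
# Hypothetical custom imputer: fills NAs with each column's trimmed mean.
# It takes `obj` first and returns a matrix with the same dimensions.
trimmed_mean_imp <- function(obj, trim) {
  col_fill <- apply(obj, 2, mean, trim = trim, na.rm = TRUE)
  missing <- is.na(obj)
  obj[missing] <- col_fill[col(obj)[missing]]
  obj
}

set.seed(1)
# Complete toy matrix: 10 samples (rows) x 6 features (columns)
truth <- matrix(
  runif(60),
  nrow = 10,
  dimnames = list(paste0("s", 1:10), paste0("feat", 1:6))
)

# Hide one known value per column to act as a held-out test set
held_out <- cbind(row = sample(10, 6), col = 1:6)
masked <- truth
masked[held_out] <- NA

imputed <- trimmed_mean_imp(masked, trim = 0.1)

# Contract check: same dimensions, no remaining NAs
stopifnot(identical(dim(imputed), dim(masked)), !anyNA(imputed))

# Error on the held-out entries only
err <- imputed[held_out] - truth[held_out]
c(mae = mean(abs(err)), rmse = sqrt(mean(err^2)))
```

Following the contract above, a function like this would be paired with a parameters table that has a `trim` column.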
We tune with 2 repeats (rep = 2) for illustration; use more repeats in actual analyses.
knn_params <- tibble::tibble(k = c(5, 20))
# Parallelization is controlled by `cores` only for K-NN (`knn_imp` or `slide_imp` with K-NN)
tune_knn <- tune_imp(obj, parameters = knn_params, cores = 2, rep = 2)
#> Tuning knn_imp
#> Step 1/2: Injecting NA
#> Running in parallel...
#> Step 2/2: Tuning
compute_metrics(tune_knn)
#> # A tibble: 12 × 7
#> k cores param_set rep .metric .estimator .estimate
#> <dbl> <dbl> <int> <int> <chr> <chr> <dbl>
#> 1 5 2 1 1 mae standard 0.178
#> 2 5 2 1 1 rmse standard 0.225
#> 3 5 2 1 1 rsq standard 0.00454
#> 4 20 2 2 1 mae standard 0.149
#> 5 20 2 2 1 rmse standard 0.190
#> 6 20 2 2 1 rsq standard 0.0172
#> 7 5 2 1 2 mae standard 0.202
#> 8 5 2 1 2 rmse standard 0.259
#> 9 5 2 1 2 rsq standard 0.00960
#> 10 20 2 2 2 mae standard 0.172
#> 11 20 2 2 2 rmse standard 0.219
#> 12 20 2 2 2 rsq standard 0.0850
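To compare parameter sets, the per-repeat estimates can be averaged. The sketch below uses base R only and re-enters the values printed above by hand so that it runs on its own; in practice the same aggregation could presumably be applied to the tibble returned by compute_metrics(). The names `metrics_df` and `summary_df` are illustrative.

```r
# Values copied from the `compute_metrics(tune_knn)` output above
metrics_df <- data.frame(
  k = rep(rep(c(5, 20), each = 3), times = 2),
  rep = rep(1:2, each = 6),
  .metric = rep(c("mae", "rmse", "rsq"), times = 4),
  .estimate = c(
    0.178, 0.225, 0.00454, 0.149, 0.190, 0.0172,
    0.202, 0.259, 0.00960, 0.172, 0.219, 0.0850
  )
)

# Mean of each metric across repeats, per value of k
summary_df <- aggregate(.estimate ~ k + .metric, data = metrics_df, FUN = mean)
summary_df[order(summary_df$.metric, summary_df$k), ]
```

Lower mae and rmse indicate smaller errors on the held-out values, while a higher rsq indicates better agreement.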
For PCA and custom functions, set up parallelization with mirai::daemons().
mirai::daemons(2) # 2 Cores
# Note: for PCA and custom functions, parallelization is controlled by
# `mirai::daemons()` and the `cores` argument is ignored.
# PCA imputation. Specified by the `ncp` column in the `pca_params` tibble.
pca_params <- tibble::tibble(ncp = c(1, 5))
tune_pca <- tune_imp(obj, parameters = pca_params, rep = 2)
# The parameters have `mean` and `sd` columns.
custom_params <- tibble::tibble(mean = 1, sd = 0)
# This function imputes missing values with `rnorm()` draws using the given `mean` and `sd`.
custom_function <- function(obj, mean, sd) {
  missing <- is.na(obj)
  obj[missing] <- rnorm(sum(missing), mean = mean, sd = sd)
  return(obj)
}
tune_custom <- tune_imp(obj, parameters = custom_params, .f = custom_function, rep = 2)
mirai::daemons(0) # Close daemons
Next, if the variables can be meaningfully grouped (e.g., by chromosome), prefer group-wise imputation with group_imp().
group_imp() allows imputation to be performed separately within defined groups (e.g., by chromosome), which significantly reduces runtime and can increase accuracy for both K-NN and PCA imputation.
group_imp() requires a group tibble, preferably created with group_features(), with three list-columns:
- features (required): a list-column where each element is a character vector of variable names to be imputed together.
- aux (optional): auxiliary variables to include in each group.
- parameters (optional): group-specific imputation parameters.
In this example, we have data from 2 chromosomes, so the group tibble should have two rows (one per chromosome), with the corresponding variables listed in the features column for each row.
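The mechanics of group-wise imputation can be sketched in base R: split feature names by group, impute each block of columns separately, and write the results back. This is only an illustration of the idea; the mapping columns (`feature`, `chr`) and the stand-in imputer `impute_row_mean()` are hypothetical, and group_imp() uses K-NN or PCA rather than row means.

```r
set.seed(42)
# Toy matrix: 6 samples x 6 features, with a few missing values
obj_toy <- matrix(
  runif(36),
  nrow = 6,
  dimnames = list(paste0("s", 1:6), paste0("feat", 1:6))
)
obj_toy[cbind(c(1, 3, 5), c(2, 4, 6))] <- NA

# Hypothetical mapping of features to chromosomes
mapping <- data.frame(
  feature = colnames(obj_toy),
  chr = rep(c("chr1", "chr2"), each = 3)
)

# One element per group, each a character vector of feature names.
# This mirrors the shape of the `features` list-column described above.
features_by_chr <- split(mapping$feature, mapping$chr)

# Stand-in imputer: fill each NA with that sample's mean over the block
impute_row_mean <- function(block) {
  row_means <- rowMeans(block, na.rm = TRUE)
  missing <- is.na(block)
  block[missing] <- row_means[row(block)[missing]]
  block
}

# Impute each group using only that group's columns
imputed <- obj_toy
for (feats in features_by_chr) {
  imputed[, feats] <- impute_row_mean(obj_toy[, feats, drop = FALSE])
}
imputed
```

Because each block only sees its own columns, the work per group is smaller, and the groups are independent of each other, which is what makes this pattern easy to parallelize.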
PCA-based imputation with group_imp() can be parallelized using the {mirai} package, similar to how parallelization is done with tune_imp().
# Use the `group_features()` helper function
group_df <- group_features(obj, sim_obj$group_feature)
group_df
# K-NN imputation with k = 5 (in practice, pick k based on the `tune_imp` results).
knn_group_results <- group_imp(obj, group = group_df, k = 5, cores = 2)
# PCA imputation with ncp = 3. Similar to `tune_imp`, parallelization is
# controlled by `mirai::daemons()`
mirai::daemons(2)
pca_group_results <- group_imp(obj, group = group_df, ncp = 3)
mirai::daemons(0)
Alternatively, full matrix imputation can be performed using knn_imp() or pca_imp().
full_knn_results <- knn_imp(obj = obj, k = 5)
full_pca_results <- pca_imp(obj = obj, ncp = 5)
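For intuition about what the ncp argument controls, here is a simplified base R sketch of iterative PCA imputation: start from column means, then repeatedly fit a rank-ncp approximation and refresh only the missing entries. It is an assumption-laden illustration, not the regularized algorithm of missMDA::imputePCA() and not the implementation inside pca_imp(); the function name `iterative_pca_impute()` and its `max_iter` and `tol` arguments are hypothetical.

```r
iterative_pca_impute <- function(obj, ncp = 2, max_iter = 100, tol = 1e-6) {
  stopifnot(is.matrix(obj), is.numeric(obj), ncp >= 1, ncp <= min(dim(obj)))
  missing <- is.na(obj)

  # Initialize missing entries with column means
  filled <- obj
  col_means <- colMeans(obj, na.rm = TRUE)
  filled[missing] <- col_means[col(obj)[missing]]

  for (i in seq_len(max_iter)) {
    # Center, then build a rank-`ncp` approximation via SVD
    mu <- colMeans(filled)
    centered <- sweep(filled, 2, mu)
    s <- svd(centered, nu = ncp, nv = ncp)
    low_rank <- s$u %*% diag(s$d[seq_len(ncp)], nrow = ncp) %*% t(s$v)
    fitted <- sweep(low_rank, 2, mu, "+")

    # Update only the originally missing entries
    change <- sum((filled[missing] - fitted[missing])^2)
    filled[missing] <- fitted[missing]
    if (change < tol) break
  }
  filled
}

set.seed(123)
# Toy data with 2 underlying components plus a little noise
scores <- matrix(rnorm(40), nrow = 20)
loadings <- matrix(rnorm(16), nrow = 2)
x <- scores %*% loadings + matrix(rnorm(160, sd = 0.05), nrow = 20)

x_na <- x
x_na[sample(length(x), 16)] <- NA
res <- iterative_pca_impute(x_na, ncp = 2)
anyNA(res)
```

A larger ncp lets the approximation follow more structure in the data but can also fit noise, which is why it is treated as a tuning parameter.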
Sliding Window Imputation
Sliding window imputation can be performed using slide_imp(). Note: DNAm WGBS/EM-seq data should be grouped by chromosome and converted to either beta or M values before sliding window imputation. See the vignette for more details.
chr1_beta <- t(sim_mat(m = 10, n = 2000, perc_NA = 0.3, perc_col_NA = 1, nchr = 1)$input)
dim(chr1_beta)
#> [1] 10 2000
chr1_beta[1:5, 1:5]
#> feat1 feat2 feat3 feat4 feat5
#> s1 NA 0.7297743 NA NA 0.3968039
#> s2 0.7346970 NA 0.5669140 0.3236858 0.3932419
#> s3 NA NA NA 0.3108793 NA
#> s4 0.5401526 0.5779956 0.4271064 NA 0.3309645
#> s5 0.6457875 NA 0.7308792 0.4803642 0.5929590
# From the tune results, choose window size of 50, overlap of size 5 between windows,
# K-NN imputation using k = 10. Specify `ncp` for sliding window PCA imputation.
slide_imp(obj = chr1_beta, n_feat = 50, n_overlap = 5, k = 10, cores = 2, .progress = FALSE)
#> ImputedMatrix (KNN)
#> Dimensions: 10 x 2000
#>
#> feat1 feat2 feat3 feat4 feat5
#> s1 0.5067435 0.7297743 0.5884198 0.5063839 0.3968039
#> s2 0.7346970 0.4551576 0.5669140 0.3236858 0.3932419
#> s3 0.5625864 0.4790436 0.5316400 0.3108793 0.5234974
#> s4 0.5401526 0.5779956 0.4271064 0.5551127 0.3309645
#> s5 0.6457875 0.4006866 0.7308792 0.4803642 0.5929590
#>
#> # Showing [1:5, 1:5] of full matrix
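To make the n_feat and n_overlap arguments concrete, the base R sketch below lays out one plausible set of windows over 2000 ordered features. The helper `window_layout()` is hypothetical; how slide_imp() actually handles the final window and combines values in overlapping regions may differ.

```r
# Hypothetical helper: start/end column indices of overlapping windows
window_layout <- function(n_total, n_feat, n_overlap) {
  stopifnot(n_overlap >= 0, n_overlap < n_feat, n_feat <= n_total)
  step <- n_feat - n_overlap
  starts <- seq(1, n_total, by = step)
  ends <- pmin(starts + n_feat - 1, n_total)
  # Drop trailing windows that are fully contained in the previous one
  keep <- !duplicated(ends)
  data.frame(start = starts[keep], end = ends[keep])
}

windows <- window_layout(n_total = 2000, n_feat = 50, n_overlap = 5)
head(windows, 3)
tail(windows, 2)

# Number of features shared by each pair of consecutive windows
shared <- windows$end[-nrow(windows)] - windows$start[-1] + 1
table(shared)
```

In this layout each window advances by n_feat - n_overlap features, so consecutive windows share n_overlap columns.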