OPTICS K-Xi Density-Based Clustering.
OPTICS k-Xi
This R package provides a novel cluster extraction method for the OPTICS algorithm, OPTICS k-Xi, along with ggplot2 visualizations and a framework to compare clustering models with varying parameters using distance-based metrics.
Summary
Density-based clustering methods are well adapted to the clustering of high-dimensional data and enable the discovery of core groups of various shapes despite large amounts of noise.
The opticskxi R package provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework to compare k-Xi models using distance-based metrics, for investigating datasets with an unknown number of clusters. The vignette first introduces density-based algorithms with simulated datasets, then presents and evaluates the k-Xi cluster extraction method. Finally, the model comparison framework is described and applied to two genetic datasets to identify groups and their discriminating features.
The k-Xi algorithm is a novel OPTICS cluster extraction method that takes the number of clusters as a direct input and, unlike the OPTICS Xi method, does not require fine-tuning of the steepness parameter. Combined with a framework that compares models with varying parameters, the OPTICS k-Xi method can identify groups in noisy datasets with an unknown number of clusters.
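To make the contrast with OPTICS Xi concrete, here is a minimal sketch of one way a target number of clusters could replace manual tuning of the steepness. It is not the package's implementation: the helper `sketch_k_xi`, its `xi_grid` default, and the rule of keeping the k largest clusters are assumptions made for illustration, built on `dbscan::optics` and `dbscan::extractXi`. The vignette describes the actual procedure.

```r
library(dbscan)

# Hypothetical helper (not part of opticskxi): sweep the steepness xi from
# strict to permissive and stop at the first value that yields at least k
# clusters of at least `pts` points, then keep the k largest of them.
sketch_k_xi <- function(optics_obj, k, pts,
                        xi_grid = seq(0.3, 0.01, by = -0.01)) {
  n <- length(optics_obj$order)
  for (xi in xi_grid) {
    # extractXi may warn when no cluster is found for a given xi
    res <- suppressWarnings(extractXi(optics_obj, xi = xi))
    labels <- res$cluster
    if (is.null(labels)) next
    sizes <- table(labels[labels != 0])      # 0 is the noise label
    big <- names(sizes)[sizes >= pts]
    if (length(big) >= k) {
      keep <- names(sort(sizes[big], decreasing = TRUE))[seq_len(k)]
      # Relabel the kept clusters 1..k, everything else becomes noise (0)
      return(ifelse(labels %in% keep, match(labels, keep), 0L))
    }
  }
  rep(0L, n)                                 # no xi in the grid gave k clusters
}

# Simulated data: two compact groups (illustrative only)
set.seed(1)
x <- rbind(matrix(rnorm(200, 0, 0.3), ncol = 2),
           matrix(rnorm(200, 3, 0.3), ncol = 2))
opt <- optics(x, minPts = 10)
labels <- sketch_k_xi(opt, k = 2, pts = 20)
table(labels)
```

The point of the sketch is that the user supplies `k` and a minimum cluster size, and the steepness becomes an internal search variable rather than a parameter to tune by hand.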
Installation
Using the devtools package in R:
devtools::install_git('https://framagit.org/thomaschln/opticskxi.git')
Usage
Compute OPTICS profile and k-Xi clustering
library(opticskxi)
data('multishapes')
optics_shapes <- dbscan::optics(multishapes[1:2])
kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30)
Visualize with ggplot2
ggplot_optics(optics_shapes)
ggplot_kxi_profile(kxi_shapes)
Compare multiple k-Xi models in a dataset with an unknown number of clusters and visualize the best models:
- Compute k-Xi models with varying parameters and their distance-based metrics
data('hla')
m_hla <- hla[-c(1:2)] %>% scale
df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30, 40),
  dist = c('manhattan', 'euclidean', 'abscorrelation', 'abspearson'))
df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla)
- Visualize the metrics and OPTICS profiles of the models with the highest average silhouette width
ggplot_kxi_metrics(df_kxi_hla, n = 8)
gtable_kxi_profiles(df_kxi_hla) %>% plot
- Extract the second best model and visualize the clusters using PCA dimension reduction
best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2)
clusters_hla <- best_kxi_hla$clusters
fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>%
  ggpairs('Clusters', ellipses = TRUE, variables = TRUE)
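The average silhouette width used above to rank models can be illustrated with a short base R sketch. The helper `avg_silhouette` is hypothetical and is not how the package computes its metrics; in particular, treating label 0 as noise and excluding it is an assumption made here.

```r
# Hypothetical helper: average silhouette width computed by hand from a
# distance matrix. Points labelled 0 are treated as noise and excluded
# (an assumption of this sketch).
avg_silhouette <- function(x, labels, method = "euclidean") {
  keep <- labels != 0
  d <- as.matrix(dist(x[keep, , drop = FALSE], method = method))
  labels <- labels[keep]
  ids <- unique(labels)
  stopifnot(length(ids) >= 2)                # undefined for a single cluster
  widths <- vapply(seq_along(labels), function(i) {
    own <- labels == labels[i]
    if (sum(own) < 2) return(0)              # singleton: width 0 by convention
    # a: mean distance to the other members of the point's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(ids, labels[i]),
                    function(k) mean(d[i, labels == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Two well-separated simulated groups
set.seed(42)
x <- rbind(matrix(rnorm(100, 0, 0.2), ncol = 2),
           matrix(rnorm(100, 5, 0.2), ncol = 2))
good <- rep(1:2, each = 50)    # labels matching the true groups
bad  <- rep(1:2, times = 50)   # labels alternating across the groups

avg_silhouette(x, good)        # close to 1: compact, well-separated clusters
avg_silhouette(x, bad)         # near 0: clusters overlap completely
```

A higher value means points sit closer to their own cluster than to any other, which is why it is a reasonable criterion for comparing models fitted with different parameters.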
See the vignette for results and further details.
Acknowledgements
This work was inspired by Jérôme Wojcik (Precision for Medicine) and Sviatoslav Voloshynovskiy (University of Geneva).
License
This package is free and open source software, licensed under GPL-3.