MyNixOS website logo
Description

Implementation of the DISCO Metric for Internal Clustering Evaluation.

Implementation of the DISCO (Density-based Internal Score for Clusterings with nOise) metric, a cluster validity index for evaluating density-based clustering results without ground truth labels. DISCO is the first index to explicitly assess the quality of noise point assignments in addition to cluster quality. It uses density-connectivity distance derived from a minimum spanning tree of the mutual-reachability graph, providing interpretable, bounded scores in [-1, 1]. Higher scores indicate better clustering. Based on Beer et al. (2025) <doi:10.48550/arXiv.2503.00127>.

discoCVI

Implementation of the DISCO Metric for Internal Clustering Evaluation

R implementation of the DISCO (Density-based Internal Score for Clustering Outcomes) metric — a Cluster Validity Index (CVI) for evaluating density-based clustering results, including explicit evaluation of noise point quality. Translated from the original Python reference implementation by Beer, Krieger, Weber et al. (2025).

License: MIT R >= 3.6 Paper


Table of Contents


Overview

discoCVI evaluates clustering quality without ground truth labels by using density-connectivity distance (dc-distance) instead of Euclidean distance. It is the first CVI that:

  • Handles arbitrary-shaped clusters (not just convex/spherical)
  • Explicitly evaluates noise point quality (not just cluster quality)
  • Returns bounded, interpretable scores in [−1, 1]

Why DISCO?

Traditional CVIs like Silhouette and Davies-Bouldin assume convex clusters and ignore noise. On non-convex data, they rank the wrong algorithm higher:

AlgorithmSilhouetteDavies-BouldinDISCO
DBSCAN (correct)−0.16 ❌2.8 (bad) ❌0.68 ✅
K-Means (wrong)0.37 ✅0.6 (good) ✅0.22 ❌

DISCO correctly identifies DBSCAN as better on non-convex data.


Installation

# Install from GitHub
devtools::install_github("aminentezari/discoCVI")

Dependencies (installed automatically):

  • FNN — fast k-nearest neighbor search

Quick Start

library(discoCVI)
library(dbscan)

# 1. Generate synthetic data
data   <- make_circles(n_samples = 300, noise = 0.05, factor = 0.4, random_state = 42)
X      <- data$X

# 2. Apply clustering
db_labels <- dbscan(X, eps = 0.2, minPts = 5)$cluster - 1L  # 0 -> -1 for noise
km_labels <- kmeans(X, centers = 2, nstart = 10)$cluster - 1L

# 3. Evaluate
disco_score(X, db_labels)   # -> 0.6900921813
disco_score(X, km_labels)   # -> 0.0270091908

# 4. Per-point scores
scores <- disco_samples(X, db_labels)
summary_disco_scores(scores)

# 5. Visualise
plot_disco_scores(X, db_labels, scores, main = "DBSCAN - DISCO Scores")

# 6. Compare algorithms
compare_clusterings(X, list(DBSCAN = db_labels, KMeans = km_labels))

Function Reference

Core Functions

disco_score(X, labels, min_points = 5)

Returns a single number summarising overall clustering quality.

ArgumentTypeDescription
Xmatrix / data.framen x p feature matrix
labelsinteger vectorCluster labels; use -1 for noise
min_pointsintegerk for dc-distance. Default: 5

Returns: Single numeric in [-1, 1].

disco_samples(X, labels, min_points = 5)

Returns per-point scores — useful for identifying problematic points.

scores     <- disco_samples(X, labels)
bad_points <- which(scores < 0)

p_cluster(X, labels, min_points, precomputed_dc_dists)

Silhouette-style scores using dc-distances. -1 is treated as a valid cluster here.

p_noise(X, labels, min_points, dc_dists)

Returns list(p_sparse, p_far) for noise points only.

compute_dc_distances(X, min_points = 5)

Returns the n x n DC-distance matrix directly.

D  <- compute_dc_distances(X)
s1 <- p_cluster(D, labels1, precomputed_dc_dists = TRUE)
s2 <- p_cluster(D, labels2, precomputed_dc_dists = TRUE)

Utility Functions

FunctionDescription
make_circles(n, noise, factor, seed)Concentric circles dataset
make_moons(n, noise, seed)Two interleaving crescents
make_blobs(n, centers, std, seed)Gaussian blobs with variable density
plot_disco_scores(X, labels, scores)Red-yellow-green scatter plot
compare_clusterings(X, list(...))Rank multiple clusterings by DISCO
summary_disco_scores(scores)Mean, median, SD, quantiles, n_negative

Algorithm

Step 1 - Core Distance

k(x) = d_euclidean(x, x_{min_points-1})

Step 2 - Mutual Reachability Distance

dm(x, y) = max( k(x), k(y), d_euclidean(x, y) )

Step 3 - DC-Distance via MST

dc(x, y) = max edge weight on minimax path x -> y in MST

Step 4 - DISCO Score per Point

Cluster points:

a = mean dc-distance to own cluster (excluding self)
b = min over other clusters of mean dc-distance
p(x) = (b - a) / max(a, b)

Noise points:

p_sparse = (k(n) - k_max(C)) / max(k(n), k_max(C))
p_far    = (dc_min(n,C) - k_max(C)) / max(dc_min(n,C), k_max(C))
p(n)     = min(p_sparse, p_far)

Overall:

DISCO = (1/n) * sum(p(x_i))

Implementation Notes

This R package is a line-by-line translation of the Python reference (disco.py + dctree.py). Three bugs were found and fixed:

Bug 1 - Core Distance Off-by-One

Python includes self (dist=0) in k-NN, so effective neighbor is k-1. FNN::get.knn excludes self:

k          <- min_points - 1
knn_result <- FNN::get.knn(X, k = k)
core_dists <- knn_result$nn.dist[, k]

Bug 2 - MST Tie-Breaking

igraph::mst breaks equal-weight edges differently than Python's Prim. Fixed by porting Python's exact _get_mst_edges() (dctree.py lines 401-438).

Bug 3 - K-Means Cross-Language Non-Determinism

R and Python RNGs produce different K-Means partitions from the same seed. Fixed by exporting Python labels as CSV and loading in R:

# Python
pd.DataFrame({"km_label": km.labels_}).to_csv("python_km_labels.csv", index=False)
# R
km_labels <- as.integer(read.csv("python_km_labels.csv")$km_label)

After all fixes, R and Python produce identical scores to 10 decimal places:

DISCO - DBSCAN  : 0.6900921813  (R == Python)
DISCO - K-Means : 0.0270091908  (R == Python)

Experimental Results

Eight benchmark datasets validated. All differences are within machine epsilon (< 1e-14).

DatasetnClustersDBSCANK-MeansMatch
Concentric Circles30020.69009218130.0270091908R == Python
Two Moons100020.81795983770.2832395719R == Python
3-Spiral31230.5836514046-0.0014176359R == Python
Complex9303190.42094645140.0551096775R == Python
Complex8255180.07186876520.0122574749R == Python
Dartboard1100040.8743426861-0.0033636711R == Python
Blobs45030.85489531630.8575691834R == Python
Diagonal Blobs60030.72155362530.7043253397R == Python

Note: Negative K-Means scores on Dartboard1 and 3-Spiral are correct and expected — DISCO penalises algorithms that cut through ring or spiral structures.


File Structure

discoCVI/
├── R/
│   └── disco.R              <- All functions (single combined file)
├── man/                     <- Auto-generated by devtools::document()
├── DESCRIPTION
├── NAMESPACE                <- Auto-generated by devtools::document()
├── LICENSE
└── README.md

Results Interpretation

ScoreMeaning
0.7 - 1.0Excellent - well-separated, compact clusters
0.4 - 0.7Good - clear cluster structure
0.0 - 0.4Moderate - overlapping clusters
-0.3 - 0.0Poor - misassigned points
-1.0 - -0.3Very poor - clustering worse than random

Common Pitfall

# WRONG: dbscan() uses 0 for noise
labels <- dbscan(X, eps = 0.2)$cluster

# CORRECT: DISCO uses -1 for noise
labels <- dbscan(X, eps = 0.2)$cluster - 1L

Citation

@article{beer2025disco,
  title   = {DISCO: Internal Evaluation of Density-Based Clustering},
  author  = {Beer, Anna and Krieger, Lena and Weber, Pascal and
             Ritzert, Martin and Assent, Ira and Plant, Claudia},
  journal = {arXiv preprint arXiv:2503.00127},
  year    = {2025}
}

R Package:

Amin Entezari (2025). discoCVI: Implementation of the DISCO Metric for
Internal Clustering Evaluation. R package.
https://github.com/aminentezari/discoCVI

License

MIT © Amin Entezari

Original Python implementation by Beer, Krieger, Weber, Ritzert, Assent, Plant (2025).

For questions: [email protected].

Metadata

Version

0.1.1

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows