MyNixOS website logo
Description

VALID Inference for Clusters Separation Testing.

Given a partition resulting from any clustering algorithm, the implemented tests allow valid post-clustering inference by testing if a given variable significantly separates two of the estimated clusters. Methods are detailed in: Hivert B, Agniel D, Thiebaut R & Hejblum BP (2022). "Post-clustering difference testing: valid inference and practical considerations", <arXiv:2210.13172>.

VALIDICLUST

R-CMD-check

Overview

VALIDICLUST (VALID Inference for CLUsters Separation Testing) is a package for ensuring valid inference after data clustering. This problem occurs when clustering forces differences between groups of observations to build clusters. This leads to an inflation of the Type I error rate, especially because the data is used twice: i) to build clusters, i.e., to form hypotheses, and ii) to perform the inference step. The 3 main functions of the package are:

  • test_selective_inference() following the work of Gao et al. [1] on selective inference for clustering

  • merge_selective_inference() for the merging selective test. In this method, all adjacent p-values are merged using the harmonic mean

  • test_multimod() uses the DipTest [2] to investigate the presence of a continuum between two estimated clusters for a given variable

Installation

Install the development version from GitHub

remotes::install_github("benhvt/VALIDICLUST")

Example

To illustrate our proposed method, we use the palmerpenguins dataset from the palmerpenguins package. To ensure that there are no truly separate groups of observations, we selected only the female penguins of the Adelie species, yielding 73 observations. On this dataset, we apply hierarchical clustering (on Euclidean distances with Ward linkage) to build 2 clusters. In this negative control dataset, we know the true, i.e., the non-existing separated group of observations. We therefore apply our 3 tests to each of the 4 numerical measurements to test for separation of the 2 clusters, and compare the resulting p-values to the p-values of the classical t-test.

data("penguins")
data <- subset(penguins, species == "Adelie" & sex == "female")
data <- data[,3:6]

# Clustering function 
hcl2 <- function(x){
  x <- scale(x, center = T, scale = T)
  distance <- dist(x, method = "euclidean")
  cah <- hclust(distance, method = "ward.D2")
  return(as.factor(cutree(cah, k=2)))
}

data$Cluster <- hcl2(data)

ggpairs(data, aes(colour = Cluster, fill = Cluster)) + 
  scale_colour_manual(values = c("#F1786D","#53888f")) + 
  scale_fill_manual(values = c("#F1786D","#53888f")) + 
  theme_minimal()

pval.inference.selective <- pval.merge <- pval.multimod <- pval.t.test <- rep(NA, 4)
for (i in 1:4){
  pval.inference.selective[i] <- test_selective_inference(as.matrix(data[,1:4]), 
                                                          k1=1, 
                                                          k2=2, 
                                                          g=i, 
                                                          cl_fun = hcl2, 
                                                          cl=data$Cluster)$pval
  pval.merge[i] <- merge_selective_inference(as.matrix(data[,1:4]),
                                             k1=1, 
                                             k2=2, 
                                             g=i, 
                                             cl_fun = hcl2,
                                             cl=data$Cluster)$pval
  pval.multimod[i] <- test_multimod(as.matrix(data[,1:4]),
                                    g=i,
                                    k1=1, 
                                    k2=2,
                                    cl = data$Cluster)$pval
  pval.t.test[i] <- t.test(data[data$Cluster == 1,i], data[data$Cluster==2, i])$p.value
}


pval.res <- rbind(pval.inference.selective, 
                  pval.merge,
                  pval.multimod,
                  pval.t.test)

colnames(pval.res) <- gsub("_", x=colnames(data)[1:4], replacement = " ")
rownames(pval.res) <- c("Selective Inference",
                        "Merging Selective Inference",
                        "Multimoality test", 
                        "T-test")

round(pval.res,3) %>% htmlTable::htmlTable()
bill length mm bill depth mm flipper length mm body mass g
Selective Inference 0.156 0.716 0.484 0.438
Merging Selective Inference 0.153 0.76 0.492 0.48
Multimoality test 0.806 0.193 0.044 0.599
T-test 0.4 0 0 0

References

[1] Gao, L. L., Bien, J., & Witten, D. (2022). Selective inference for hierarchical clustering. Journal of the American Statistical Association, (just-accepted), 1-27.

[2] HARTIGAN, John A. et HARTIGAN, Pamela M. The dip test of unimodality. The annals of Statistics, 1985, p. 70-84.

Metadata

Version

0.1.0

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows