Description
Sequential Outlier Identification for Model-Based Clustering.
Sequential outlier identification for Gaussian mixture models using the distribution of Mahalanobis distances. Candidate outliers are removed one at a time, and the number of outliers is chosen by minimizing the dissimilarity between the theoretical and observed distributions of the scaled squared sample Mahalanobis distances. The package also includes an extension for Gaussian linear cluster-weighted models, which uses the distribution of studentized residuals. See Doherty, McNicholas, and White (2025) <doi:10.48550/arXiv.2505.11668>.
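The idea in the description can be sketched in base R for the simplest case of a single Gaussian component. This is an illustration under stated assumptions, not the package's implementation: the helper names (`scaled_sq_mahal`, `beta_dissimilarity`, `sequential_removal`) are hypothetical, the dissimilarity used here (mean absolute gap between the empirical CDF and the Beta CDF) is a stand-in, and the weighting needed for mixtures with several components is omitted. It relies on the standard result that, for n observations from a p-variate normal distribution, n / (n - 1)^2 times the squared sample Mahalanobis distance follows a Beta(p / 2, (n - p - 1) / 2) distribution.

```r
# Illustrative sketch only: one Gaussian component, hypothetical helper names.
set.seed(1)

# Scaled squared sample Mahalanobis distances, n / (n - 1)^2 * D^2,
# computed with the sample mean and the unbiased sample covariance.
scaled_sq_mahal <- function(x) {
  n <- nrow(x)
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  n / (n - 1)^2 * d2
}

# Stand-in dissimilarity (an assumption, not the package's measure):
# mean absolute gap between the empirical CDF of the scaled distances
# and the theoretical Beta(p / 2, (n - p - 1) / 2) CDF.
beta_dissimilarity <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  s <- sort(scaled_sq_mahal(x))
  mean(abs(ecdf(s)(s) - pbeta(s, p / 2, (n - p - 1) / 2)))
}

# Remove the most distant remaining point max_out times, recording the
# dissimilarity before each removal and once more after the last one.
sequential_removal <- function(x, max_out) {
  keep <- rep(TRUE, nrow(x))
  dissim <- numeric(max_out + 1)
  removed <- integer(max_out)
  for (i in seq_len(max_out + 1)) {
    current <- x[keep, , drop = FALSE]
    dissim[i] <- beta_dissimilarity(current)
    if (i <= max_out) {
      worst <- which(keep)[which.max(scaled_sq_mahal(current))]
      keep[worst] <- FALSE
      removed[i] <- worst
    }
  }
  best <- which.min(dissim) - 1 # number of removals at minimum dissimilarity
  list(
    dissimilarity = dissim,
    num_outliers = best,
    outliers = removed[seq_len(best)]
  )
}

# 200 standard bivariate normal points plus 5 planted distant points.
x <- rbind(
  matrix(rnorm(400), ncol = 2),
  matrix(runif(10, min = 5, max = 8), ncol = 2)
)
res <- sequential_removal(x, max_out = 15)
res$num_outliers
```

Plotting `res$dissimilarity` against the number of removed points gives a curve whose minimum marks the selected number of outliers, which is the role the "minimum dissimilarity" note plays in the printed output of the package.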
README.md
outlierMBC
Ultán P. Doherty 2025-05-08
Gaussian Mixture Models
library(outlierMBC)
library(dplyr)
library(ggplot2)
gross_gmm_k3n1000o10 <- find_gross(gmm_k3n1000o10[, 1:2], max_out = 20)
ombc_gmm_k3n1000o10 <- ombc_gmm(
  gmm_k3n1000o10[, 1:2],
  comp_num = 3, max_out = 20, gross_outs = gross_gmm_k3n1000o10$gross_bool
)
print(ombc_gmm_k3n1000o10)
## Starting number of data points: 1010
## Maximum number of outliers: 20
## Number of gross outliers: 5
## Final number of outliers: 10 (minimum dissimilarity)
plot(ombc_gmm_k3n1000o10)
gmm_k3n1000o10 |>
  mutate(ombc = as.factor(ombc_gmm_k3n1000o10$labels), G = as.factor(G)) |>
  ggplot(aes(x = X1, y = X2, colour = ombc, shape = G)) +
  geom_point() +
  labs(colour = "outlierMBC", shape = "Simulation") +
  ggokabeito::scale_colour_okabe_ito(order = c(9, 1:3))
Linear Cluster-Weighted Models
gross_lcwm_k3n1000o10 <- find_gross(lcwm_k3n1000o10[, 1:2], max_out = 20)
ombc_lcwm_k3n1000o10 <- ombc_gmm(
  lcwm_k3n1000o10[, 1:2],
  comp_num = 3, max_out = 20, gross_outs = gross_lcwm_k3n1000o10$gross_bool
)
print(ombc_lcwm_k3n1000o10)
## Starting number of data points: 1010
## Maximum number of outliers: 20
## Number of gross outliers: 0
## Final number of outliers: 10 (minimum dissimilarity)
plot(ombc_lcwm_k3n1000o10)
lcwm_k3n1000o10 |>
  mutate(ombc = as.factor(ombc_lcwm_k3n1000o10$labels), G = as.factor(G)) |>
  ggplot(aes(x = X1, y = Y, colour = ombc, shape = G)) +
  geom_point() +
  labs(colour = "outlierMBC", shape = "Simulation") +
  ggokabeito::scale_colour_okabe_ito(order = c(9, 1:3))
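The studentized-residual idea mentioned in the package description can be illustrated in base R with a single linear regression. This is a rough sketch, not the package's implementation: it uses one component rather than a cluster-weighted mixture, the helper name `resid_dissimilarity` is hypothetical, and the dissimilarity (mean absolute gap between the empirical and theoretical CDFs) is a stand-in. It uses the standard result that, for a normal linear model with p coefficients fitted to n observations, the squared internally studentized residual divided by n - p follows a Beta(1 / 2, (n - p - 1) / 2) distribution.

```r
# Illustrative sketch only: a single regression line, hypothetical helper name.
set.seed(2)

# Stand-in dissimilarity between the scaled squared studentized residuals
# of a fitted lm object and their theoretical Beta distribution.
resid_dissimilarity <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit))
  r <- rstandard(fit) # internally studentized residuals
  s <- sort(r^2 / (n - p))
  mean(abs(ecdf(s)(s) - pbeta(s, 1 / 2, (n - p - 1) / 2)))
}

# Simulated data from one linear component with Gaussian noise.
n <- 300
sim <- data.frame(X1 = rnorm(n))
sim$Y <- 1 + 2 * sim$X1 + rnorm(n)

fit <- lm(Y ~ X1, data = sim)
scaled_resid <- rstandard(fit)^2 / (n - length(coef(fit)))
dissim <- resid_dissimilarity(fit)
dissim
```

For data that really come from the fitted model, this value should be small; contaminating `sim$Y` with a few extreme responses and refitting is one way to see how the gap responds to outliers.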