Description
Methods for Clustering Mixed-Type Data.
Implements methods for clustering mixed-type data, specifically combinations of continuous and nominal data. Special attention is paid to the often-overlooked problem of equitably balancing the contribution of the continuous and categorical variables. This package implements KAMILA clustering, a novel method for clustering mixed-type data in the spirit of k-means clustering. It does not require dummy coding of variables, and is efficient enough to scale to rather large data sets. Also implemented is Modha-Spangler clustering, which uses a brute-force strategy to maximize the cluster separation simultaneously in the continuous and categorical variables. For more information, see Foss, Markatou, Ray, & Heching (2016) <doi:10.1007/s10994-016-5575-7> and Foss & Markatou (2018) <doi:10.18637/jss.v083.i13>.
README.md
kamila
An R package for clustering mixed-type data (combinations of continuous and nominal variables). For more information, install the package and run
library(kamila)
?`kamila-package`
at the R console. For an in-depth discussion of the challenges involved in clustering mixed-type data, please see our papers:
- Foss, Markatou, Ray, and Heching (2016). A semiparametric method for clustering mixed data. Machine Learning, 105(3), 419-458. DOI: 10.1007/s10994-016-5575-7
- Foss and Markatou (2018). kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). DOI: 10.18637/jss.v083.i13
- Foss, Markatou, and Ray (2018). Distance Metrics and Clustering Methods for Mixed-Type Data. International Statistical Review. DOI: 10.1111/insr.12274
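
As a quick illustration of the KAMILA clustering mentioned above, the following is a minimal sketch rather than a definitive recipe. It assumes the package is already installed and that `kamila()` accepts a data frame of continuous variables (`conVar`), a data frame of factors (`catFactor`), the number of clusters (`numClust`), and the number of random initializations (`numInit`). The synthetic data, the variable names (`conDf`, `catDf`, `trueID`), and the choice of two clusters are illustrative assumptions, not part of the package.

```r
# Minimal sketch: cluster a small synthetic mixed-type data set with kamila().
# Assumes the kamila package is installed (e.g. via install.packages("kamila")).
library(kamila)

set.seed(1)
n <- 100  # observations per true group (arbitrary, for illustration only)

# Two continuous variables whose means differ between the two groups.
conDf <- data.frame(
  x1 = c(rnorm(n, mean = -2), rnorm(n, mean = 2)),
  x2 = c(rnorm(n, mean = -2), rnorm(n, mean = 2))
)
# Standardize the continuous variables before clustering.
conDf <- data.frame(scale(conDf))

# Two nominal variables whose level frequencies differ between the groups.
lev <- c("a", "b", "c")
catDf <- data.frame(
  f1 = factor(c(sample(lev, n, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                sample(lev, n, replace = TRUE, prob = c(0.1, 0.1, 0.8))),
              levels = lev),
  f2 = factor(c(sample(lev, n, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
                sample(lev, n, replace = TRUE, prob = c(0.1, 0.1, 0.8))),
              levels = lev)
)

# The factors are passed as-is; no dummy coding is required.
kamRes <- kamila(conVar = conDf, catFactor = catDf, numClust = 2, numInit = 10)

# Cross-tabulate the estimated memberships against the true group labels.
trueID <- rep(1:2, each = n)
print(table(kamRes$finalMemb, trueID))
```

Cluster labels are arbitrary, so a good result shows up as one large count per row of the table, not necessarily on the diagonal.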
Update (May 11, 2020): data.frame calls updated for compatibility with R v4.x.x.