Constructing Hierarchical Voronoi Tessellations and Overlay Heatmaps for Data Analysis.
HVT: Collection of functions used to build hierarchical topology preserving maps
Zubin Dowlaty
2024-05-02
1. Abstract
The HVT package is a collection of R functions that facilitate building topology preserving maps for rich multivariate data analysis; see Figure 1 for an example of a 2D torus map generated with the package. The package is geared towards big data, that is, data with a large number of rows. The functions are organized around the following typical workflow:
- Data Compression: Vector quantization (VQ) or hierarchical vector quantization (HVQ), using means or medians. This step compresses the rows (long data frame) using a compression objective.
- Data Projection: Dimension projection of the compressed cells to 1D, 2D, or an interactive surface plot using Sammon's non-linear mapping algorithm. This step creates the topology preserving map coordinates (also called an embedding) in the desired output dimension.
- Tessellation: Creates the cells required for object visualization using the Voronoi tessellation method. The package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, which is useful for semi-supervised tasks.
- Scoring: Scores new data sets and records their assignment using the map objects from the above steps, in a sequence of maps if required.
- Temporal Analysis and Visualization: A collection of new functions that extends the HVT package to time series data by analyzing its underlying patterns, calculating transition probabilities, and visualizing the flow of data over time.
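The compression and projection steps above can be sketched with base R and the MASS package. This is a minimal illustration under stated assumptions, not the HVT package's own implementation: the toy data, the choice of 15 cells, and all variable names are hypothetical, and plain k-means stands in for the package's vector quantization.

```r
# Minimal sketch of compression + projection using base R and MASS.
# This does NOT call HVT functions; k-means stands in for vector quantization.
library(MASS)  # provides sammon()

set.seed(42)

# Hypothetical toy data: 500 rows, 3 numeric columns.
x <- matrix(rnorm(1500), ncol = 3, dimnames = list(NULL, c("x", "y", "z")))

# Data compression: quantize the rows into a small number of cells.
n_cells <- 15
vq <- kmeans(x, centers = n_cells, nstart = 5, iter.max = 50)
centroids <- vq$centers  # one row per cell
cell_id <- vq$cluster    # cell assignment for every input row

# Data projection: Sammon's non-linear mapping of the centroids to 2D.
proj <- sammon(dist(centroids), k = 2, trace = FALSE)
coords_2d <- proj$points  # n_cells x 2 embedding coordinates

# Value a heatmap overlay could display: mean of variable z within each cell.
cell_mean_z <- tapply(x[, "z"], cell_id, mean)
```

The 2D coordinates in `coords_2d` are the kind of input a Voronoi tessellation would then be built on, with `cell_mean_z` colouring each cell.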
The HVT package produces tessellations that show how topology preserving maps capture the structure of the data. The image below depicts a tessellation of a torus; see the vignette for more details.
Figure 1: The Voronoi tessellation for layer 1 and number of cells 900 with the heat map overlaid for variable z.
2. Vignettes
Following are the links to the vignettes for the HVT package:
2.1 HVT Vignette
HVT Vignette: Contains descriptions of the functions used for vector quantization and construction of hierarchical Voronoi tessellations for data analysis.
2.2 HVT Model Diagnostics Vignette
HVT Model Diagnostics Vignette: Contains descriptions of the functions used to perform model diagnostics and validation for the HVT model.
2.3 HVT Scoring Cells with Layers using scoreLayeredHVT
HVT Scoring Cells with Layers using scoreLayeredHVT: Contains descriptions of the functions used for scoring cells with layers based on a sequence of maps using scoreLayeredHVT.
2.4 Temporal Analysis and Visualization: Leveraging Time Series Capabilities in HVT
Temporal Analysis and Visualization: Leveraging Time Series Capabilities in HVT: Contains descriptions of the functions used for analyzing time series data and plotting its flow maps.
3. Version History
HVT (v24.5.2)
2nd May, 2024
In this version of the HVT package, the following new features have been introduced:
- Updated Nomenclature: To make the function names more consistent and intuitive, we have renamed functions throughout the package. A few instances are given below.
  - HVT to trainHVT
  - predictHVT to scoreHVT
  - predictLayerHVT to scoreLayeredHVT
- Restructured Functions: The functions have been rearranged and grouped into new sections, which are highlighted on the index page of the package’s PDF documentation. A few instances are given below.
  - The trainHVT function now resides within the Training_or_Compression section.
  - The plotHVT function now resides within the Tessellation_and_Heatmap section.
  - The scoreHVT function now resides within the Scoring section.
- Enhancements: The pre-existing functions hvtHmap and exploded_hmap have been combined and incorporated into the plotHVT function. Additionally, plotHVT can now produce 1D plots.
- Temporal Analysis
  - This update integrates time series capabilities into the HVT package by extending its foundational operations to time series data, as described in this vignette.
  - The new functions analyze underlying patterns and trends in the data, providing insight into its evolution over time. They also trace the movement of the data by calculating transition probabilities, and they produce plots and GIFs.
  The new functions and brief descriptions of each are given below:
  - plotStateTransition: Provides the time series flowmap plot.
  - getTransitionProbability: Provides a list of transition probabilities.
  - reconcileTransitionProbability: Provides plots and tables for comparing transition probabilities calculated manually with those from the markovchain function.
  - plotAnimatedFlowmap: Creates flowmaps and animations for both the self-state and without-self-state scenarios.
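To make the idea of a transition probability concrete, here is a minimal base R sketch. It does not use the HVT functions listed above; the sequence of cell IDs is hypothetical, and reading "without self state" as "ignoring transitions in which the cell does not change" is an assumption.

```r
# Hypothetical sequence of cell IDs visited at consecutive time steps.
cells <- c(1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 3, 1)

# Pair each time step with the next one.
from <- head(cells, -1)
to <- tail(cells, -1)

# Count transitions, then normalize each row so it sums to 1.
counts <- table(from = from, to = to)
probs <- prop.table(counts, margin = 1)

# Assumed "without self state" variant: keep only steps where the cell changes.
moved <- from != to
counts_moved <- table(from = from[moved], to = to[moved])
probs_moved <- prop.table(counts_moved, margin = 1)
```

Each row of `probs` gives the estimated probability of moving from one cell to every other cell in the next time step.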
HVT (v23.11.02)
17th November, 2023
This version of the HVT package offers functionality to score cells with layers, based on a sequence of maps, using scoreLayeredHVT. The steps to create the successive set of maps are given below.
- Map A: The output of the trainHVT function trained on the parent data.
- Map B: The output of the trainHVT function trained on the 'data with novelty' created by the removeNovelty function.
- Map C: The output of the trainHVT function trained on the 'data without novelty' created by the removeNovelty function.
The scoreLayeredHVT function uses these three maps to score the test datapoints.
The diagram below illustrates these steps.
Figure 2: Data Segregation for scoring based on a sequence of maps using scoreLayeredHVT()
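The following base R sketch illustrates one way scoring against a sequence of maps could work. It is not the scoreLayeredHVT implementation: the centroid matrices, the `nearest_cell` helper, and the novelty cell index are all hypothetical, and the routing rule (test points that land in a novelty cell of Map A are scored on Map B, all others on Map C) is an assumption.

```r
# Hypothetical centroids (one row per cell) standing in for three trained maps.
map_a <- rbind(c(0, 0), c(1, 1), c(8, 8))       # parent data
map_b <- rbind(c(7.5, 8), c(8.5, 8))            # 'data with novelty'
map_c <- rbind(c(-0.5, 0), c(0.5, 0), c(1, 1))  # 'data without novelty'
novelty_cells_a <- 3  # cell of Map A assumed to be flagged as novelty

# Hypothetical helper: index of the closest centroid for every row of pts.
nearest_cell <- function(pts, centroids) {
  apply(pts, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
}

# Hypothetical test datapoints.
test_pts <- rbind(c(0.4, 0.1), c(8.4, 7.9), c(1.1, 0.9))

# Step 1: score on Map A to decide which map each point is routed to.
cell_a <- nearest_cell(test_pts, map_a)
use_b <- cell_a %in% novelty_cells_a

# Step 2: score each point on Map B or Map C accordingly (assumed routing).
scored <- data.frame(cell_a = cell_a,
                     layer = ifelse(use_b, "B", "C"),
                     cell = NA_integer_)
scored$cell[use_b] <- nearest_cell(test_pts[use_b, , drop = FALSE], map_b)
scored$cell[!use_b] <- nearest_cell(test_pts[!use_b, , drop = FALSE], map_c)
```

The resulting `scored` table records, for each test point, its Map A cell, the map it was routed to, and its cell on that map. The sketch assumes both routes receive at least one point.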
HVT (v22.12.06)
06th December, 2022
This version of the HVT package offers features for both training an HVT model and eliminating outlier cells from the trained model.
- Training or Compression: The initial step trains a model on the parent data using the trainHVT function, specifying the desired compression percentage and quantization error.
- Remove novelty cells: After training, outlier cells can be identified manually from the 2D HVT plot. These outlier cells can then be passed to the removeNovelty function, which returns two datasets: one containing 'data with novelty' and the other containing 'data without novelty'.
4. Installation of HVT (v24.5.2)
# Requires the devtools package; if it is missing, run install.packages("devtools") first.
library(devtools)
devtools::install_github(repo = "Mu-Sigma/HVT")