MyNixOS website logo
Description

Dunning t-digest for online quantile estimation.

A pure functional implementation of the Dunning t-digest data structure (merging digest variant, K1 arcsine scale function) using finger trees with four-component monoidal measures for O(log n) insertion and queries. . Also provides a mutable variant backed by mutable vectors in the ST monad. . The t-digest provides streaming, mergeable, memory-bounded approximation of quantile (percentile) queries with high accuracy in the tails. . Features: . * O(log n) insertion via split-by-mean (no buffering needed) * O(log n) quantile queries via split-by-cumulative-weight * O(log n) CDF queries via split-by-mean * O(δ log n) compression via split-based greedy merge * O(1) total weight, centroid count, and chunk mean computation * Mutable variant with O(1) amortized insertion via buffering

dunning-t-digest

Streaming quantile estimation with bounded space and high tail accuracy.

When you have millions of latency measurements, sensor readings, or financial tick data flowing through a pipeline, you often need to answer questions like "what was the 99th-percentile response time?" without storing every observation. The t-digest is a compact sketch data structure that answers such questions in bounded memory, with accuracy that is highest precisely where it matters most — in the extreme tails of the distribution.

This package provides two Haskell implementations of the merging t-digest with the K₁ (arcsine) scale function, as described by Dunning (2021) and Dunning & Ertl (2019):

  • Data.Sketch.TDigest — a purely functional implementation backed by a finger tree with a four-component monoidal measure, providing O(log n) insertion, query, and merge operations with no mutable state.

  • Data.Sketch.TDigest.Mutable — a mutable implementation in the ST monad using buffer-and-flush with greedy merge over mutable vectors, providing O(1) amortised insertion for high-throughput ingestion.

Both variants guarantee O(δ) space usage regardless of how many data points are ingested, where δ is a user-chosen compression parameter (default 100).

Quick start

import Data.Sketch.TDigest

main :: IO ()
main = do
  let td = foldl' (flip add) empty [1.0 .. 100000.0]
  print (quantile 0.99 td)   -- 99th percentile
  print (quantile 0.999 td)  -- 99.9th percentile
  print (cdf 50000.0 td)     -- fraction ≤ 50000

The mutable variant is useful for tight inner loops:

import Data.Sketch.TDigest.Mutable
import Control.Monad (forM_)

result :: Maybe Double
result = runTDigest $ do
  td <- new
  forM_ [1.0 .. 100000.0] $ \v -> add v td
  quantile 0.99 td

Why another t-digest on Hackage?

The existing tdigest package is a fine implementation. This package differs in several ways that may or may not matter for your use case:

dunning-t-digesttdigest
Internal structureFinger tree (pure) / mutable vectors (ST)Balanced tree
Scale functionK₁ arcsineK₂
Mutable variantYes (ST monad)No
Module namespaceData.Sketch.*Data.TDigest.*
Monoidal measureFour-component (weight, count, maxMean, meanWeightSum) for O(1) chunk statistics

The choice of scale function affects the accuracy profile: K₁ provides tighter bounds at extreme quantiles (q < 0.01 or q > 0.99) at the cost of slightly looser bounds near the median, while K₂ distributes accuracy more uniformly. See §3 of Dunning & Ertl (2019) for a detailed comparison.

How it works

A t-digest maintains a sorted collection of centroids — (mean, weight) pairs that summarise clusters of nearby data points. The key idea is that centroids near the tails of the distribution (q ≈ 0 or q ≈ 1) are kept small, while centroids in the interior may grow large. This non-uniform resolution is governed by the scale function:

$$k_1(q, \delta) = \frac{\delta}{2\pi} \arcsin(2q - 1)$$

Two adjacent quantile positions q₀ and q₁ may share a centroid if and only if k₁(q₁) − k₁(q₀) ≤ 1. Because the arcsine function has infinite slope at 0 and 1, this forces centroids at the extremes to carry very little weight, yielding high relative accuracy for tail queries.

After each compression pass the digest holds at most O(δ) centroids. With the default δ = 100, this means roughly 100–300 centroids regardless of whether you have ingested ten thousand or ten billion data points.

The pure implementation

The pure module stores centroids in a finger tree — a general-purpose sequence data structure due to Hinze & Paterson (2006) — annotated with a four-component monoidal measure that enables:

  • Split-by-mean for O(log n) insertion of new points at the correct sorted position.
  • Split-by-cumulative-weight for O(log n) quantile queries.
  • O(1) chunk mean computation during compression, via the cached sum of mean × weight products.
  • O(1) centroid count and total weight without traversal.

Compression performs a single left-to-right greedy merge over the tree, yielding O(δ log n) amortised cost.

The mutable implementation

The mutable module follows the buffer-and-flush strategy recommended by Dunning & Ertl (2019): incoming values are appended to an unsorted buffer in O(1) amortised time. When the buffer reaches capacity (5δ), it is flushed into the sorted centroid array via insertion sort followed by a single-pass greedy merge. Prefix sums are maintained for O(log n) binary-search queries.

The ST monad provides true in-place mutation with rank-2 type safety — no mutable reference can escape the runTDigest block — while avoiding the overhead of persistent data structures.

Part of a larger project

This package is one of 28 language implementations of the merging t-digest maintained in the NadiaYvette/t-digest repository. The full set spans Ada, C, C++, C#, Common Lisp, D, Elixir, Erlang, Fortran, Go, Haskell, Java, Julia, Kotlin, Lua, Mercury, Nim, OCaml, Perl, Prolog, Python, R, Ruby, Rust, Scheme, Standard ML, Swift, and Zig. Twenty-two of the mutable implementations store centroids in array-backed 2-3-4 trees for cache-friendly O(log n) worst-case operations.

API overview

Pure (Data.Sketch.TDigest)

FunctionDescription
empty, emptyWithCreate a digest (default δ = 100)
add, addWeightedInsert a value
quantileEstimate value at quantile q ∈ [0, 1]
cdfEstimate CDF at a given value
compressForce a compression pass
mergeMerge two digests
totalWeight, centroidCountSummary statistics
centroidListExtract centroids for serialisation
fromComponentsReconstruct from components
getDelta, getMin, getMaxAccessors

Mutable (Data.Sketch.TDigest.Mutable)

FunctionDescription
new, newWithCreate a mutable digest in ST
add, addWeightedInsert a value (O(1) amortised)
quantile, cdfQuery (auto-compresses)
compressForce a flush-and-merge cycle
mergeMerge a pure digest into a mutable one
freeze, thawConvert between pure and mutable
runTDigestRun an ST computation to a pure result

References

  • T. Dunning (2021). The t-digest: Efficient estimates of distributions. Software Impacts 7, 100049. doi:10.1016/j.simpa.2020.100049

  • T. Dunning & O. Ertl (2019). Computing extremely accurate quantiles using t-digests. arXiv:1902.04023. arxiv.org/abs/1902.04023

  • M. Greenwald & S. Khanna (2001). Space-efficient online computation of quantile summaries. SIGMOD '01. doi:10.1145/375663.375670

  • J.I. Munro & M.S. Paterson (1980). Selection and sorting with limited storage. Theoretical Computer Science 12(3), 315–323. doi:10.1016/0304-3975(80)90061-4

  • R. Hinze & R. Paterson (2006). Finger trees: A simple general-purpose data structure. Journal of Functional Programming 16(2), 197–217. doi:10.1017/S0956796805005769

License

BSD-3-Clause. Copyright (c) 2025 Nadia Yvette Chambers.

Metadata

Version

0.1.0.1

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows