Description

A high-performance HTML tokenizer.

This package provides a fast and reasonably robust HTML5 tokenizer built on the attoparsec library. Its parsing strategy follows the HTML5 parsing specification, with few deviations.

For instance,

>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [],
TagOpen "h1" [Attr "class" "widget"],
ContentText "Hello World",
TagClose "h1",
TagSelfClose "br" []]
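
Because the result is a flat list of tokens rather than a tree, a consumer typically folds or filters over it. The following is a minimal sketch of that idea, not the package's own code: `Token` and `Attr` here are local stand-ins that mirror only the constructors visible in the output above, simplified to `String` so the block needs nothing beyond `base`. The real package exposes its types from `Text.HTML.Parser`, uses `Text`, and defines further constructors. `textContent` and `unclosedTags` are hypothetical helper names.

```haskell
-- Local stand-ins for the token shapes shown in the example output above.
-- Assumption: simplified to String; the package itself uses Text and
-- defines further constructors that are not modelled here.
data Attr = Attr String String
  deriving (Show, Eq)

data Token
  = TagOpen String [Attr]
  | TagSelfClose String [Attr]
  | TagClose String
  | ContentText String
  deriving (Show, Eq)

-- The token list from the example above.
exampleTokens :: [Token]
exampleTokens =
  [ TagOpen "div" []
  , TagOpen "h1" [Attr "class" "widget"]
  , ContentText "Hello World"
  , TagClose "h1"
  , TagSelfClose "br" []
  ]

-- Concatenate all text content, ignoring tags.
textContent :: [Token] -> String
textContent tokens = concat [t | ContentText t <- tokens]

-- Names of tags that were opened but never closed, innermost first.
-- Hypothetical helper: a tokenizer does not balance tags, so a consumer
-- that cares about nesting has to track it itself.
unclosedTags :: [Token] -> [String]
unclosedTags = foldl step []
  where
    -- A close tag matching the innermost open tag pops it off the stack.
    step (open : rest) (TagClose name) | open == name = rest
    -- An open tag is pushed onto the stack.
    step stack (TagOpen name _) = name : stack
    -- Everything else (text, self-closing tags, stray closes) is ignored.
    step stack _ = stack
```

Loaded into GHCi, the sketch behaves as follows on the example tokens. Note that the `<div>` in the example input is never closed, and the tokenizer reports the tokens as they appear without repairing that:

>>> textContent exampleTokens
"Hello World"
>>> unclosedTags exampleTokens
["div"]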

The package targets similar use-cases to the venerable tagsoup library, but is significantly more efficient, achieving parsing speeds of over 80 megabytes per second on modern hardware and typical web documents. Here are some typical performance numbers taken from parsing a Wikipedia article of moderate length:

benchmarking Forced/tagsoup fast Text
time                 186.1 ms   (175.3 ms .. 194.6 ms)
0.999 R²   (0.995 R² .. 1.000 R²)
mean                 191.7 ms   (188.9 ms .. 198.3 ms)
std dev              5.053 ms   (1.092 ms .. 6.809 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 189.7 ms   (182.8 ms .. 197.7 ms)
0.999 R²   (0.998 R² .. 1.000 R²)
mean                 196.5 ms   (193.1 ms .. 202.1 ms)
std dev              5.481 ms   (2.141 ms .. 7.383 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 15.81 ms   (15.75 ms .. 15.89 ms)
1.000 R²   (1.000 R² .. 1.000 R²)
mean                 15.72 ms   (15.66 ms .. 15.77 ms)
std dev              140.9 μs   (113.6 μs .. 174.5 μs)
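
To put the three runs side by side, the ratio of the reported times can be worked out directly. This is only arithmetic on the "time" rows printed above; the names below are illustrative, and the result says nothing about other documents or hardware.

```haskell
-- Point estimates from the "time" rows of the benchmark output above,
-- in milliseconds.
tagsoupFastMs, tagsoupNormalMs, htmlParserMs :: Double
tagsoupFastMs   = 186.1
tagsoupNormalMs = 189.7
htmlParserMs    = 15.81

-- Ratio of a slower timing to a faster one.
speedup :: Double -> Double -> Double
speedup slower faster = slower / faster
```

Evaluating `speedup tagsoupFastMs htmlParserMs` gives roughly 11.8, and `speedup tagsoupNormalMs htmlParserMs` gives roughly 12.0, so on this one document the run labelled html-parser is about twelve times faster than either tagsoup configuration.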

Metadata

Version

0.2.1.0

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows