Description

Access Unicode Character Database (UCD)

unicode-data provides Haskell APIs to efficiently access the Unicode character database (UCD). Performance is the primary goal in the design of this package.

The Haskell data structures are generated programmatically from the UCD files. The latest Unicode version supported by this library is 15.0.0.

README

Please see the Haddock documentation for reference documentation.
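
As a rough illustration of how the API might be used, the sketch below calls a few of the functions named in the benchmark listing on this page. It assumes unicode-data is listed in build-depends; the describe helper is hypothetical and exists only for this example.

```haskell
-- Sketch only: assumes the unicode-data package is available to the build.
import qualified Unicode.Char.Case.Compat as Case
import qualified Unicode.Char.General as General
import qualified Unicode.Char.General.Compat as Compat

-- Hypothetical helper: summarise a few properties of one character.
describe :: Char -> String
describe c = unwords
  [ show c
  , show (General.generalCategory c)
  , "alpha=" ++ show (Compat.isAlpha c)
  , "punct=" ++ show (General.isPunctuation c)
  , "upper=" ++ show (Case.toUpper c)
  ]

main :: IO ()
main = mapM_ (putStrLn . describe) "aZ9! "
```

The imports are qualified so that the names do not collide with the similarly named functions in Data.Char from base.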

Performance

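unicode-data is up to 5 times faster than base. In the results below, most predicates take about 0.18x to 0.22x of the time base takes, case conversions about 0.32x to 0.34x, isSpace 0.49x, and generalCategory 0.87x.

The following benchmark compares the time taken, in milliseconds, to process all the Unicode code points with base-4.16 (GHC 9.2.1) and with this package (v0.3). The factor after each unicode-data timing is that time relative to base. Machine: 8 × AMD Ryzen 5 2500U on Linux.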

All
  Unicode.Char.Case.Compat
    isLower
      base:           OK (1.53s)
         24 ms ± 3.8 ms
      unicode-data:   OK (2.25s)
        4.4 ms ±  88 μs, 0.19x
    isUpper
      base:           OK (1.50s)
         24 ms ± 450 μs
      unicode-data:   OK (2.37s)
        4.7 ms ± 200 μs, 0.19x
    toLower
      base:           OK (1.40s)
         22 ms ± 1.8 ms
      unicode-data:   OK (1.89s)
        7.2 ms ± 297 μs, 0.32x
    toTitle
      base:           OK (1.25s)
         20 ms ± 2.0 ms
      unicode-data:   OK (1.65s)
        6.4 ms ± 509 μs, 0.32x
    toUpper
      base:           OK (1.26s)
         20 ms ± 2.5 ms
      unicode-data:   OK (1.72s)
        6.8 ms ± 335 μs, 0.34x
  Unicode.Char.General
    generalCategory
      base:           OK (2.02s)
        134 ms ± 1.6 ms
      unicode-data:   OK (1.75s)
        116 ms ± 1.6 ms, 0.87x
    isAlphaNum
      base:           OK (1.53s)
         24 ms ± 1.7 ms
      unicode-data:   OK (2.16s)
        4.2 ms ±  29 μs, 0.18x
    isControl
      base:           OK (1.47s)
         23 ms ± 2.6 ms
      unicode-data:   OK (2.23s)
        4.4 ms ±  22 μs, 0.19x
    isMark
      base:           OK (1.47s)
         23 ms ± 624 μs
      unicode-data:   OK (2.28s)
        4.5 ms ±  48 μs, 0.19x
    isPrint
      base:           OK (1.53s)
         25 ms ± 2.4 ms
      unicode-data:   OK (2.27s)
        4.4 ms ±  50 μs, 0.18x
    isPunctuation
      base:           OK (1.51s)
         24 ms ± 459 μs
      unicode-data:   OK (2.24s)
        4.4 ms ±  25 μs, 0.18x
    isSeparator
      base:           OK (1.52s)
         24 ms ± 407 μs
      unicode-data:   OK (2.43s)
        4.8 ms ±  94 μs, 0.20x
    isSymbol
      base:           OK (1.49s)
         24 ms ± 863 μs
      unicode-data:   OK (1.34s)
        5.2 ms ±  92 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base:           OK (1.46s)
         23 ms ± 322 μs
      unicode-data:   OK (2.14s)
        4.1 ms ±  36 μs, 0.18x
    isLetter
      base:           OK (1.44s)
         22 ms ± 640 μs
      unicode-data:   OK (2.17s)
        4.3 ms ±  58 μs, 0.19x
    isSpace
      base:           OK (1.44s)
         11 ms ± 1.2 ms
      unicode-data:   OK (1.36s)
        5.3 ms ± 243 μs, 0.49x
  Unicode.Char.Numeric
    isNumber
      base:           OK (1.52s)
         24 ms ± 368 μs
      unicode-data:   OK (2.41s)
        4.7 ms ±  41 μs, 0.19x
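
To make "process all the Unicode code points" concrete, here is a minimal sketch of that kind of workload using only base. It is not the project's benchmark harness: countMatching is a hypothetical helper, and the CPU-time measurement is a crude stand-in for a real benchmarking library.

```haskell
import Control.Exception (evaluate)
import Data.Char (isLower)
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Hypothetical helper: count the code points that satisfy a predicate,
-- visiting every Char from minBound to maxBound.
countMatching :: (Char -> Bool) -> Int
countMatching p = foldl' step 0 [minBound .. maxBound]
  where
    step n c = if p c then n + 1 else n

main :: IO ()
main = do
  start <- getCPUTime                       -- CPU time in picoseconds
  n <- evaluate (countMatching isLower)     -- force the count before stopping the clock
  end <- getCPUTime
  let ms = fromIntegral (end - start) / 1e9 :: Double
  putStrLn (show n ++ " lowercase code points, " ++ show ms ++ " ms CPU time")
```

Replacing the Data.Char import with the isLower exported by Unicode.Char.Case.Compat would exercise this package instead of base. A single run like this is noisy, so it only hints at the difference the listing above measures.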

Unicode database version update

To update the Unicode version, change the version number in ucd.sh.

To download the Unicode database, run ucd.sh download from the top-level directory of the repo. This fetches the database into ./ucd.

$ ./ucd.sh download

To generate the Haskell data structure files from the downloaded database files, run ucd.sh generate from the top-level directory of the repo.

$ ./ucd.sh generate

Running property doctests

Temporarily add QuickCheck to the build-depends of the library, then run:

$ cabal build
$ cabal-docspec --check-properties --property-variables c

Licensing

unicode-data is an open source project available under a liberal Apache-2.0 license.

Contributing

As an open project, we welcome contributions.

Metadata

Version

0.4.0.1

License

Apache-2.0

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows