MyNixOS website logo
Description

Generation and traversal of compressed directed acyclic dawg graphs.

dawgdic

Overview

dawgdic is a library for building and accessing dictionaries implemented with directed acyclic word graphs (DAWG).

This is a ported version of C++ dawgdic library.

Features

This library offers DAWG, Dictionary, Guide and Completer data types as well as their builders.

  • DAWG represents word graph. It is provided as intermediate data structure.
  • Dictionary represents compact layout for DAWG. It offers API to check the presence of the word and associated value in the graph. Also it could be stored into a file and loaded from file.
  • Guide is being used as a tree-like index to get the faster navigation through the dictionary. Also could be stored into a file and loaded from file.

Input characters must be in range 0-255.

This port does not contain RankedGuide and RankedCompleter (yet).

Comparison with other DAWG libraries

  • dictionary file size, in bytes:
package/lexicon size (words)2.6K26K370K
dawg: Data.DAWG.Dynamic370868282366029933964
dawg: Data.DAWG.Static133138103057811128690
dawg-ord: Data.DAWG.IntN/AN/AN/A
dawg-ord: Data.DAWG.OrdN/AN/AN/A
packed-dawg: Data.DAWG.Packed156191284911481671
dawgdic: Data.DAWG.Dictionary30881413281603600
dawgdic: Guide (w/ Dictionary)46402120002405408
  • features:
featuredawgdawg-ordpacked-dawgdawgdic
build from list++++
persistence+-++
insertDynamic+-Builder
delete keyDynamic+--
follow characterStatic~edgeslookupPrefix+
lookup value++-+
member~lookup~lookup++
keys++++
values++-+
toList (assocs)++-+
complete word~submap-~toList+
fuzzy search----
max sizeN/AN/A2^22 nodesN/A

Installation

Add dawgdic dependency to your project and run cabal build.

Building and Querying

Building DAWG from lexicon of words and ignoring insertion failures is as simple as this:

>>> import Data.DAWG.DAWG
>>> dawg <- fromAscList . lines =<< readFile "/path/to/lexicon"

Otherwise, consider mutable builder. It could also be useful if words are associated with values.

buildOrError content = do
  dawgBuilder <- new
  forM_ content \(word, value) -> do
    result <- insert word (Just value) dawgBuilder
    unless result $ error "Insert failed"
  freeze dawgBuilder

Building dictionary and guide:

>>> import qualified Data.DAWG.Dictionary as D
>>> import qualified Data.DAWG.Guide as G
>>> dict <- D.build' dawg
>>> guide <- G.build' dawg dict

Saving dictionary and guide:

>>> D.write "dict.dawg" dict
>>> G.write "guide.dawg" guide

Loading dictionary and guide:

>>> dict <- D.load "dict.dawg"
>>> guide <- G.load "guide.dawg"

Consider using Completer for auto-complete-like queries or if you need to obtain lexicon back from Dictionary and Guide.

Benchmarks

Utilities

Benchmark                             default(μs)
------------------------------------- ------------
Utilities/10/Dawg.fromAscList             5549.76
Utilities/10/Dict.build'                 26981.64
Utilities/10/Dict.contains                  22.46
Utilities/10/Dict.lookup                    22.68
Utilities/10/Dict.follow                    22.42
Utilities/10/Guide.build'                   76.81
Utilities/10/Completer.completeKeys        385.35
------------------------------------- ------------
Utilities/100/Dawg.fromAscList           66486.21
Utilities/100/Dict.build'               194881.38
Utilities/100/Dict.contains                326.16
Utilities/100/Dict.lookup                  323.45
Utilities/100/Dict.follow                  319.86
Utilities/100/Guide.build'                 782.24
Utilities/100/Completer.completeKeys      7016.36
------------------------------------- ------------
Utilities/1000/Dawg.fromAscList         888061.61
Utilities/1000/Dict.build'             1659798.44
Utilities/1000/Dict.contains              3627.54
Utilities/1000/Dict.lookup                3638.10
Utilities/1000/Dict.follow                3564.64
Utilities/1000/Guide.build'               7992.73
Utilities/1000/Completer.completeKeys    82343.20
------------------------------------- ------------

How to reproduce:

  • Install bench-show:
cabal install bench-show --overwrite-policy=always
  • Run and wait (it might take around 3m to complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"
  • Generate report:
bench-show --presentation=Solo report results.csv

Comparison

Benchmark                      default(μs)
------------------------------ -----------
10/dawgdic.follow                    21.04
10/dawg.follow                       85.41
10/dawg-ord.follow                  110.42
10/packed-dawg.follow               147.30
100/dawgdic.follow                  311.05
100/dawg.follow                    1367.81
100/dawg-ord.follow                1774.78
100/packed-dawg.follow             2098.12
1000/dawgdic.follow                3273.02
1000/dawg.follow                  18842.47
1000/dawg-ord.follow              21992.03
1000/packed-dawg.follow           28938.01
------------------------------ -----------
10/dawgdic.lookup value              21.74
10/dawg.lookup value                 61.89
10/dawg-ord.lookup value            209.93
100/dawgdic.lookup value            290.13
100/dawg.lookup value               847.27
100/dawg-ord.lookup value          2973.35
1000/dawgdic.lookup value          3551.44
1000/dawg.lookup value            14217.29
1000/dawg-ord.lookup value        40111.28
10/dawgdic.member                    21.02
------------------------------ -----------
10/dawg.member                       56.43
10/dawg-ord.member                  191.74
10/packed-dawg.member               173.97
100/dawgdic.member                  343.32
100/dawg.member                     873.70
100/dawg-ord.member                2863.96
100/packed-dawg.member             2200.33
1000/dawgdic.member                9288.22
1000/dawg.member                  13878.72
1000/dawg-ord.member              31428.47
1000/packed-dawg.member           29089.83
------------------------------ -----------
10/dawgdic.keys                     108.33
10/dawg.keys                        100.23
10/dawg-ord.keys                    160.51
10/packed-dawg.keys                  50.50
100/dawgdic.keys                   1392.78
100/dawg.keys                      1413.00
100/dawg-ord.keys                  2324.99
100/packed-dawg.keys                692.23
1000/dawgdic.keys                 15628.83
1000/dawg.keys                    16541.79
1000/dawg-ord.keys                27612.77
1000/packed-dawg.keys              7415.48
------------------------------ -----------
10/dawgdic.values                    52.26
10/dawg.values                       72.29
10/dawg-ord.values                  134.79
100/dawgdic.values                  616.02
100/dawg.values                    1060.87
100/dawg-ord.values                1794.82
1000/dawgdic.values                6457.32
1000/dawg.values                  11734.25
1000/dawg-ord.values              21532.54
------------------------------ -----------
10/dawgdic.toList                   107.26
10/dawg.toList                      116.89
10/dawg-ord.toList                  202.35
100/dawgdic.toList                 1367.87
100/dawg.toList                    1608.26
100/dawg-ord.toList                2898.26
1000/dawgdic.toList               16334.34
1000/dawg.toList                  18578.59
1000/dawg-ord.toList              36360.44
------------------------------ -----------
10/dawgdic.complete word            367.97
10/dawg.complete word               248.14
10/packed-dawg.complete word        303.69
100/dawgdic.complete word          6151.51
100/dawg.complete word             4494.85
100/packed-dawg.complete word      4913.83
1000/dawgdic.complete word        71152.49
1000/dawg.complete word           58222.76
1000/packed-dawg.complete word    60828.53
------------------------------ -----------

How to reproduce:

  • Install bench-show:
cabal install bench-show --overwrite-policy=always
  • Run and wait (it might take around 7m to complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"
  • Generate report:
bench-show --presentation=Solo report results.csv

Contributing

In cases of issues please attach callstack, provide minimal dictionary lexicon and provide logs with enabled tracing.

cabal build -ftrace

Acknowledgments

Metadata

Version

0.1.0

Platforms (76)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows