Description

Generation and traversal of compressed directed acyclic dawg graphs.

README.md

dawgdic

Overview

dawgdic is a library for building and accessing dictionaries implemented with directed acyclic word graphs (DAWG).

This is a ported version of C++ dawgdic library.

Features

This library offers DAWG, Dictionary, Guide and Completer data types as well as their builders.

DAWG represents word graph. It is provided as intermediate data structure.
Dictionary represents compact layout for DAWG. It offers API to check the presence of the word and associated value in the graph. Also it could be stored into a file and loaded from file.
Guide is being used as a tree-like index to get the faster navigation through the dictionary. Also could be stored into a file and loaded from file.

Input characters must be in range 0-255.

This port does not contain RankedGuide and RankedCompleter (yet).

Comparison with other DAWG libraries

dictionary file size, in bytes:

package/lexicon size (words)	2.6K	26K	370K
dawg: Data.DAWG.Dynamic	370868	2823660	29933964
dawg: Data.DAWG.Static	133138	1030578	11128690
dawg-ord: Data.DAWG.Int	N/A	N/A	N/A
dawg-ord: Data.DAWG.Ord	N/A	N/A	N/A
packed-dawg: Data.DAWG.Packed	15619	128491	1481671
dawgdic: Data.DAWG.Dictionary	3088	141328	1603600
dawgdic: Guide (w/ Dictionary)	4640	212000	2405408

features:

feature	dawg	dawg-ord	packed-dawg	dawgdic
build from list	+	+	+	+
persistence	+	-	+	+
insert	Dynamic	+	-	Builder
delete key	Dynamic	+	-	-
follow character	Static	~edges	lookupPrefix	+
lookup value	+	+	-	+
member	~lookup	~lookup	+	+
keys	+	+	+	+
values	+	+	-	+
toList (assocs)	+	+	-	+
complete word	~submap	-	~toList	+
fuzzy search	-	-	-	-
max size	N/A	N/A	2^22 nodes	N/A

Installation

Add dawgdic dependency to your project and run cabal build.

Building and Querying

Building DAWG from lexicon of words and ignoring insertion failures is as simple as this:

>>> import Data.DAWG.DAWG
>>> dawg <- fromAscList . lines =<< readFile "/path/to/lexicon"

Otherwise, consider mutable builder. It could also be useful if words are associated with values.

buildOrError content = do
  dawgBuilder <- new
  forM_ content \(word, value) -> do
    result <- insert word (Just value) dawgBuilder
    unless result $ error "Insert failed"
  freeze dawgBuilder

Building dictionary and guide:

>>> import qualified Data.DAWG.Dictionary as D
>>> import qualified Data.DAWG.Guide as G
>>> dict <- D.build' dawg
>>> guide <- G.build' dawg dict

Saving dictionary and guide:

>>> D.write "dict.dawg" dict
>>> G.write "guide.dawg" guide

Loading dictionary and guide:

>>> dict <- D.load "dict.dawg"
>>> guide <- G.load "guide.dawg"

Consider using Completer for auto-complete-like queries or if you need to obtain lexicon back from Dictionary and Guide.

Benchmarks

Utilities

Benchmark                             default(μs)
------------------------------------- ------------
Utilities/10/Dawg.fromAscList             5549.76
Utilities/10/Dict.build'                 26981.64
Utilities/10/Dict.contains                  22.46
Utilities/10/Dict.lookup                    22.68
Utilities/10/Dict.follow                    22.42
Utilities/10/Guide.build'                   76.81
Utilities/10/Completer.completeKeys        385.35
------------------------------------- ------------
Utilities/100/Dawg.fromAscList           66486.21
Utilities/100/Dict.build'               194881.38
Utilities/100/Dict.contains                326.16
Utilities/100/Dict.lookup                  323.45
Utilities/100/Dict.follow                  319.86
Utilities/100/Guide.build'                 782.24
Utilities/100/Completer.completeKeys      7016.36
------------------------------------- ------------
Utilities/1000/Dawg.fromAscList         888061.61
Utilities/1000/Dict.build'             1659798.44
Utilities/1000/Dict.contains              3627.54
Utilities/1000/Dict.lookup                3638.10
Utilities/1000/Dict.follow                3564.64
Utilities/1000/Guide.build'               7992.73
Utilities/1000/Completer.completeKeys    82343.20
------------------------------------- ------------

How to reproduce:

Install bench-show:

cabal install bench-show --overwrite-policy=always

Run and wait (it might take around 3m to complete):

time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"

Generate report:

bench-show --presentation=Solo report results.csv

Comparison

Benchmark                      default(μs)
------------------------------ -----------
10/dawgdic.follow                    21.04
10/dawg.follow                       85.41
10/dawg-ord.follow                  110.42
10/packed-dawg.follow               147.30
100/dawgdic.follow                  311.05
100/dawg.follow                    1367.81
100/dawg-ord.follow                1774.78
100/packed-dawg.follow             2098.12
1000/dawgdic.follow                3273.02
1000/dawg.follow                  18842.47
1000/dawg-ord.follow              21992.03
1000/packed-dawg.follow           28938.01
------------------------------ -----------
10/dawgdic.lookup value              21.74
10/dawg.lookup value                 61.89
10/dawg-ord.lookup value            209.93
100/dawgdic.lookup value            290.13
100/dawg.lookup value               847.27
100/dawg-ord.lookup value          2973.35
1000/dawgdic.lookup value          3551.44
1000/dawg.lookup value            14217.29
1000/dawg-ord.lookup value        40111.28
10/dawgdic.member                    21.02
------------------------------ -----------
10/dawg.member                       56.43
10/dawg-ord.member                  191.74
10/packed-dawg.member               173.97
100/dawgdic.member                  343.32
100/dawg.member                     873.70
100/dawg-ord.member                2863.96
100/packed-dawg.member             2200.33
1000/dawgdic.member                9288.22
1000/dawg.member                  13878.72
1000/dawg-ord.member              31428.47
1000/packed-dawg.member           29089.83
------------------------------ -----------
10/dawgdic.keys                     108.33
10/dawg.keys                        100.23
10/dawg-ord.keys                    160.51
10/packed-dawg.keys                  50.50
100/dawgdic.keys                   1392.78
100/dawg.keys                      1413.00
100/dawg-ord.keys                  2324.99
100/packed-dawg.keys                692.23
1000/dawgdic.keys                 15628.83
1000/dawg.keys                    16541.79
1000/dawg-ord.keys                27612.77
1000/packed-dawg.keys              7415.48
------------------------------ -----------
10/dawgdic.values                    52.26
10/dawg.values                       72.29
10/dawg-ord.values                  134.79
100/dawgdic.values                  616.02
100/dawg.values                    1060.87
100/dawg-ord.values                1794.82
1000/dawgdic.values                6457.32
1000/dawg.values                  11734.25
1000/dawg-ord.values              21532.54
------------------------------ -----------
10/dawgdic.toList                   107.26
10/dawg.toList                      116.89
10/dawg-ord.toList                  202.35
100/dawgdic.toList                 1367.87
100/dawg.toList                    1608.26
100/dawg-ord.toList                2898.26
1000/dawgdic.toList               16334.34
1000/dawg.toList                  18578.59
1000/dawg-ord.toList              36360.44
------------------------------ -----------
10/dawgdic.complete word            367.97
10/dawg.complete word               248.14
10/packed-dawg.complete word        303.69
100/dawgdic.complete word          6151.51
100/dawg.complete word             4494.85
100/packed-dawg.complete word      4913.83
1000/dawgdic.complete word        71152.49
1000/dawg.complete word           58222.76
1000/packed-dawg.complete word    60828.53
------------------------------ -----------

How to reproduce:

Install bench-show:

cabal install bench-show --overwrite-policy=always

Run and wait (it might take around 7m to complete):

time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"

Generate report:

bench-show --presentation=Solo report results.csv

Contributing

In cases of issues please attach callstack, provide minimal dictionary lexicon and provide logs with enabled tracing.

cabal build -ftrace

Acknowledgments

Susumu Yata as original author of C++ library.

dawgdic

dawgdic

Overview

Features

Comparison with other DAWG libraries

Installation

Building and Querying

Benchmarks

Utilities

How to reproduce:

Comparison

How to reproduce:

Contributing

Acknowledgments

Version

License

Status

Source

Homepage

Platforms (76)

dawgdic

Overview

Features

Comparison with other DAWG libraries

Installation

Building and Querying

Benchmarks

Utilities

How to reproduce:

Comparison

How to reproduce:

Contributing

Acknowledgments

Version

License

Status

Source

Homepage

Platforms76 (76)

Platforms (76)