Generation and traversal of compressed directed acyclic dawg graphs.
dawgdic
Overview
dawgdic is a library for building and accessing dictionaries implemented with directed acyclic word graphs (DAWG).
This is a ported version of C++ dawgdic library.
Features
This library offers DAWG, Dictionary, Guide and Completer data types as well as their builders.
DAWGrepresents word graph. It is provided as intermediate data structure.Dictionaryrepresents compact layout forDAWG. It offers API to check the presence of the word and associated value in the graph. Also it could be stored into a file and loaded from file.Guideis being used as a tree-like index to get the faster navigation through the dictionary. Also could be stored into a file and loaded from file.
Input characters must be in range 0-255.
This port does not contain RankedGuide and RankedCompleter (yet).
Comparison with other DAWG libraries
- dictionary file size, in bytes:
| package/lexicon size (words) | 2.6K | 26K | 370K |
|---|---|---|---|
| dawg: Data.DAWG.Dynamic | 370868 | 2823660 | 29933964 |
| dawg: Data.DAWG.Static | 133138 | 1030578 | 11128690 |
| dawg-ord: Data.DAWG.Int | N/A | N/A | N/A |
| dawg-ord: Data.DAWG.Ord | N/A | N/A | N/A |
| packed-dawg: Data.DAWG.Packed | 15619 | 128491 | 1481671 |
| dawgdic: Data.DAWG.Dictionary | 3088 | 141328 | 1603600 |
| dawgdic: Guide (w/ Dictionary) | 4640 | 212000 | 2405408 |
- features:
| feature | dawg | dawg-ord | packed-dawg | dawgdic |
|---|---|---|---|---|
| build from list | + | + | + | + |
| persistence | + | - | + | + |
| insert | Dynamic | + | - | Builder |
| delete key | Dynamic | + | - | - |
| follow character | Static | ~edges | lookupPrefix | + |
| lookup value | + | + | - | + |
| member | ~lookup | ~lookup | + | + |
| keys | + | + | + | + |
| values | + | + | - | + |
| toList (assocs) | + | + | - | + |
| complete word | ~submap | - | ~toList | + |
| fuzzy search | - | - | - | - |
| max size | N/A | N/A | 2^22 nodes | N/A |
Installation
Add dawgdic dependency to your project and run cabal build.
Building and Querying
Building DAWG from lexicon of words and ignoring insertion failures is as simple as this:
>>> import Data.DAWG.DAWG
>>> dawg <- fromAscList . lines =<< readFile "/path/to/lexicon"
Otherwise, consider mutable builder. It could also be useful if words are associated with values.
buildOrError content = do
dawgBuilder <- new
forM_ content \(word, value) -> do
result <- insert word (Just value) dawgBuilder
unless result $ error "Insert failed"
freeze dawgBuilder
Building dictionary and guide:
>>> import qualified Data.DAWG.Dictionary as D
>>> import qualified Data.DAWG.Guide as G
>>> dict <- D.build' dawg
>>> guide <- G.build' dawg dict
Saving dictionary and guide:
>>> D.write "dict.dawg" dict
>>> G.write "guide.dawg" guide
Loading dictionary and guide:
>>> dict <- D.load "dict.dawg"
>>> guide <- G.load "guide.dawg"
Consider using Completer for auto-complete-like queries or if you need to obtain lexicon back from Dictionary and Guide.
Benchmarks
Utilities
Benchmark default(μs)
------------------------------------- ------------
Utilities/10/Dawg.fromAscList 5549.76
Utilities/10/Dict.build' 26981.64
Utilities/10/Dict.contains 22.46
Utilities/10/Dict.lookup 22.68
Utilities/10/Dict.follow 22.42
Utilities/10/Guide.build' 76.81
Utilities/10/Completer.completeKeys 385.35
------------------------------------- ------------
Utilities/100/Dawg.fromAscList 66486.21
Utilities/100/Dict.build' 194881.38
Utilities/100/Dict.contains 326.16
Utilities/100/Dict.lookup 323.45
Utilities/100/Dict.follow 319.86
Utilities/100/Guide.build' 782.24
Utilities/100/Completer.completeKeys 7016.36
------------------------------------- ------------
Utilities/1000/Dawg.fromAscList 888061.61
Utilities/1000/Dict.build' 1659798.44
Utilities/1000/Dict.contains 3627.54
Utilities/1000/Dict.lookup 3638.10
Utilities/1000/Dict.follow 3564.64
Utilities/1000/Guide.build' 7992.73
Utilities/1000/Completer.completeKeys 82343.20
------------------------------------- ------------
How to reproduce:
- Install
bench-show:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around
3mto complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS --benchmark-options="--output bench.html --csv results.csv"
- Generate report:
bench-show --presentation=Solo report results.csv
Comparison
Benchmark default(μs)
------------------------------ -----------
10/dawgdic.follow 21.04
10/dawg.follow 85.41
10/dawg-ord.follow 110.42
10/packed-dawg.follow 147.30
100/dawgdic.follow 311.05
100/dawg.follow 1367.81
100/dawg-ord.follow 1774.78
100/packed-dawg.follow 2098.12
1000/dawgdic.follow 3273.02
1000/dawg.follow 18842.47
1000/dawg-ord.follow 21992.03
1000/packed-dawg.follow 28938.01
------------------------------ -----------
10/dawgdic.lookup value 21.74
10/dawg.lookup value 61.89
10/dawg-ord.lookup value 209.93
100/dawgdic.lookup value 290.13
100/dawg.lookup value 847.27
100/dawg-ord.lookup value 2973.35
1000/dawgdic.lookup value 3551.44
1000/dawg.lookup value 14217.29
1000/dawg-ord.lookup value 40111.28
10/dawgdic.member 21.02
------------------------------ -----------
10/dawg.member 56.43
10/dawg-ord.member 191.74
10/packed-dawg.member 173.97
100/dawgdic.member 343.32
100/dawg.member 873.70
100/dawg-ord.member 2863.96
100/packed-dawg.member 2200.33
1000/dawgdic.member 9288.22
1000/dawg.member 13878.72
1000/dawg-ord.member 31428.47
1000/packed-dawg.member 29089.83
------------------------------ -----------
10/dawgdic.keys 108.33
10/dawg.keys 100.23
10/dawg-ord.keys 160.51
10/packed-dawg.keys 50.50
100/dawgdic.keys 1392.78
100/dawg.keys 1413.00
100/dawg-ord.keys 2324.99
100/packed-dawg.keys 692.23
1000/dawgdic.keys 15628.83
1000/dawg.keys 16541.79
1000/dawg-ord.keys 27612.77
1000/packed-dawg.keys 7415.48
------------------------------ -----------
10/dawgdic.values 52.26
10/dawg.values 72.29
10/dawg-ord.values 134.79
100/dawgdic.values 616.02
100/dawg.values 1060.87
100/dawg-ord.values 1794.82
1000/dawgdic.values 6457.32
1000/dawg.values 11734.25
1000/dawg-ord.values 21532.54
------------------------------ -----------
10/dawgdic.toList 107.26
10/dawg.toList 116.89
10/dawg-ord.toList 202.35
100/dawgdic.toList 1367.87
100/dawg.toList 1608.26
100/dawg-ord.toList 2898.26
1000/dawgdic.toList 16334.34
1000/dawg.toList 18578.59
1000/dawg-ord.toList 36360.44
------------------------------ -----------
10/dawgdic.complete word 367.97
10/dawg.complete word 248.14
10/packed-dawg.complete word 303.69
100/dawgdic.complete word 6151.51
100/dawg.complete word 4494.85
100/packed-dawg.complete word 4913.83
1000/dawgdic.complete word 71152.49
1000/dawg.complete word 58222.76
1000/packed-dawg.complete word 60828.53
------------------------------ -----------
How to reproduce:
- Install
bench-show:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around
7mto complete):
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS --benchmark-options="--output bench.html --csv results.csv"
- Generate report:
bench-show --presentation=Solo report results.csv
Contributing
In cases of issues please attach callstack, provide minimal dictionary lexicon and provide logs with enabled tracing.
cabal build -ftrace
Acknowledgments
- Susumu Yata as original author of C++ library.