Grammar-based compression algorithm SEQUITUR.
Please see the README on GitHub at https://github.com/msakai/haskell-sequitur#readme
Haskell implementation of the SEQUITUR algorithm
About SEQUITUR
SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of the original sequence and can be used for data compression.
Example:
Input string:
abcabcabcabcabc
Resulting grammar:
S → AAB
A → BB
B → abc
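To check that the grammar above really encodes the input, one can expand it back into a string. The following is a minimal sketch and not the library's API: `Rules`, `exampleRules`, and `expand` are hypothetical names introduced here. It assumes that any character without a rule is a terminal and that the grammar is acyclic.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical representation for this sketch only:
-- each rule maps a one-character name to its right-hand side.
type Rules = Map.Map Char String

-- The example grammar: S → AAB, A → BB, B → abc.
exampleRules :: Rules
exampleRules = Map.fromList
  [ ('S', "AAB")
  , ('A', "BB")
  , ('B', "abc")
  ]

-- Recursively replace a symbol by the expansion of its rule.
-- A symbol that has no rule is treated as a terminal and kept as-is.
expand :: Rules -> Char -> String
expand rules sym =
  case Map.lookup sym rules of
    Just rhs -> concatMap (expand rules) rhs
    Nothing  -> [sym]
```

In GHCi, `expand exampleRules 'S'` evaluates to `"abcabcabcabcabc"`, which is the original input string.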
SEQUITUR consumes input symbols one by one and appends each symbol at the end of the grammar's start production (S in the above example), then substitutes repeating patterns in the given sequence with new rules. SEQUITUR maintains two invariants:
- Digram Uniqueness: SEQUITUR ensures that no digram (a.k.a. bigram) occurs more than once in the grammar. If a digram (e.g. ab) occurs twice, SEQUITUR introduces a fresh non-terminal symbol (e.g. M) and a rule (e.g. M → ab) and replaces the original occurrences with the newly introduced non-terminal. The one exception is when the two occurrences overlap (for example, the two occurrences of aa in aaa).
- Rule Utility: If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitutes the occurrence with the right-hand side of the rule.
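The two invariants can be stated as checks on a finished grammar. The sketch below is only an illustration of the properties, not the incremental algorithm and not the library's implementation. The types `Symbol` and `Rules` and the functions `digrams`, `digramUniqueness`, and `ruleUtility` are hypothetical names defined here, and rule 0 is assumed to be the start production.

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.Map.Strict as Map

-- Hypothetical types for this sketch only.
data Symbol a = NonTerminal Int | Terminal a
  deriving (Eq, Ord, Show)

type Rules a = IntMap.IntMap [Symbol a]

-- Adjacent pairs of a rule body. In a run such as "aaa" the second,
-- overlapping occurrence of the pair is skipped, matching the exception
-- for overlapping digrams.
digrams :: Eq s => [s] -> [(s, s)]
digrams (x : y : z : rest)
  | x == y && y == z = (x, y) : digrams (z : rest)
  | otherwise        = (x, y) : digrams (y : z : rest)
digrams [x, y] = [(x, y)]
digrams _      = []

-- Digram Uniqueness: every digram occurs exactly once across all rule bodies.
digramUniqueness :: Ord a => Rules a -> Bool
digramUniqueness rules = all (== 1) (Map.elems counts)
  where
    counts = Map.fromListWith (+)
      [ (d, 1 :: Int) | body <- IntMap.elems rules, d <- digrams body ]

-- Rule Utility: every rule except the start rule (assumed to be rule 0)
-- is referenced at least twice.
ruleUtility :: Rules a -> Bool
ruleUtility rules = all usedTwice (filter (/= 0) (IntMap.keys rules))
  where
    uses = IntMap.fromListWith (+)
      [ (n, 1 :: Int) | body <- IntMap.elems rules, NonTerminal n <- body ]
    usedTwice n = IntMap.findWithDefault 0 n uses >= 2

-- The grammar S → AAB, A → BB, B → abc, with S, A, B numbered 0, 1, 2.
exampleRules :: Rules Char
exampleRules = IntMap.fromList
  [ (0, [NonTerminal 1, NonTerminal 1, NonTerminal 2])
  , (1, [NonTerminal 2, NonTerminal 2])
  , (2, map Terminal "abc")
  ]

-- A grammar that violates Digram Uniqueness: "ab" occurs twice in rule 0.
repeatedRules :: Rules Char
repeatedRules = IntMap.fromList [(0, map Terminal "abab")]
```

In GHCi, `digramUniqueness exampleRules` and `ruleUtility exampleRules` both evaluate to `True`, while `digramUniqueness repeatedRules` evaluates to `False`.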
Usage
ghci> import Language.Grammar.Sequitur
ghci> encode "baaabacaa"
Grammar {unGrammar = fromList [(0,[NonTerminal 1,NonTerminal 2,NonTerminal 1,Terminal 'c',NonTerminal 2]),(1,[Terminal 'b',Terminal 'a']),(2,[Terminal 'a',Terminal 'a'])]}
The output represents the following grammar:
0 → 1 2 1 c 2
1 → b a
2 → a a
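The mapping from the printed value to the rule notation above can be sketched as a small pretty-printer. This is an illustration under stated assumptions: the constructor names `NonTerminal` and `Terminal` are copied from the printed output, but `Symbol`, `Rules`, `usageRules`, `renderSymbol`, and `renderRules` are local stand-ins defined here rather than imports from the library.

```haskell
import qualified Data.IntMap.Strict as IntMap

-- Local stand-ins that mirror the constructor names in the printed output;
-- they are not the library's own definitions.
data Symbol a = NonTerminal Int | Terminal a
  deriving (Eq, Show)

type Rules a = IntMap.IntMap [Symbol a]

-- The rules printed by the GHCi session above, transcribed by hand.
usageRules :: Rules Char
usageRules = IntMap.fromList
  [ (0, [NonTerminal 1, NonTerminal 2, NonTerminal 1, Terminal 'c', NonTerminal 2])
  , (1, [Terminal 'b', Terminal 'a'])
  , (2, [Terminal 'a', Terminal 'a'])
  ]

-- Non-terminals are shown by their rule number, terminals by the character itself.
renderSymbol :: Symbol Char -> String
renderSymbol (NonTerminal n) = show n
renderSymbol (Terminal c)    = [c]

-- One line per rule, in ascending order of rule number.
renderRules :: Rules Char -> [String]
renderRules rules =
  [ show n ++ " → " ++ unwords (map renderSymbol body)
  | (n, body) <- IntMap.toAscList rules
  ]
```

In GHCi, `mapM_ putStrLn (renderRules usageRules)` prints the three rule lines shown above, assuming the terminal uses a UTF-8 encoding for the arrow character.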
References
- Sequitur algorithm - Wikipedia
- sequitur.info
- Nevill-Manning, C.G. and Witten, I.H. (1997) "Identifying Hierarchical Structure in Sequences: A linear-time algorithm," Journal of Artificial Intelligence Research, 7, 67-82.
- nikitadanilov/sequuntur.