Description
A variant of tokenizer-monad that supports streaming.
This monad transformer is a modification of tokenizer-monad that can work on streams of text/string chunks or even on (Unicode) bytestring streams.
README.md
tokenizer-streaming
Motivation: You might have stumbled upon the package tokenizer-monad. It is another project of mine for writing tokenizers that act on pure text/strings. However, there are situations where you cannot keep all the text in memory: you might want to tokenize text coming from network streams or from large corpus files.
Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, so that all tokenizers can be ported without code changes (provided you used MonadTokenizer in the type signatures).
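As a minimal sketch of what such a portable tokenizer looks like: the combinators below (untilEOT, pop, discard, walkWhile, emit, runTokenizer) are assumed from tokenizer-monad's own documentation, and isStopSym is a helper defined just for this example. Because wordTokenizer is written only against the MonadTokenizer class, the same definition can later be run through TokenizerT.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad.Tokenizer  -- from tokenizer-monad; combinator names assumed from its docs
import qualified Data.Char as Char
import qualified Data.Text as T

-- A simple word tokenizer written only against the MonadTokenizer class,
-- so it is not tied to the pure Tokenizer monad.
wordTokenizer :: MonadTokenizer m => m ()
wordTokenizer = untilEOT $ do
  c <- pop                          -- consume the next character
  if isStopSym c
    then discard                    -- separators do not become tokens
    else do
      walkWhile (not . isStopSym)   -- extend the token up to the next separator
      emit                          -- yield the walked-over characters as one token

-- Helper defined for this example.
isStopSym :: Char -> Bool
isStopSym c = Char.isSpace c || Char.isPunctuation c

-- Pure, in-memory usage via tokenizer-monad. The very same wordTokenizer
-- can be handed to tokenizer-streaming's TokenizerT runner unchanged.
main :: IO ()
main = mapM_ print (runTokenizer wordTokenizer ("Hello, streaming world!" :: T.Text))
```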
Supported text types
- streams of Char lists can be tokenized into streams of Char lists
- streams of strict Text can be tokenized into streams of strict Text
- streams of lazy Text can be tokenized into streams of lazy Text
- streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
- streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
- bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text.
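The sketch below illustrates the strict Text case from the list above: a stream of Text chunks (built with the streaming package's Streaming.Prelude) is tokenized into a stream of Text tokens. The module name Control.Monad.Tokenizer.Streaming and the runner runTokenizerT are assumptions made for illustration; check the package's Haddocks for the actual names and signatures.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad.Tokenizer (MonadTokenizer, untilEOT, pop, discard, walkWhile, emit)
-- Assumed module and runner name; consult tokenizer-streaming's Haddocks:
import Control.Monad.Tokenizer.Streaming (runTokenizerT)
import qualified Data.Char as Char
import qualified Data.Text as T
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S

-- The same class-polymorphic tokenizer as before: it neither knows nor
-- cares that the characters now arrive in chunks.
wordTokenizer :: MonadTokenizer m => m ()
wordTokenizer = untilEOT $ do
  c <- pop
  if stop c
    then discard
    else walkWhile (not . stop) >> emit

stop :: Char -> Bool
stop c = Char.isSpace c || Char.isPunctuation c

-- Chunk boundaries deliberately fall inside words: the transformer has to
-- buffer enough input to emit tokens that span chunks.
chunks :: Monad m => Stream (Of T.Text) m ()
chunks = S.each ["Hello, strea", "ming wor", "ld!"]

main :: IO ()
main =
  -- Assumed shape of the runner: it turns a stream of Text chunks into a
  -- stream of Text tokens while running wordTokenizer over it.
  S.mapM_ print (runTokenizerT wordTokenizer chunks)
```

The other rows of the list work the same way, only with lazy Text, ASCII ByteStrings, or a streaming-bytestring source decoded from one of the listed Unicode encodings as the chunk type.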