Unicode normalization.
Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).
Unicode Transforms
Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).
What is normalization?
Unicode characters with adornments (e.g. Á) can be represented in two different forms, as a single composed character (U+00C1 = Á) or as multiple decomposed characters (U+0041(A) U+0301( ́ ) = Á). They are differently encoded byte sequences but for humans they have exactly the same visual appearance.
A regular byte comparison may tell that two strings are different even though they might be equivalent. We need to convert both the strings in a normalized
form using the Unicode Character Database before we can compare them for equivalence. For example:
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
Performance
Normalization performance comparison of this package (v0.3.7) with the text-icu package using the ICU C++ library version ICU4C 65.1 on macOS. The benchmarks compare the time taken in milliseconds to normalize files in different languages and normalization forms using both the packages. In most cases unicode-transforms
outperforms ICU.
Benchmark unicode-transforms(ms) ICU(ms) % Diff
--------------- ---------------------- ------- --------
NFKD/Korean 7.78 37.10 +376.87
NFD/Korean 7.86 37.06 +371.50
NFKD/Vietnamese 6.85 12.48 +82.20
NFKD/Deutsch 2.17 3.55 +63.30
NFKD/English 1.71 2.78 +62.30
NFKC/Korean 4.77 7.65 +60.28
NFD/Deutsch 2.24 3.53 +57.41
NFD/English 1.76 2.77 +57.32
NFC/Vietnamese 10.66 16.63 +56.00
NFKC/Vietnamese 10.95 16.58 +51.43
NFD/Devanagari 6.48 8.68 +34.10
NFC/Devanagari 6.77 8.49 +25.48
NFD/AllChars 6.18 7.41 +19.91
NFD/Japanese 7.80 9.20 +17.99
NFKC/Devanagari 7.33 8.48 +15.74
NFKD/Japanese 8.71 10.05 +15.39
NFD/Vietnamese 5.94 6.83 +14.99
NFKD/Devanagari 7.59 8.68 +14.27
NFKD/AllChars 9.80 10.66 +8.82
NFKC/Deutsch 3.21 3.18 -0.72
NFC/Korean 4.62 4.38 -5.35
NFKC/English 2.21 2.06 -6.88
NFC/English 2.19 2.04 -7.21
NFKC/AllChars 14.67 9.75 -50.51
NFC/Deutsch 3.02 1.95 -54.39
NFKC/Japanese 12.46 5.42 -129.93
NFC/AllChars 9.72 3.58 -171.63
NFC/Japanese 11.90 3.04 -292.04
Talks
Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise issues, or send pull requests.