Description
Approximate String Matching, Fuzzy Text Search, and String Distance Functions.
Description
Implements an approximate string matching version of R's native 'match' function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences. This package is built for speed and runs in parallel by using 'openMP'. An API for C or C++ is exposed as well. Reference: MPJ van der Loo (2014) <doi:10.32614/RJ-2014-011>.
README.md
stringdist
- Approximate matching and string distance calculations for R.
- All distance and matching operations are system- and encoding-independent.
- Built for speed, using openMP for parallel computing.
The package offers the following main functions:
- stringdist computes pairwise distances between two input character vectors (the shorter one is recycled).
- stringdistmatrix computes the distance matrix for one or two vectors.
- stringsim computes a string similarity between 0 and 1, based on stringdist.
- amatch is a fuzzy matching equivalent of R's native match function.
- ain is a fuzzy matching equivalent of R's native %in% operator.
- seq_dist, seq_distmatrix, seq_amatch and seq_ain for distances between, and matching of, integer sequences.
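As a rough illustration of the main functions listed above, the following sketch calls each of them on small inputs. The example strings are arbitrary, and the expected values in the comments assume the package defaults apart from the arguments shown.

```r
library(stringdist)

# Pairwise distances; the shorter argument ("kitten") is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")
print(d)  # 3 1

# Distance matrix between two vectors (rows follow the first argument).
m <- stringdistmatrix(c("foo", "bar"), c("fo", "baz", "boo"), method = "lv")
print(dim(m))  # 2 3

# Similarity in [0, 1]; identical strings score 1.
s <- stringsim("kitten", "kitten", method = "lv")
print(s)  # 1

# Fuzzy analogue of match(): position of the closest table entry
# within maxDist, or NA when nothing is close enough.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)
print(idx)  # 2

# Fuzzy analogue of %in%.
found <- ain("leia", c("uhura", "leela"), maxDist = 2)
print(found)  # TRUE

# The seq_* variants take integer sequences instead of strings.
sq <- seq_dist(c(1L, 2L, 3L), c(1L, 3L, 2L), method = "osa")
print(sq)  # 1
```

Here amatch and ain use the default method (optimal string alignment), so "leia" matches "leela" at distance 2 but would not match with maxDist = 1.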
These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include:
- Hamming distance
- Levenshtein distance (weighted)
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment)
- Full Damerau-Levenshtein distance
- Longest Common Substring distance
- Q-gram distance
- cosine distance for q-gram count vectors (= 1-cosine similarity)
- Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
- Jaro, and Jaro-Winkler distance
- Soundex-based string distance
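To get a feel for how these distances differ, one option is to evaluate them all on the same pair of strings. The sketch below does that through the method argument of stringdist; the pair "ca" / "abc" is chosen only because it separates the restricted and full Damerau-Levenshtein distances.

```r
library(stringdist)

methods <- c("hamming", "lv", "osa", "dl", "lcs",
             "qgram", "cosine", "jaccard", "jw", "soundex")

# One distance per method for the same pair of strings.
# q defaults to 1 for the q-gram based methods; "jw" with the default
# p = 0 is the plain Jaro distance. Hamming is Inf here because the
# strings differ in length.
d <- sapply(methods, function(m) stringdist("ca", "abc", method = m))
print(d)

# OSA forbids editing a substring twice, full Damerau-Levenshtein does not,
# so the two differ on this pair.
print(d[["osa"]])  # 3
print(d[["dl"]])   # 2

# Jaro-Winkler: a prefix factor p > 0 rewards a shared prefix.
jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)
print(jw)
```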
Also, there are some utility functions:
- qgrams() tabulates the q-grams in one or more character vectors.
- seq_qgrams() tabulates the q-grams (sometimes called n-grams) in one or more integer vectors.
- phonetic() computes phonetic codes of strings (currently only soundex).
- printable_ascii() is a utility function that detects non-printable ASCII or non-ASCII characters.
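A short sketch of the utility functions follows. The inputs are made up for illustration, and outputs are only noted where they follow directly from the soundex definition.

```r
library(stringdist)

# Tabulate the 2-grams of a string; cells hold the q-gram counts.
tab <- qgrams("banana", q = 2)
print(tab)

# Same idea for an integer sequence.
seq_tab <- seq_qgrams(c(1L, 2L, 1L, 2L), q = 2)
print(seq_tab)

# Soundex codes: similar-sounding names share a code.
codes <- phonetic(c("Robert", "Rupert"), method = "soundex")
print(codes)  # "R163" "R163"

# Check strings for non-printable ASCII or non-ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))
print(ok)
```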
C API
Some of stringdist's underlying C functions can be called directly from C code in other packages. The description of the API can be found either by typing ?stringdist_api in the R console or by opening the vignette directly as follows:
vignette("stringdist_C-Cpp_api", package="stringdist")
Examples of packages that link to stringdist can be found here and here.