Description
Approximate String Matching, Fuzzy Text Search, and String Distance Functions.
Implements an approximate string matching version of R's native 'match' function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences. This package is built for speed and runs in parallel by using 'OpenMP'. An API for C or C++ is exposed as well. Reference: MPJ van der Loo (2014) <doi:10.32614/RJ-2014-011>.
README.md
stringdist
- Approximate matching and string distance calculations for R.
- All distance and matching operations are system- and encoding-independent.
- Built for speed, using OpenMP for parallel computing.
The package offers the following main functions:
- stringdist computes pairwise distances between two input character vectors (the shorter one is recycled)
- stringdistmatrix computes the distance matrix for one or two vectors
- stringsim computes a string similarity between 0 and 1, based on stringdist
- amatch is a fuzzy matching equivalent of R's native match function
- ain is a fuzzy matching equivalent of R's native %in% operator
- seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences.
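As a rough illustration of the functions listed above, the following sketch calls each of them on a few made-up strings. The input values and variable names are chosen only for this example; the default distance method (optimal string alignment) is assumed throughout.

```r
library(stringdist)

# Element-wise distances: the single string "hello" is recycled over the
# three strings in the second argument.
d <- stringdist("hello", c("hallo", "help", "world"))

# Full distance matrix: one row per element of the first vector,
# one column per element of the second.
m <- stringdistmatrix(c("foo", "bar"), c("fop", "baz", "boo"))

# Similarity scaled to the interval [0, 1]; 1 means identical strings.
s <- stringsim("hello", "hallo")

# Fuzzy counterparts of match() and %in%: a candidate counts as a match
# when its distance to the query is at most maxDist.
idx   <- amatch("leia", c("uhura", "leela"), maxDist = 2)
found <- ain("leia", c("uhura", "leela"), maxDist = 2)

# The seq_* variants work on integer sequences instead of strings.
seq_d <- seq_dist(list(c(1L, 2L, 3L)), list(c(1L, 3L, 2L)))
```

Here `idx` is the position of the closest acceptable candidate in the lookup table, and it would be `NA` if no candidate were within `maxDist`.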
These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:
- Hamming distance
- Levenshtein distance (weighted)
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment)
- Full Damerau-Levenshtein distance
- Longest Common Substring distance
- Q-gram distance
- cosine distance for q-gram count vectors (= 1-cosine similarity)
- Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
- Jaro and Jaro-Winkler distance
- Soundex-based string distance
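The distances above are selected through the `method` argument of `stringdist`. The sketch below compares a few of them on small, made-up inputs; the strings and variable names are illustrative only.

```r
library(stringdist)

# Restricted (osa) versus full Damerau-Levenshtein: the classic pair
# "ca" / "abc" separates the two, because osa may not edit the same
# substring twice.
osa <- stringdist("ca", "abc", method = "osa")
dl  <- stringdist("ca", "abc", method = "dl")

# Weighted edits: penalties for deletion, insertion, substitution and
# transposition, in that order. Here a transposition costs 0.5.
w <- stringdist("ab", "ba", method = "osa", weight = c(1, 1, 1, 0.5))

# Hamming distance is only finite for strings of equal length.
ham     <- stringdist("hello", "hallo", method = "hamming")
ham_inf <- stringdist("abc", "abcd", method = "hamming")

# Longest common substring distance.
lcs <- stringdist("hello", "hallo", method = "lcs")

# q-gram based distances, using 2-grams.
qg  <- stringdist("abc", "bcd", method = "qgram",   q = 2)
cs  <- stringdist("abc", "bcd", method = "cosine",  q = 2)
jac <- stringdist("abc", "bcd", method = "jaccard", q = 2)

# Jaro (p = 0) and Jaro-Winkler (p > 0 rewards a shared prefix).
jaro <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0)
jw   <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)

# Soundex distance: 0 when the phonetic codes agree, 1 otherwise.
sx <- stringdist("Robert", "Rupert", method = "soundex")
```

Edit-based methods return counts (or weighted sums) of edits, while the cosine, Jaccard and Jaro-type distances fall between 0 and 1, so values from different methods are not directly comparable.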
Also, there are some utility functions:
- qgrams() tabulates the q-grams in one or more character vectors.
- seq_qgrams() tabulates the q-grams (sometimes called n-grams) in one or more integer vectors.
- phonetic() computes phonetic codes of strings (currently only soundex).
- printable_ascii() is a utility function that detects non-printable ASCII or non-ASCII characters.
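A short sketch of these utilities follows. The inputs and the argument name `x` are invented for the example, and the exact layout of the returned tables is best checked in the respective help pages.

```r
library(stringdist)

# Tabulate the 2-grams of a string. The result has one row per input
# argument and one column per distinct q-gram.
qg <- qgrams(x = "hello", q = 2)

# The same idea for an integer sequence.
sq <- seq_qgrams(x = c(1L, 2L, 3L, 2L, 3L), q = 2)

# Soundex codes: names that sound alike map to the same code.
codes <- phonetic(c("Robert", "Rupert"))

# TRUE for strings made up solely of printable ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))
```

Checking input with `printable_ascii()` first can be useful because soundex is defined for ASCII letters, so `phonetic()` may warn on other characters.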
C API
Some of stringdist's underlying C functions can be called directly from C code in other packages. The description of the API can be found either by typing ?stringdist_api in the R console or by opening the vignette directly as follows:
vignette("stringdist_C-Cpp_api", package="stringdist")
Examples of packages that link to stringdist can be found here and here.