Custom 'MetaphoneBR' Phonetic Encoding for Brazilian Names.
metaphonebr
The goal of metaphonebr is to simplify Brazilian names phonetically using a custom metaphoneBR algorithm that preserves word-final vowels. It was created to aid dataset pairing when no unambiguous key is available.
Installation
The package is being submitted to CRAN. Once it is accepted, the stable version can be installed with:
install.packages("metaphonebr")
You can install the development version of metaphonebr from GitHub with:
# install.packages("remotes")
remotes::install_github("ipeadata-lab/metaphonebr")
Example
This is a basic example which shows how to use the main function:
# Names chosen so that several spelling variants should share a code
example_names <- c("João da Silva", "Maria", "Marya",
                   "Helena", "Elena", "Philippe", "Filipe", "Xavier", "Chavier")
phonetic_codes <- metaphonebr::metaphonebr(example_names)
print(data.frame(original = example_names, metaphonebr = phonetic_codes))
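Since the stated purpose is dataset pairing, the codes can serve as a join key. The following is only a rough sketch: the two data frames and their column names are made up for illustration, and the only package call used is `metaphonebr::metaphonebr()` as shown above. The pairing runs only if the package is installed.

```r
# Toy data invented for this sketch; not taken from any real dataset.
registry <- data.frame(id_a = 1:3,
                       name = c("Maria Souza", "Filipe Xavier", "Helena Costa"),
                       stringsAsFactors = FALSE)
survey <- data.frame(id_b = 1:3,
                     name = c("Marya Sousa", "Philippe Chavier", "Elena Kosta"),
                     stringsAsFactors = FALSE)

if (requireNamespace("metaphonebr", quietly = TRUE)) {
  # Encode the names on both sides with the package's main function
  registry$key <- metaphonebr::metaphonebr(registry$name)
  survey$key <- metaphonebr::metaphonebr(survey$name)

  # Rows that share a phonetic code become candidate pairs
  candidates <- merge(registry, survey, by = "key", suffixes = c("_a", "_b"))
  print(candidates)
}
```

Candidate pairs produced this way would normally still need review, because different names can share a code.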
The metaphoneBR phonetic encoding algorithm proceeds as follows:
- Initial Cleanup & Preparation:
- Remove all diacritics (e.g., “João” becomes “Joao”).
- Convert the entire string to uppercase (e.g., “Joao” becomes “JOAO”).
- Remove all characters that are not uppercase letters (A-Z) or spaces.
- Ensure single spaces between words and trim leading/trailing whitespace.
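The cleanup steps above can be sketched in base R as follows. This is an illustration, not the package's implementation: `clean_name()` is a hypothetical helper, the accent table covers only accented vowels (`Ç` is left to the consonant rules), and a UTF-8 session is assumed.

```r
# Hypothetical helper; not part of the metaphonebr API.
clean_name <- function(x) {
  # Assumption: only accented vowels are mapped; UTF-8 session required.
  accented <- "áàâãäéèêëíìîïóòôõöúùûüÁÀÂÃÄÉÈÊËÍÌÎÏÓÒÔÕÖÚÙÛÜ"
  plain    <- "aaaaaeeeeiiiiooooouuuuAAAAAEEEEIIIIOOOOOUUUU"
  x <- chartr(accented, plain, x)            # remove diacritics
  x <- toupper(x)                            # uppercase everything
  x <- gsub("[^A-Z ]", "", x, perl = TRUE)   # keep only A-Z and spaces
  x <- gsub(" +", " ", x)                    # collapse runs of spaces
  trimws(x)                                  # trim both ends
}

clean_name("  João   da Silva-2 ")
# "JOAO DA SILVA" in a UTF-8 session
```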
- Silent Letter Removal:
- Remove a silent ‘H’ if it appears at the beginning of any word (e.g., “Helena” becomes “Elena”).
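A minimal base R sketch of this rule, assuming the input is already cleaned and uppercased. `drop_initial_h()` is a hypothetical name, and the word-boundary regex is one possible way to express "at the beginning of any word".

```r
# Hypothetical helper; not part of the metaphonebr API.
drop_initial_h <- function(x) {
  # \\b matches a word boundary, so only a word-initial H is removed;
  # an H inside a word (as in LH, NH, CH) is left for later rules.
  gsub("\\bH", "", x, perl = TRUE)
}

drop_initial_h(c("HELENA HORTA", "THIAGO"))
# "ELENA ORTA" "THIAGO"
```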
- Digraph Simplification (Sound Grouping):
- `LH` is replaced by `1` (representing a palatal lateral approximant, like in “Filha” -> “FI1A”).
- `NH` is replaced by `3` (representing a palatal nasal, like in “Manhã” -> “MA3A”).
- `CH` is replaced by `X` (representing the /ʃ/ sound, like in “Chico” -> “XICO”).
- `SH` is replaced by `X` (for foreign names with /ʃ/ sound, like in “Shirley” -> “XIRLEY”).
- `SCH` is replaced by `X` (approximating /ʃ/ or /sk/, like in “Schmidt” -> “XMIT”).
- `PH` is replaced by `F` (like in “Philip” -> “FILIP”).
- `SC` followed by `E` or `I` becomes `S` (like in “SCENA” -> “SENA”).
- `SC` followed by `A`, `O`, or `U` becomes `SK` (like in “ESCOVA” -> “ESKOVA”).
- `QU` or `QÜ` followed by `E` or `I` becomes `K` (e.g., “QUEIJO” -> “KEIJO”).
- `GU` or `GÜ` followed by `E` or `I` becomes `G` (the `U` is silent, e.g., “GUERRA” -> “GERRA”).
- Any remaining `QU` becomes `K` (e.g., “QUANTO” -> “KANTO”).
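The digraph rules can be approximated with a chain of regular-expression substitutions. The sketch below is an assumption-laden illustration rather than the package's code: `simplify_digraphs()` is a hypothetical name, `SCH` is handled first so that it is consumed whole before the `CH` and `SH` rules run, and the `QÜ`/`GÜ` variants are omitted because the input is assumed to have had its diacritics removed already.

```r
# Hypothetical helper; not part of the metaphonebr API.
# Assumes cleaned, uppercase, diacritic-free input.
simplify_digraphs <- function(x) {
  x <- gsub("SCH", "X", x)                        # assumption: SCH before CH/SH
  x <- gsub("LH", "1", x)                         # palatal lateral
  x <- gsub("NH", "3", x)                         # palatal nasal
  x <- gsub("CH|SH", "X", x)                      # /ʃ/ sound
  x <- gsub("PH", "F", x)
  x <- gsub("SC(?=[EI])", "S", x, perl = TRUE)    # SC before E or I
  x <- gsub("SC(?=[AOU])", "SK", x, perl = TRUE)  # SC before A, O or U
  x <- gsub("QU(?=[EI])", "K", x, perl = TRUE)    # silent U after Q
  x <- gsub("GU(?=[EI])", "G", x, perl = TRUE)    # silent U after G
  gsub("QU", "K", x)                              # any remaining QU
}

simplify_digraphs(c("FILHA", "MANHA", "CHICO", "PHILIP",
                    "ESCOVA", "QUEIJO", "GUERRA", "QUANTO"))
# "FI1A" "MA3A" "XICO" "FILIP" "ESKOVA" "KEIJO" "GERRA" "KANTO"
```

The lookaheads (`(?=...)`) test the following vowel without consuming it, which is why `perl = TRUE` is needed on those calls.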
- Similar Consonant Simplification:
- `Ç` is replaced by `S`.
- `C` followed by `E` or `I` is replaced by `S` (like in “CELSO” -> “SELSO”).
- Any other `C` (not part of an already transformed digraph like CH or SC) is replaced by `K` (like in “CARLOS” -> “KARLOS”).
- `G` followed by `E` or `I` is replaced by `J` (like in “GELO” -> “JELO”; GUE/GUI already handled).
- Any remaining `Q` (that wasn’t part of QU) is replaced by `K`.
- `W` is replaced by `V` (common Brazilian Portuguese pronunciation, e.g., “WALTER” -> “VALTER”).
- `Y` is replaced by `I` (e.g., “YARA” -> “IARA”).
- `Z` is replaced by `S` (e.g., “ZEBRA” -> “SEBRA”).
- `X` preceded by `S` has the `X` removed (e.g., “EXCELENTE” -> “ESELENTE”, to avoid a double /s/ representation from `SKS`).
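Most of the consonant rules map directly onto substitutions, as in the sketch below. `simplify_consonants()` is a hypothetical name, and each example is run on this step in isolation rather than through a full pipeline. The rule for `X` next to `S` is left out on purpose, since the exact pattern it matches is not spelled out here.

```r
# Hypothetical helper; not part of the metaphonebr API.
# Assumes uppercase input; the X/S rule is deliberately omitted.
simplify_consonants <- function(x) {
  x <- gsub("\u00c7", "S", x)                  # Ç -> S (escape for Ç)
  x <- gsub("C(?=[EI])", "S", x, perl = TRUE)  # soft C before E or I
  x <- gsub("C", "K", x)                       # any other C
  x <- gsub("G(?=[EI])", "J", x, perl = TRUE)  # soft G before E or I
  x <- gsub("Q", "K", x)                       # leftover Q
  chartr("WYZ", "VIS", x)                      # W -> V, Y -> I, Z -> S
}

simplify_consonants(c("CELSO", "CARLOS", "GELO", "WALTER", "YARA", "ZEBRA"))
# "SELSO" "KARLOS" "JELO" "VALTER" "IARA" "SEBRA"
```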
- Terminal Nasal Sound Simplification:
- A word-final `N` is replaced by `M` (e.g., “JOAQUIN” -> “JOAQUIM”).
- A word-final `AO` is replaced by `OM` (e.g., “JOÃO” -> “JOOM”).
- A word-final `ÃES` is replaced by `AES` (e.g., “MÃES” -> “MAES”).
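A short sketch of the word-final nasal rules, run in isolation on uppercase, diacritic-free input. `simplify_final_nasals()` is a hypothetical name. The `ÃES` rule is omitted because, under that assumption, the tilde is already gone by this point.

```r
# Hypothetical helper; not part of the metaphonebr API.
simplify_final_nasals <- function(x) {
  # \\b after the pattern anchors it to the end of each word
  x <- gsub("N\\b", "M", x, perl = TRUE)    # word-final N -> M
  gsub("AO\\b", "OM", x, perl = TRUE)       # word-final AO -> OM
}

simplify_final_nasals(c("JOAQUIN", "JOAO", "ANA MARTIN"))
# "JOAQUIM" "JOOM" "ANA MARTIM"
```

The `N` in "ANA" is untouched because it is not at the end of a word.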
- Duplicate Vowel Removal:
- Sequences of identical adjacent vowels are reduced to a single vowel (e.g., “AARAO” -> “ARAO”).
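One way to express this rule is a backreference that matches a vowel followed by one or more copies of itself. `collapse_vowels()` is a hypothetical name used only for this sketch.

```r
# Hypothetical helper; not part of the metaphonebr API.
collapse_vowels <- function(x) {
  # ([AEIOU]) captures a vowel; \\1+ matches its immediate repeats
  gsub("([AEIOU])\\1+", "\\1", x, perl = TRUE)
}

collapse_vowels(c("AARAO", "JOOM", "MARIA"))
# "ARAO" "JOM" "MARIA"
```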
- Final Cleanup (Duplicate Letters & Spaces):
- Sequences of identical adjacent letters (except if they are part of the special codes `1` for LH or `3` for NH) are reduced to a single letter (e.g., “CARRO” might become “CARO”, “LESSA” becomes “LESA”). Note: this rule simplifies sounds like ‘RR’ and ‘SS’ to their single counterparts, which is a common Metaphone-style simplification.
- Ensure single spaces between any remaining words and trim leading/trailing whitespace again.
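The final cleanup can be sketched the same way. `final_cleanup()` is a hypothetical name. Restricting the character class to `A-Z` is one simple way to leave the digit codes `1` and `3` untouched, and the examples apply this step on its own.

```r
# Hypothetical helper; not part of the metaphonebr API.
final_cleanup <- function(x) {
  # Collapse repeated letters only; the codes 1 and 3 are digits,
  # so the [A-Z] class never matches them.
  x <- gsub("([A-Z])\\1+", "\\1", x, perl = TRUE)
  x <- gsub(" +", " ", x)   # single spaces between words
  trimws(x)                 # trim leading/trailing whitespace
}

final_cleanup(c("CARRO", " LESSA  SILVA ", "FI11A"))
# "CARO" "LESA SILVA" "FI11A"
```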
The resulting code attempts to represent the phonetic signature of a name in a simplified, standardized way for a Brazilian Portuguese context. In particular, by construction it preserves word-final vowels, since these generally carry gender information in Brazilian names (e.g., ADRIANO and ADRIANA).
Note
metaphonebr is developed by a team of researchers at Instituto de Pesquisa Econômica Aplicada (Ipea).