Custom 'MetaphoneBR' Phonetic Encoding for Brazilian Names.
metaphonebr
metaphonebr simplifies Brazilian names phonetically using a custom metaphoneBR algorithm that preserves word-final vowels. It was created to help pair records across datasets when no unambiguous key is available.
Installation
The package is being submitted to CRAN. Once it is accepted, the stable version can be installed with:
install.packages("metaphonebr")
You can install the development version of metaphonebr from GitHub with:
# install.packages("remotes")
remotes::install_github("ipeadata-lab/metaphonebr")
Example
This is a basic example which shows how to use the main function:
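# Names to encode, including spelling variants such as Maria/Marya
example_names <- c("João da Silva", "Maria", "Marya",
                   "Helena", "Elena", "Philippe", "Filipe", "Xavier", "Chavier")
# Compute one phonetic code per name
phonetic_codes <- metaphonebr::metaphonebr(example_names)
# Show each original name next to its code
print(data.frame(original = example_names, metaphonebr = phonetic_codes))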
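Since the package is meant to help with dataset pairing, one way to use the codes is to group spellings that share the same code. The sketch below is only an illustration: `group_by_code` is a hypothetical helper that is not part of metaphonebr, and the call to `metaphonebr::metaphonebr()` runs only if the package is installed.

```r
# Hypothetical helper (not part of the package): group names by phonetic code
# and keep only the codes shared by more than one distinct spelling.
group_by_code <- function(name_vec, codes) {
  stopifnot(length(name_vec) == length(codes))
  groups <- split(name_vec, codes)
  Filter(function(g) length(unique(g)) > 1, groups)
}

# Usage with the package, guarded so the sketch also runs without it.
if (requireNamespace("metaphonebr", quietly = TRUE)) {
  example_names <- c("Maria", "Marya", "Helena", "Elena", "Filipe", "Xavier")
  codes <- metaphonebr::metaphonebr(example_names)
  print(group_by_code(example_names, codes))
}
```

Each element of the returned list holds the spellings that collapsed to one code, which can then be reviewed as candidate matches.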
The metaphoneBR phonetic encoding algorithm proceeds as follows:
- Initial Cleanup & Preparation:
- Remove all diacritics (e.g., “João” becomes “Joao”).
- Convert the entire string to uppercase (e.g., “Joao” becomes “JOAO”).
- Remove all characters that are not uppercase letters (A-Z) or spaces.
- Ensure single spaces between words and trim leading/trailing whitespace.
- Silent Letter Removal:
- Remove a silent ‘H’ if it appears at the beginning of any word (e.g., “Helena” becomes “Elena”).
- Digraph Simplification (Sound Grouping):
- LH is replaced by 1 (representing a palatal lateral approximant, like in “Filha” -> “FI1A”).
- NH is replaced by 3 (representing a palatal nasal, like in “Manhã” -> “MA3A”).
- CH is replaced by X (representing the /ʃ/ sound, like in “Chico” -> “XICO”).
- SH is replaced by X (for foreign names with /ʃ/ sound, like in “Shirley” -> “XIRLEY”).
- SCH is replaced by X (approximating /ʃ/ or /sk/, like in “Schmidt” -> “XMIT”).
- PH is replaced by F (like in “Philip” -> “FILIP”).
- SC followed by E or I becomes S (like in “SCENA” -> “SENA”).
- SC followed by A, O, or U becomes SK (like in “ESCOVA” -> “ESKOVA”).
- QU or QÜ followed by E or I becomes K (e.g., “QUEIJO” -> “KEIJO”).
- GU or GÜ followed by E or I becomes G (the U is silent, e.g., “GUERRA” -> “GERRA”).
- Any remaining QU becomes K (e.g., “QUANTO” -> “KANTO”).
- Similar Consonant Simplification:
- Ç is replaced by S.
- C followed by E or I is replaced by S (like in “CELSO” -> “SELSO”).
- Any other C (not part of an already transformed digraph like CH or SC) is replaced by K (like in “CARLOS” -> “KARLOS”).
- G followed by E or I is replaced by J (like in “GELO” -> “JELO”; GUE/GUI already handled).
- Any remaining Q (that wasn’t part of QU) is replaced by K.
- W is replaced by V (common Brazilian Portuguese pronunciation, e.g., “WALTER” -> “VALTER”).
- Y is replaced by I (e.g., “YARA” -> “IARA”).
- Z is replaced by S (e.g., “ZEBRA” -> “SEBRA”).
- X immediately followed by S has the X removed (e.g., “EXCELENTE” -> “EXSELENTE” -> “ESELENTE”, to avoid a double /s/ representation from SKS).
- Terminal Nasal Sound Simplification:
- A word-final N is replaced by M (e.g., “JOAQUIN” -> “JOAQUIM”).
- A word-final AO is replaced by OM (e.g., “JOÃO” -> “JOOM”).
- A word-final ÃES is replaced by AES (e.g., “MÃES” -> “MAES”).
- Duplicate Vowel Removal:
- Sequences of identical adjacent vowels are reduced to a single vowel (e.g., “AARAO” -> “ARAO”).
- Final Cleanup (Duplicate Letters & Spaces):
- Sequences of identical adjacent letters (except if they are part of the special codes 1 for LH or 3 for NH) are reduced to a single letter (e.g., “CARRO” becomes “CARO” and “LESSA” becomes “LESA”). This rule simplifies sounds like ‘RR’ and ‘SS’ to their single counterparts, which is a common Metaphone-style simplification.
- Ensure single spaces between any remaining words and trim leading/trailing whitespace again.
The resulting code is an attempt to represent the phonetic signature of the name in a simplified, standardized way for a Brazilian Portuguese context. In particular, by construction it preserves ending vowels, since they generally carry gender information in Brazilian names (e.g., ADRIANO and ADRIANA).
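The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `metaphonebr_sketch` is a hypothetical name, and a few ordering choices are mine. Ç is mapped to S before diacritics are stripped (otherwise it would become a plain C), SCH is handled before CH and SH, soft G is handled before the silent U is dropped from GUE/GUI, and the X rule is read as removing an X that comes right before an S. Transliteration with `iconv` is platform dependent, so results for accented input may vary.

```r
# Hypothetical sketch of the steps described above; the real package may differ.
metaphonebr_sketch <- function(x) {
  # Assumption: handle cedilla first, since stripping diacritics would yield C.
  x <- gsub("[\u00E7\u00C7]", "S", x)
  # 1. Cleanup: strip diacritics (platform dependent), uppercase,
  #    keep only A-Z and spaces, normalise whitespace.
  x <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")
  x <- toupper(x)
  x <- gsub("[^A-Z ]", "", x)
  x <- trimws(gsub(" +", " ", x))
  # 2. Silent H at the beginning of any word.
  x <- gsub("\\bH", "", x, perl = TRUE)
  # 3. Digraphs (assumption: SCH before CH and SH so it is not split).
  x <- gsub("LH", "1", x, fixed = TRUE)
  x <- gsub("NH", "3", x, fixed = TRUE)
  x <- gsub("SCH", "X", x, fixed = TRUE)
  x <- gsub("CH", "X", x, fixed = TRUE)
  x <- gsub("SH", "X", x, fixed = TRUE)
  x <- gsub("PH", "F", x, fixed = TRUE)
  x <- gsub("SC([EI])", "S\\1", x)
  x <- gsub("SC([AOU])", "SK\\1", x)
  x <- gsub("QU([EI])", "K\\1", x)
  # Assumption: soft G first, so the G left by GUE/GUI is not turned into J.
  x <- gsub("G([EI])", "J\\1", x)
  x <- gsub("GU([EI])", "G\\1", x)
  x <- gsub("QU", "K", x, fixed = TRUE)
  # 4. Similar consonants.
  x <- gsub("C([EI])", "S\\1", x)
  x <- gsub("C", "K", x, fixed = TRUE)
  x <- chartr("QWYZ", "KVIS", x)
  x <- gsub("XS", "S", x, fixed = TRUE)
  # 5. Terminal nasals (word-final AES needs no change once diacritics are gone).
  x <- gsub("N\\b", "M", x, perl = TRUE)
  x <- gsub("AO\\b", "OM", x, perl = TRUE)
  # 6. Collapse repeated vowels, then 7. repeated letters (digits 1 and 3 untouched).
  x <- gsub("([AEIOU])\\1+", "\\1", x)
  x <- gsub("([A-Z])\\1+", "\\1", x)
  trimws(gsub(" +", " ", x))
}

# Spelling variants can then be compared by their codes.
variants <- c("Helena", "Elena", "Philippe", "Filipe", "Xavier", "Chavier")
print(data.frame(original = variants, code = metaphonebr_sketch(variants)))
```

Under these assumptions, pairs such as Helena/Elena and Xavier/Chavier collapse to the same code, while ADRIANO and ADRIANA stay distinct because the final vowel is kept.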
Note
metaphonebr is developed by a team of researchers at Instituto de Pesquisa Econômica Aplicada (Ipea).