wordpiece: R Implementation of Wordpiece Tokenization
Description
Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.
wordpiece
The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.
Installation
You can install the released version of wordpiece from CRAN with:
install.packages("wordpiece")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")
Examples
This package can be used to tokenize text for modeling. A common use case is to tokenize all of the text in a tibble or other data.frame.
library(wordpiece)
library(dplyr, warn.conflicts = FALSE)
df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>%
  mutate(
    tokens = wordpiece_tokenize(text)
  )
df_tokenized
#> # A tibble: 3 x 2
#>   text                                           tokens
#>   <chr>                                          <list>
#> 1 I like tacos.                                  <dbl [5]>
#> 2 I like apples with cheese.                     <dbl [6]>
#> 3 The unaffable coder wrote incorrect examples.  <dbl [10]>
df_tokenized$tokens[[1]]
#>    i like    ta ##cos    .
#> 1045 2066 11937 13186 1012
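The tokenizer can also be called directly on a character vector. The sketch below shows that usage, plus a hedged example of supplying a custom vocabulary: the vocab argument and the load_vocab() helper are assumptions based on the package documentation, and "path/to/vocab.txt" is a placeholder for a real vocabulary file (one token per line).
# Tokenize a character vector directly, using the default BERT-style vocabulary.
wordpiece_tokenize(c("I like tacos.", "I like apples with cheese."))

# Assumed usage: load a custom vocabulary file and pass it via the vocab argument.
# load_vocab() and the file path are illustrative placeholders, not verified here.
# custom_vocab <- load_vocab(vocab_file = "path/to/vocab.txt")
# wordpiece_tokenize("I like tacos.", vocab = custom_vocab)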
Code of Conduct
Please note that the wordpiece project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Disclaimer
This is not an officially supported Macmillan Learning product.
Contact information
Questions or comments should be directed to Jonathan Bratt ([email protected]) and Jon Harmon ([email protected]).