MyNixOS website logo
Description

R Implementation of Wordpiece Tokenization.

Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.

wordpiece

Lifecycle:experimental

The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.

Installation

You can install the released version of wordpiece from CRAN with:

install.packages("wordpiece")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")

Examples

This package can be used to tokenize text for modeling. A common usecase would be to tokenize all text in a data.frame or other tibble.

library(wordpiece)
library(dplyr, warn.conflicts = FALSE)
df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>% 
  mutate(
    tokens = wordpiece_tokenize(text)
  )

df_tokenized
#> # A tibble: 3 x 2
#>   text                                          tokens    
#>   <chr>                                         <list>    
#> 1 I like tacos.                                 <dbl [5]> 
#> 2 I like apples with cheese.                    <dbl [6]> 
#> 3 The unaffable coder wrote incorrect examples. <dbl [10]>
df_tokenized$tokens[[1]]
#>     i  like    ta ##cos     . 
#>  1045  2066 11937 13186  1012

Code of Conduct

Please note that the wordpiece project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Disclaimer

This is not an officially supported Macmillan Learning product.

Contact information

Questions or comments should be directed to Jonathan Bratt ([email protected]) and Jon Harmon ([email protected]).

Metadata

Version

2.1.3

License

Unknown

Platforms (75)

    Darwin
    FreeBSD 13
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd13
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd13
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows