Description

Text Tokenization using Byte Pair Encoding and Unigram Modelling.

Unsupervised text tokenizer which performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. The package also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

sentencepiece

This repository contains an R package which is an Rcpp wrapper around the sentencepiece C++ library.

  • sentencepiece is an unsupervised tokeniser which performs text tokenization using
    • Byte Pair Encoding
    • Unigrams
    • Words
    • Characters
  • It is based on the paper SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo, John Richardson)
  • The sentencepiece C++ code is available at https://github.com/google/sentencepiece
    • This package currently wraps release v0.1.96
  • This R package provides similar functionality to the R package https://github.com/bnosac/tokenizers.bpe

Features

The R package allows you to

  • build a Byte Pair Encoding (BPE), Unigram, Char or Word model
  • apply the model to encode text
  • apply the model to decode ids back to text
  • download pretrained sentencepiece models built on Wikipedia
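For a quick overview, a minimal end-to-end sketch of that workflow is shown below. It is a rough illustration only: my_text.txt is a hypothetical plain-text training file (one document per line) and vocab_size is illustrative; the detailed examples further down use real data.

library(sentencepiece)
## train a small BPE model on a plain-text file (my_text.txt is a placeholder)
model <- sentencepiece("my_text.txt", type = "bpe", vocab_size = 1000, model_dir = tempdir())
## encode text into ids and decode the ids back to text
ids <- sentencepiece_encode(model, x = "some text to tokenise", type = "ids")
sentencepiece_decode(model, ids)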

Installation

  • For regular users, install the package from your local CRAN mirror: install.packages("sentencepiece")
  • To install the development version of this package: remotes::install_github("bnosac/sentencepiece")

Have a look at the documentation of the functions

help(package = "sentencepiece")

Example of encoding / decoding with a pretrained model built on Wikipedia

library(sentencepiece)
dl    <- sentencepiece_download_model("English", vocab_size = 50000)
model <- sentencepiece_load_model(dl$file_model)
model
Sentencepiece model
  size of the vocabulary: 50000
  model stored at: C:/Users/Jan/Documents/R/win-library/3.5/sentencepiece/models/en.wiki.bpe.vs50000.model
txt <- c("Give me back my Money or I'll call the police.",
         "Talk to the hand because the face don't want to hear it any more.")
txt <- tolower(txt)
sentencepiece_encode(model, txt, type = "subwords")
[[1]]
 [1] "▁give"   "▁me"     "▁back"   "▁my"     "▁money"  "▁or"     "▁i"      "'"       "ll"      "▁call"   "▁the"    "▁police" "."      

[[2]]
 [1] "▁talk"    "▁to"      "▁the"     "▁hand"    "▁because" "▁the"     "▁face"    "▁don"     "'"        "t"        "▁want"    "▁to"      "▁hear"    "▁it"      "▁any"     "▁more"    "."
sentencepiece_encode(model, txt, type = "ids")
[[1]]
 [1]  3090   352   810  1241  2795   127   386 49937  1188   612     7  2142 49935

[[2]]
 [1]  4252    42     7  1197   936     7  3227  1616 49937 49915  4451    42  6800   107   756   407 49935
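The ids can be turned back into text with sentencepiece_decode. A short follow-up sketch reusing the model and txt objects from above (output omitted):

x <- sentencepiece_encode(model, txt, type = "ids")
sentencepiece_decode(model, x)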

Example of training

  • As an example, let's take some training data containing questions asked in the Belgian Parliament in 2017 and focus on the French text only.
library(tokenizers.bpe)
library(sentencepiece)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
writeLines(text = x$text, con = "traindata.txt")
  • Train a model on text data and inspect the vocabulary
model <- sentencepiece("traindata.txt", type = "bpe", coverage = 0.999, vocab_size = 5000, 
                       model_dir = getwd(), verbose = FALSE)
model
Sentencepiece model
  size of the vocabulary: 5000
  model stored at: sentencepiece.model
str(model$vocabulary)
'data.frame':	5000 obs. of  2 variables:
 $ id     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ subword: chr  "<unk>" "<s>" "</s>" "es" ...
  • Use the model to encode text
text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
sentencepiece_encode(model, x = text, type = "subwords")
[[1]]
 [1] "▁L"      "'"       "app"     "ar"      "tement"  "▁est"    "▁grand"  "▁"       "&"       "▁v"      "r"       "ai"      "ment"    "▁bien"   "▁situe"  "▁en"     "▁plein"  "▁centre"

[[2]]
 [1] "▁Pro"        "por"         "tion"        "▁de"         "▁femmes"     "▁dans"       "▁les"        "▁situations" "▁de"         "▁famille"    "▁mon"        "op"          "ar"          "ent"         "ale"         "." 
sentencepiece_encode(model, x = text, type = "ids")
[[1]]
 [1]   75 4951  252   31  461  109  960 4934    0   49 4941   34   32  585 4225   44 3356 1915

[[2]]
 [1] 1362 4159   25    9 2060   93   40 3825    9 2923  705  247   31   19  116 4953
  • Use the model to decode byte pair encodings back to text
x <- sentencepiece_encode(model, x = text, type = "ids")
sentencepiece_decode(model, x)
[[1]]
[1] "L'appartement est grand  ⁇  vraiment bien situe en plein centre"

[[2]]
[1] "Proportion de femmes dans les situations de famille monoparentale."

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be.

Metadata

Version

0.2.3

License

Unknown

Platforms (75)

  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows