MyNixOS website logo
Description

Simple unsupervised segmentation model.

Implementation of the non-parametric segmentation model described in "Type-based MCMC" (Liang, Jordan, and Klein, 2010) and "A Bayesian framework for word segmentation Exploring the effects of context" (Goldwater, Griffiths, and Johnson, 2009).

haskseg

Compiling

First install Stack somewhere on your PATH. For example, for ~/.local/bin:

wget https://get.haskellstack.org/stable/linux-x86_64.tar.gz -O -|tar xpfz - -C /tmp
cp /tmp/stack-*/stack ~/.local/bin
rm -rf /tmp/stack-*

Then, while in the directory of this README file, run:

stack build

The first time this runs will take a while, 10 or 15 minutes, as it builds an entire Haskell environment from scratch. Subsequent compilations are very fast.

Running

Invoke the program using Stack. To see available sub-commands, run:

stack exec -- haskseg -h

To see detailed help, run e.g.:

stack exec -- haskseg train -h

To train on the included data set from Goldwater, Griffiths, and Johnson 2009, run:

time zcat data/br-old-gold.txt.gz | perl -pe '$_=~s/ /\<GOLD\>/g;' | stack exec -- haskseg train --stateFile model.gz --iterations 3 --goldString "<GOLD>"

Note here the spaces are being replaced by a special string, to indicate boundaries to calculate F-score on (not necessary, but a nice way to track progress). The model seems to converge after three iterations on this data set. This takes about three minutes and achieves F-score of 0.61, somewhat higher than reported in Liang, Jordan and Klein 2010 (why is unclear). To use the trained model to segment text, such as the training data set, run:

time zcat data/br-old-gold.txt.gz | perl -pe '$_=~s/ //g;' | stack exec -- haskseg segment --stateFile model.gz

Note that for this stage, spaces are simply removed, otherwise it treats whitespace as a static boundary (which may be what you want in other circumstances!). The model takes about 4 seconds to segment the 9790 "words", about 2500 per second, though this stage is embarassingly parallel. The output is in BPE format, i.e. with "@@" indicating the end of an internal morph.

Metadata

Version

0.1.0.3

Executables (1)

  • bin/haskseg

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows