Description

Multilevel Supervised Topic Models with Multiple Outcomes.

Fits latent Dirichlet allocation (LDA), supervised topic models, and multilevel supervised topic models for text data with multiple outcome variables. Core estimation routines are implemented in C++ using the 'Rcpp' ecosystem. For topic models, see Blei et al. (2003) <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>. For supervised topic models, see Blei and McAuliffe (2007) <https://papers.nips.cc/paper_files/paper/2007/hash/d56b9fc4b0f1be8871f5e1c40c0067e7-Abstract.html>.

mlstm

mlstm: Multilevel Supervised Topic Models with Multiple Outcomes in R


Overview

mlstm implements Multilevel Supervised Topic Models (MLSTM), a probabilistic framework for analyzing text data with multiple associated outcome variables.

Unlike standard supervised topic models, which assume a single response per document, MLSTM allows multiple outcomes per document and uses a hierarchical regression structure to share information across them.

The package provides efficient variational inference algorithms implemented in C++ via Rcpp, enabling scalable estimation for large text corpora.


Key Features

  • Multi-output supervised topic modeling
  • Hierarchical regression structure across outcomes
  • Variational Bayesian inference (fast and scalable)
  • Supports missing outcome values
  • C++ backend via Rcpp, parallelized with RcppParallel

Installation

# install.packages("remotes")
remotes::install_github("thimeno1993/mlstm")

Quick Example

Simulated corpus

library(mlstm)
set.seed(123)

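D <- 50   # number of documents
V <- 200  # vocabulary size
K <- 5    # number of topics

NZ_per_doc <- 20      # document-term entries drawn per document
NZ <- D * NZ_per_doc  # total number of entries in the corpus

# Triplet matrix: one row per document-term entry
count <- cbind(
  d = rep(0:(D - 1), each = NZ_per_doc),       # 0-based document index
  v = sample.int(V, NZ, replace = TRUE) - 1L,  # 0-based word index
  c = rpois(NZ, 3) + 1                         # token count (at least 1)
)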

Y <- cbind(
  y1 = rnorm(D),
  y2 = rnorm(D)
)

LDA

mod_lda <- run_lda_gibbs(
  count = count,
  K     = K,
  alpha = 0.1,
  beta  = 0.01,
  n_iter = 20,
  verbose = FALSE
)

str(mod_lda$theta)
str(mod_lda$phi)

Supervised Topic Model (STM)

y <- Y[, 1]

set_threads(2)

mod_stm <- run_stm_vi(
  count = count,
  y     = y,
  K     = K,
  alpha = 0.1,
  beta  = 0.01,
  max_iter = 50,
  min_iter = 10,
  verbose  = FALSE
)

y_hat <- ((mod_stm$nd / mod_stm$ndsum) %*% mod_stm$eta)[, 1]
cor(y, y_hat)
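
The prediction step above multiplies each document's normalized topic counts by the regression coefficients. The base-R sketch below reproduces that arithmetic on made-up numbers. The shapes are assumptions inferred from the expression above rather than from the package documentation: nd is taken to be a documents-by-topics count matrix, ndsum a vector of per-document totals, and eta one coefficient per topic.

```r
# Hypothetical stand-ins for mod_stm$nd, mod_stm$ndsum, and mod_stm$eta.
nd <- matrix(
  c(8, 2,
    5, 5,
    1, 9),
  nrow = 3, byrow = TRUE
)                      # 3 documents x 2 topics
ndsum <- rowSums(nd)   # tokens assigned per document
eta   <- c(1, -1)      # one regression coefficient per topic

# Dividing a matrix by a vector of length nrow(nd) recycles down the columns,
# so every row is divided by its own total.
theta_bar <- nd / ndsum

# Linear predictor: topic proportions times coefficients.
y_hat <- (theta_bar %*% eta)[, 1]
y_hat
```

Because each row of theta_bar sums to one, the prediction is a weighted average of the topic coefficients, with weights given by the document's topic proportions.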

Multi-output STM (MLSTM)

J <- ncol(Y)

mu      <- rep(0, K)
upsilon <- K + 2
Omega   <- diag(K)

mod_mlstm <- run_mlstm_vi(
  count  = count,
  Y      = Y,
  K      = K,
  alpha  = 0.1,
  beta   = 0.01,
  mu     = mu,
  upsilon = upsilon,
  Omega   = Omega,
  max_iter = 50,
  min_iter = 10,
  verbose  = FALSE
)

Y_hat <- ((mod_mlstm$nd / mod_mlstm$ndsum) %*% mod_mlstm$eta)
cor(Y, Y_hat)
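
Calling cor() on two matrices returns the full outcome-by-prediction correlation matrix, so the per-outcome fit sits on its diagonal. The sketch below shows one way to pull that out, using simulated stand-ins (Y_obs and Y_pred are hypothetical names, not package objects). The use = "pairwise.complete.obs" argument is included on the assumption that missing outcomes are stored as NA, which should be checked against the package documentation.

```r
set.seed(1)

# Hypothetical observed outcomes and noisy predictions: 50 documents, 2 outcomes.
Y_obs  <- cbind(y1 = rnorm(50), y2 = rnorm(50))
Y_pred <- Y_obs + matrix(rnorm(100, sd = 0.5), ncol = 2)

# cor() on two matrices gives every column-by-column correlation;
# the diagonal pairs each outcome with its own prediction.
fit_cor <- diag(cor(Y_obs, Y_pred, use = "pairwise.complete.obs"))
names(fit_cor) <- colnames(Y_obs)
fit_cor
```

The off-diagonal entries of the full matrix compare one outcome with the prediction for another, which is usually not the quantity of interest when assessing fit.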

Data Format

Each row of count represents one non-zero document-term entry.

column  description
d       document index (0-based)
v       word index (0-based)
c       token count
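
If the corpus starts out as a dense document-term matrix, it can be reshaped into this triplet layout with base R. The following is a minimal sketch on a toy matrix. The name dtm is illustrative, and sorting the rows by document is a convenience assumed here, not a documented requirement of the package.

```r
# Toy dense document-term matrix: 3 documents (rows) x 4 vocabulary terms (columns).
dtm <- matrix(
  c(2, 0, 1, 0,
    0, 0, 0, 3,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE
)

# Locate the non-zero cells; arr.ind = TRUE returns 1-based (row, col) pairs.
nz <- which(dtm > 0, arr.ind = TRUE)

# Order by document, then by word, so each document's entries are contiguous.
nz <- nz[order(nz[, "row"], nz[, "col"]), , drop = FALSE]

# Build the triplet matrix with 0-based document and word indices.
count <- cbind(
  d = nz[, "row"] - 1L,
  v = nz[, "col"] - 1L,
  c = dtm[nz]
)
rownames(count) <- NULL
count
```

Subtracting 1 from both index columns matters because R indexes from 1 while the count format described above is 0-based.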

Performance

  • Implemented in C++ via Rcpp
  • Parallelized with RcppParallel
  • Suitable for large text corpora with supervised outcomes

Documentation

  • pkgdown site: https://thimeno1993.github.io/mlstm

References

  • Himeno T, Yokouchi D (2023). “A Multi-Label Supervised Topic Model for Financial Market Analysis Using News (in Japanese).” JAFEE Journal, 21, 1–28.
  • Himeno T, Yokouchi D (2026). “mlstm: Multilevel Supervised Topic Models with Multiple Outcomes in R.” Under submission to Journal of Statistical Software.

Author

Tomoya Himeno


License

MIT License


Development

devtools::load_all()
devtools::test()
devtools::check()

Issues

Report bugs and feature requests at https://github.com/thimeno1993/mlstm/issues

Metadata

Version

0.1.7

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows