Description

Partial Least Squares Regression Models with Big Matrices.

Fast partial least squares (PLS) for dense and out-of-core data. Provides SIMPLS (straightforward implementation of a statistically inspired modification of the PLS method) and NIPALS (non-linear iterative partial least squares) solvers, plus kernel-style PLS variants ('kernelpls' and 'widekernelpls') with parity to 'pls'. Optimized for 'bigmemory'-backed matrices with streamed cross-products and chunked BLAS (Basic Linear Algebra Subprograms) operations (XtX/XtY and XXt/YX), optional file-backed score sinks, and deterministic testing helpers. Includes an auto-selection strategy that chooses between XtX SIMPLS, XXt (wide) SIMPLS, and NIPALS based on (n, p) and a configurable memory budget.

For the package itself, see Bertrand and Maumy (2023) <https://hal.science/hal-05352069> and <https://hal.science/hal-05352061>, which highlight fitting and cross-validating PLS regression models on big data. For more details about some of the techniques featured in the package, see Dayal and MacGregor (1997) <doi:10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23>, Rosipal & Trejo (2001) <https://www.jmlr.org/papers/v2/rosipal01a.html>, Tenenhaus, Viennet, and Saporta (2007) <doi:10.1016/j.csda.2007.01.004>, Rosipal (2004) <doi:10.1007/978-3-540-45167-9_17>, Rosipal (2019) <https://ieeexplore.ieee.org/document/8616346>, and Song, Wang, and Bai (2024) <doi:10.1016/j.chemolab.2024.105238>.

The package also includes kernel logistic PLS with 'C++'-accelerated alternating iteratively reweighted least squares (IRLS) updates, streamed reproducing kernel Hilbert space (RKHS) solvers with reusable centering statistics, and bootstrap diagnostics with graphical summaries for coefficients, scores, and cross-validation workflows, alongside dedicated plotting utilities for individuals, variables, ellipses, and biplots.

The streaming backend uses far less memory than the dense in-memory solvers and keeps memory use bounded as the data grow. For PLS1, streaming is often fast enough while preserving a small memory footprint; for PLS2 it remains competitive with a bounded footprint. On small problems that fit comfortably in RAM (random-access memory), dense in-memory solvers are slightly faster; the crossover occurs as n or p grow and the Gram/cross-product cost dominates.

bigPLSR, PLS Regression Models with Big Matrices

Frédéric Bertrand and Myriam Maumy

bigPLSR provides fast, scalable Partial Least Squares (PLS) with two execution backends:

  • Dense (backend = "arma"): in-memory Armadillo/BLAS for speed.
  • Big-matrix (backend = "bigmem"): chunked streaming over bigmemory::big.matrix for large data.

Both PLS1 (single response) and PLS2 (multi-response) are supported. PLS2 uses SIMPLS on cross-products in both backends for numerical parity.

Recent updates bring additional solvers and tooling:

  • Kernel PLS and wide kernel PLS are available alongside SIMPLS/NIPALS.
  • Plot helpers now include unit circles, loading arrows and VIP bar charts.
  • New wrappers simplify prediction, information-criteria based component selection, cross-validation and bootstrapping workflows.
  • Kalman-filter PLS and double RKHS solvers extend the modelling toolkit to streaming and dual-kernel settings.
  • Cross-validation and bootstrap helpers optionally run in parallel via the future ecosystem.
  • Fresh vignettes cover RKHS usage, Kalman-filter state management and the plotting utilities showcased below.

The package is set up to be CRAN-friendly: the optional CBLAS fast path is off by default.

Support for parallel computation and GPU is being developed.

This website and these examples were created by F. Bertrand and M. Maumy.

Installation

You can install the released version of bigPLSR from CRAN with:

install.packages("bigPLSR")

You can install the development version of bigPLSR from GitHub with:

devtools::install_github("fbertran/bigPLSR")

Quick start

library(bigPLSR)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n*p), n, p)
y <- X[,1]*2 - X[,2] + rnorm(n)

# Dense PLS1 (fast)
fit <- pls_fit(X, y, ncomp = 3, backend = "arma", scores = "r")
str(list(
  coef=dim(fit$coefficients),
  scores=dim(fit$scores),
  ncomp=fit$ncomp
))

Big-matrix PLS1 with file-backed scores

options_val_before <- options("bigmemory.allow.dimnames")
options(bigmemory.allow.dimnames=TRUE)

bmX <- bigmemory::as.big.matrix(X)
bmy <- bigmemory::as.big.matrix(matrix(y, n, 1))

tmp <- tempdir()
# Remove stale backing files from a previous run (unlink() ignores missing files)
unlink(file.path(tmp, c("scores.desc", "scores.bin")))
sink <- bigmemory::filebacked.big.matrix(
  nrow=n, ncol=3, type="double",
  backingfile="scores.bin",
  backingpath=tmp,
  descriptorfile="scores.desc"
)

fit_b <- pls_fit(
  bmX, bmy, ncomp=3, backend="bigmem", scores="big",
  scores_target="existing", scores_bm=sink,
  scores_colnames = c("t1","t2","t3"),
  return_scores_descriptor = TRUE
)

fit_b$scores_descriptor  # big.matrix.descriptor
options(options_val_before)  # restore the saved setting (options() returned a named list)

PLS2 (multi-response)

set.seed(2)
m <- 3
B <- matrix(rnorm(p*m), p, m)
Y <- scale(X, scale = FALSE) %*% B + matrix(rnorm(n*m, sd = 0.1), n, m)

# Dense PLS2 – SIMPLS on cross-products (parity with bigmem)
fit2 <- pls_fit(X, Y, ncomp = 2, backend = "arma", mode = "pls2", scores = "none")
str(list(coef=dim(fit2$coefficients), ncomp=fit2$ncomp))

API

pls_fit(
  X, y, ncomp,
  tol = 1e-8,
  backend = c("auto", "arma", "bigmem"),
  scores  = c("none", "r", "big"),
  chunk_size = 10000L,
  scores_name = "scores",
  mode = c("auto","pls1","pls2"),
  scores_target = c("auto","new","existing"),
  scores_bm = NULL,
  scores_backingfile = NULL,
  scores_backingpath = NULL,
  scores_descriptorfile = NULL,
  scores_colnames = NULL,
  return_scores_descriptor = FALSE
)

Auto selection

  • backend = "auto" → "bigmem" when X is a big.matrix (or descriptor), else "arma".
  • mode = "auto" → "pls1" when y is one column, else "pls2".
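
The two auto rules can be mirrored in a few lines of base R. This is only an illustrative sketch: resolve_auto() is a hypothetical helper, not a bigPLSR function, and it assumes the class names that bigmemory gives its matrices and descriptors ("big.matrix" and "big.matrix.descriptor").

```r
# Hypothetical helper mirroring the documented auto rules (not the package's code).
resolve_auto <- function(X, y) {
  # Assumption: big matrices and their descriptors carry these bigmemory classes.
  is_big <- inherits(X, c("big.matrix", "big.matrix.descriptor"))
  backend <- if (is_big) "bigmem" else "arma"

  # A plain vector counts as a single response column.
  n_resp <- if (is.null(dim(y))) 1L else ncol(y)
  mode <- if (n_resp == 1L) "pls1" else "pls2"

  list(backend = backend, mode = mode)
}

X <- matrix(rnorm(20), 10, 2)
res1 <- resolve_auto(X, rnorm(10))                   # vector response
res2 <- resolve_auto(X, matrix(rnorm(30), 10, 3))    # three responses
```

Passing backend or mode explicitly to pls_fit() bypasses this kind of resolution.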

Return values

  • PLS1: coefficients (p), intercept (scalar), x_weights, x_loadings, y_loadings, scores (optional), x_means, y_mean, ncomp.
  • PLS2: coefficients (p×m), intercept (length m), x_weights (p×ncomp), x_loadings (p×ncomp), y_loadings (m×ncomp), scores (optional), x_means, y_means, ncomp.
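
As a rough illustration of how these fields fit together, predictions can be formed as intercept + X %*% coefficients. The sketch below assumes the coefficients are expressed on the original (uncentered) X scale and that the intercept equals y_mean minus x_means times the coefficients, which is the usual convention for centered PLS but is not confirmed here. predict_from_fit() is a hypothetical helper, and the "fit" is a mock list built by least squares only to exercise it.

```r
# Hypothetical helper: predict from a list holding `coefficients` and `intercept`.
predict_from_fit <- function(fit, newX) {
  B <- as.matrix(fit$coefficients)            # p x m (m = 1 for PLS1)
  sweep(newX %*% B, 2, fit$intercept, "+")    # add one intercept per response
}

# Mock fit (NOT produced by pls_fit): exact linear data, least-squares slope.
set.seed(1)
X <- matrix(rnorm(60), 20, 3)
y <- drop(X %*% c(2, -1, 0.5)) + 3
beta <- qr.solve(scale(X, scale = FALSE), y - mean(y))
mock <- list(
  coefficients = beta,
  intercept = mean(y) - sum(colMeans(X) * beta)   # assumed intercept convention
)
pred <- predict_from_fit(mock, X)
```

The package's own prediction wrappers should be preferred in practice; the helper only makes the algebra explicit.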

Backends & algorithms

Dense path (backend = "arma")

  • PLS1: fast dense solver (BLAS).
  • PLS2: SIMPLS on cross-products
    1. Center X, Y, build XtX = XcᵀXc, XtY = XcᵀYc.
    2. Cholesky-whitened symmetric eigen solve; enforce symmetry and add tiny ridge to stabilize.
    3. Optional scores: T = Xc %*% W.
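
To make the cross-product formulation concrete, here is a textbook SIMPLS recursion in base R that works only from XtX and XtY. It is a sketch, not the package's solver: it takes the dominant direction from an SVD rather than the Cholesky-whitened eigen solve described above, it adds no ridge, and simpls_xprod() is a hypothetical name.

```r
# Textbook SIMPLS on centered cross-products (illustrative only).
simpls_xprod <- function(XtX, XtY, ncomp) {
  p <- nrow(XtX); m <- ncol(XtY)
  R <- matrix(0, p, ncomp)   # x weights
  P <- matrix(0, p, ncomp)   # x loadings
  Q <- matrix(0, m, ncomp)   # y loadings
  V <- matrix(0, p, ncomp)   # orthonormal basis of the x loadings
  S <- XtY
  for (a in seq_len(ncomp)) {
    q <- svd(S, nu = 0, nv = 1)$v[, 1]        # dominant right singular vector
    r <- drop(S %*% q)
    r <- r / sqrt(sum(r * drop(XtX %*% r)))   # scale so t = Xc r has unit norm
    pa <- drop(XtX %*% r)                     # Xc' t
    qa <- drop(crossprod(XtY, r))             # Yc' t
    v <- pa
    if (a > 1) {                              # Gram-Schmidt against earlier loadings
      Vp <- V[, seq_len(a - 1), drop = FALSE]
      v <- v - drop(Vp %*% crossprod(Vp, v))
    }
    v <- v / sqrt(sum(v^2))
    S <- S - outer(v, drop(crossprod(S, v)))  # deflate the cross-product
    R[, a] <- r; P[, a] <- pa; Q[, a] <- qa; V[, a] <- v
  }
  list(x_weights = R, x_loadings = P, y_loadings = Q,
       coefficients = R %*% t(Q))
}

set.seed(42)
X <- matrix(rnorm(100 * 4), 100, 4)
Y <- X %*% matrix(rnorm(8), 4, 2) + matrix(rnorm(200, sd = 0.1), 100, 2)
Xc <- scale(X, scale = FALSE); Yc <- scale(Y, scale = FALSE)
XtX <- crossprod(Xc); XtY <- crossprod(Xc, Yc)
XtX <- 0.5 * (XtX + t(XtX))                   # enforce symmetry
fit <- simpls_xprod(XtX, XtY, ncomp = 4)
T_scores <- Xc %*% fit$x_weights              # optional scores: T = Xc %*% W
```

With as many components as predictors, the coefficients coincide with ordinary least squares on the centered data, which is a convenient sanity check for any SIMPLS variant.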

Big-matrix path (backend = "bigmem")

  • Chunked I/O from big.matrix with preallocated buffers.
  • PLS1: streaming cross-products and deflation; optional scores streamed chunk-wise into a sink.
  • PLS2: chunked cross-products (XtX += BᵀB, XtY += BᵀY) + the same SIMPLS solver for parity; optional score streaming: T = (X − μ) %*% W.

Both paths enforce symmetry (0.5*(M+Mᵀ)) before eigen and use a small ridge on XtX for stability.
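
The chunked accumulation can be sketched in base R as follows. This is an assumption-laden illustration rather than the package's C++ code: it uses ordinary matrices, accumulates uncentered products together with column sums, and applies the centering correction XcᵀXc = XᵀX − n μμᵀ at the end. stream_crossprods() is a hypothetical name.

```r
# Accumulate centered cross-products one row block at a time (illustrative only).
stream_crossprods <- function(X, Y, chunk_size = 1000L) {
  n <- nrow(X); p <- ncol(X); m <- ncol(Y)
  XtX <- matrix(0, p, p); XtY <- matrix(0, p, m)
  sx <- numeric(p); sy <- numeric(m)
  for (start in seq(1L, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, n)
    Bx <- X[idx, , drop = FALSE]              # only this block is held in memory
    By <- Y[idx, , drop = FALSE]
    XtX <- XtX + crossprod(Bx)                # XtX += B'B
    XtY <- XtY + crossprod(Bx, By)            # XtY += B'Y
    sx <- sx + colSums(Bx); sy <- sy + colSums(By)
  }
  mu_x <- sx / n; mu_y <- sy / n
  XtX <- XtX - n * tcrossprod(mu_x)           # remove the mean contribution
  XtY <- XtY - n * tcrossprod(mu_x, mu_y)
  XtX <- 0.5 * (XtX + t(XtX))                 # enforce symmetry
  list(XtX = XtX, XtY = XtY, x_means = mu_x, y_means = mu_y)
}

set.seed(3)
X <- matrix(rnorm(250 * 5), 250, 5)
Y <- matrix(rnorm(250 * 2), 250, 2)
res <- stream_crossprods(X, Y, chunk_size = 64L)
Xc <- scale(X, scale = FALSE); Yc <- scale(Y, scale = FALSE)
```

The result matches crossprod(Xc) and crossprod(Xc, Yc) up to rounding, while peak memory depends on chunk_size and p rather than on n.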


Scores, sinks, and descriptors

  • scores = "none" – don’t compute scores.
  • scores = "r" – return an in-memory matrix.
  • scores = "big" – write to a big.matrix:
    • Provide a sink via scores_target = "existing" + scores_bm (big.matrix or descriptor), or
    • Let the function create file-backed storage via scores_backingfile (+ optional scores_backingpath, scores_descriptorfile).
  • scores_colnames – set column names of the scores.
  • return_scores_descriptor = TRUE – adds fit$scores_descriptor when scores is a big.matrix.

Determinism (tests & reproducibility)

For tight parity tests, force a single BLAS thread and fix the RNG seed:

set.seed(1)
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(1L)
} else {
  # Use env vars before BLAS loads in the session
  Sys.setenv(
    OMP_NUM_THREADS="1",
    OPENBLAS_NUM_THREADS="1",
    MKL_NUM_THREADS="1",
    VECLIB_MAXIMUM_THREADS="1",
    BLIS_NUM_THREADS="1"
  )
}

Performance tuning

  • chunk_size: default 10000L. On Apple Silicon, internal default is larger (e.g., 16384) when chunk_size == 0. Tune per dataset for best GEMM throughput.
  • Scores streaming: with scores="big", streaming avoids holding T fully in RAM.
  • Multi-thread BLAS: for production, allow multi-thread BLAS; for tests, use 1 thread.
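
One simple way to pick a starting chunk_size is to bound the memory taken by the chunk buffers. The helper below is a hypothetical back-of-the-envelope calculation, assuming one double-precision buffer of chunk_size × p values for X and one of chunk_size × m values for Y at 8 bytes per value; the package's actual buffers may differ, so treat the result as a starting point for benchmarking.

```r
# Hypothetical helper: largest row count whose X and Y chunk buffers
# (double precision, 8 bytes per value) fit within `budget_bytes`.
chunk_rows_for_budget <- function(p, m = 1L, budget_bytes = 64 * 1024^2) {
  rows <- floor(budget_bytes / (8 * (p + m)))
  max(1L, as.integer(rows))   # never return fewer than one row
}

rows_small <- chunk_rows_for_budget(p = 999, m = 1, budget_bytes = 8e6)
rows_tiny  <- chunk_rows_for_budget(p = 1e6, m = 1, budget_bytes = 1024)
```

The returned value can be passed as chunk_size to pls_fit(); the throughput-optimal size also depends on the BLAS in use.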

Optional CBLAS fast path (in-place GEMM)

Default: OFF (CRAN-safe).
An optional in-place accumulation (a true beta = 1 CBLAS dgemm) is available and guarded by compile-time checks. When not available or not enabled, the package falls back automatically to the portable Armadillo path.

Enable locally (Unix/macOS):

R CMD INSTALL . --configure-vars="PKG_CPPFLAGS='-DBIGPLSR_USE_CBLAS'"

In src/Makevars, link to the same BLAS/LAPACK that R uses:

PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

  • macOS: the code attempts <vecLib/cblas.h>; on Linux/others: <cblas.h>.
  • If headers aren’t present, the build silently falls back to the portable GEMM path.
  • Do not hardcode -lopenblas or -framework Accelerate; use R’s variables.

Windows: leave the macro off unless you’ve explicitly provided CBLAS headers/libs.


Development

  • Unit tests compare dense vs big-matrix backends for both PLS1/PLS2 with tight tolerances.
  • Vignettes and examples keep datasets small; file-backed output uses tempdir().

Citation

If you use bigPLSR in academic work, please cite this package and the relevant PLS method used.


Metadata

Version

0.7.2

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows