Description

Multivariate Outlier Detection and Replacement.

Provides a random forest based implementation of the method described in Chapter 7.1.2 (Regression model based anomaly detection) of Chandola et al. (2009) <doi:10.1145/1541880.1541882>. It works as follows: each numeric variable is regressed onto all other variables by a random forest. If the scaled absolute difference between the observed value and the out-of-bag prediction of the corresponding random forest is suspiciously large, the value is considered an outlier. The package offers different options to replace such outliers, e.g., by realistic values found via predictive mean matching. Once the method is trained on reference data, it can be applied to new data.

{outForest}


Overview

{outForest} is a multivariate anomaly detection method. Each numeric variable is regressed onto all other variables using a random forest. If the scaled absolute difference between the observed value and the out-of-bag prediction is larger than a prespecified threshold, the value is considered an outlier. Identified outliers can then be replaced, e.g., by predictive mean matching from the non-outliers.

The method can be viewed as a multivariate extension of a basic univariate outlier detection method, in which a value is considered an outlier if it deviates from the mean by more than, say, three times the standard deviation. In the multivariate case, a value is compared with its conditional mean given the other variables rather than with the overall mean. {outForest} estimates this conditional mean by a random forest.
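The contrast between the two rules can be sketched in base R. This is only an illustration under stated assumptions: `flag_univariate` and `flag_conditional` are hypothetical helpers that are not part of {outForest}, and `lm()` is used as a simple stand-in for the random forest, with in-sample residuals instead of out-of-bag predictions.

```r
# Univariate rule: flag values more than `k` standard deviations from the mean.
# Hypothetical helper, not part of {outForest}.
flag_univariate <- function(x, k = 3) {
  abs(x - mean(x)) > k * sd(x)
}

# Conditional variant: compare each value with a model-based prediction
# instead of the overall mean. lm() is a stand-in for the random forest.
flag_conditional <- function(formula, data, k = 3) {
  fit <- lm(formula, data = data)
  res <- residuals(fit)
  # Scale residuals by their root mean squared error
  abs(res) > k * sqrt(mean(res^2))
}

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- 2 * x + rnorm(200, sd = 0.5)
y[100] <- y[100] + 6  # unusual given x, but well within the overall range of y

dat <- data.frame(x = x, y = y)
which(flag_univariate(dat$y))        # the shifted point is not flagged
which(flag_conditional(y ~ x, dat))  # the shifted point (index 100) is flagged
```

The shifted value looks ordinary on its own, so the univariate rule misses it. It only stands out once the relationship with `x` is taken into account.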

Once the method is trained on a reference data set, it can be applied to new data.
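The replacement step mentioned above, predictive mean matching, can be illustrated with a minimal base R sketch. The helper `pmm_replace` and its arguments are hypothetical and do not reflect how {outForest} implements the step internally. The idea shown is the general one: for each outlier, find the non-outliers with the most similar predictions and borrow one of their observed values.

```r
# Hypothetical helper illustrating predictive mean matching (PMM);
# this is not the {outForest} implementation.
# pred_out:    predictions for the values flagged as outliers
# pred_donors: predictions for the non-outliers (the donor pool)
# obs_donors:  observed values of the non-outliers
pmm_replace <- function(pred_out, pred_donors, obs_donors, k = 3) {
  vapply(pred_out, function(p) {
    # Indices of the k donors whose predictions are closest to p
    nearest <- order(abs(pred_donors - p))[seq_len(min(k, length(pred_donors)))]
    # Draw one of them at random and return its observed value
    obs_donors[nearest[sample.int(length(nearest), 1)]]
  }, numeric(1))
}

set.seed(1)
pred_donors <- c(1.0, 2.0, 3.0, 4.0, 5.0)
obs_donors  <- c(1.1, 1.9, 3.2, 3.8, 5.3)
res <- pmm_replace(pred_out = c(2.1, 4.6), pred_donors, obs_donors, k = 2)
res
```

Because every replacement is an observed value taken from a non-outlier, the result always lies within the range of values actually seen in the data.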

Installation

# From CRAN
install.packages("outForest")

# Development version
devtools::install_github("mayer79/outForest")

Usage

We first generate a data set with about 2% outlier values in each numeric column. Then, we try to identify them.

library(outForest)
set.seed(3)

# Generate data with outliers in numeric columns
head(irisWithOutliers <- generateOutliers(iris, p = 0.02))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1    3.500000          1.4         0.2  setosa
#          4.9    3.000000          1.4         0.2  setosa
#          4.7    3.200000          1.3         0.2  setosa
#          4.6    3.100000          1.5         0.2  setosa
#          5.0   -3.744405          1.4         0.2  setosa
#          5.4    3.900000          1.7         0.4  setosa
 
# Find outliers by random forest regressions and replace them by predictive mean matching
(out <- outForest(irisWithOutliers, allow_predictions = TRUE))

# Plot the number of outliers per numeric variable
plot(out)

# Information on outliers
head(outliers(out))

# row          col  observed predicted      rmse     score threshold replacement
#   5  Sepal.Width -3.744405  3.298493 0.7810172 -9.017596         3         2.8
#  20 Sepal.Length 10.164017  5.141093 0.6750468  7.440852         3         5.4
# 138  Petal.Width  4.721186  2.113464 0.3712539  7.024092         3         2.1
#  68  Petal.Width -1.188913  1.305339 0.3712539 -6.718452         3         1.2
# 137  Sepal.Width  8.054524  2.861445 0.7810172  6.649122         3         2.9
#  15 Petal.Length  6.885277  1.875646 0.7767877  6.449163         3         1.3

# Resulting data set with replaced outliers
head(Data(out))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1         3.5          1.4         0.2  setosa
#          4.9         3.0          1.4         0.2  setosa
#          4.7         3.2          1.3         0.2  setosa
#          4.6         3.1          1.5         0.2  setosa
#          5.0         2.8          1.4         0.2  setosa
#          5.4         3.9          1.7         0.4  setosa

# Out-of-sample application
iris1 <- iris[1, ]
iris1$Sepal.Length <- -1
pred <- predict(out, newdata = iris1)

# Did we find the outlier?
outliers(pred)

# row          col observed predicted      rmse    score threshold replacement
#   1 Sepal.Length       -1  4.960069 0.6750468 -8.82912         3         6.4

# Fixed data
Data(pred)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          6.4         3.5          1.4         0.2  setosa
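
The `score` column in the outlier tables above appears to be the difference between observed and predicted value divided by `rmse`. This definition is an assumption inferred from the printed numbers, not taken from the package documentation. A quick check in base R, using the first three rows of the `outliers(out)` output:

```r
# Values copied from the first three rows of the outliers(out) table above
observed  <- c(-3.744405, 10.164017, 4.721186)
predicted <- c( 3.298493,  5.141093, 2.113464)
rmse      <- c( 0.7810172, 0.6750468, 0.3712539)
threshold <- 3

# Assumed definition, inferred from the printed output
score <- (observed - predicted) / rmse
round(score, 3)          # -9.018  7.441  7.024, matching the score column up to rounding
abs(score) > threshold   # TRUE for all three rows
```

Under this reading, a `threshold` of 3 plays the same role as the "three times the standard deviation" rule described in the overview, applied to prediction errors instead of deviations from the mean.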
Metadata

Version

1.0.1

License

Unknown

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows