Extensible Data Pattern Searching Framework.
nuggets
Extensible R framework for subgroup discovery (Atzmueller (2015)), contrast patterns (Chen (2022)), emerging patterns (Dong (1999)) and association rules (Agrawal (1994)). Both crisp (binary) and fuzzy data are supported. The package generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. Currently, the package searches for implicative association rules and conditional correlations (Hájek (1978)). A user-defined function may also be supplied; it is evaluated on each generated condition, which allows searching for custom patterns.
Installation
To install the stable version of nuggets from CRAN, type the following command within an R session:
install.packages("nuggets")
You can also install the development version of nuggets from GitHub with:
install.packages("devtools")
devtools::install_github("beerda/nuggets")
Examples
Search for Implicative Rules
We start by loading the required packages:
library(tidyverse)
library(nuggets)
We are going to use the CO2 dataset as an example:
head(CO2)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled 95 16.0
#> 2 Qn1 Quebec nonchilled 175 30.4
#> 3 Qn1 Quebec nonchilled 250 34.8
#> 4 Qn1 Quebec nonchilled 350 37.2
#> 5 Qn1 Quebec nonchilled 500 35.3
#> 6 Qn1 Quebec nonchilled 675 39.2
First, the numeric columns need to be transformed to factors:
d <- mutate(CO2,
            conc = cut(conc, c(-Inf, 175, 350, 675, Inf)),
            uptake = cut(uptake, c(-Inf, 17.9, 28.3, 37.12)))
head(d)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled (-Inf,175] (-Inf,17.9]
#> 2 Qn1 Quebec nonchilled (-Inf,175] (28.3,37.1]
#> 3 Qn1 Quebec nonchilled (175,350] (28.3,37.1]
#> 4 Qn1 Quebec nonchilled (175,350] <NA>
#> 5 Qn1 Quebec nonchilled (350,675] (28.3,37.1]
#> 6 Qn1 Quebec nonchilled (350,675] <NA>
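The NA values in the uptake column arise because the last break is 37.12 rather than Inf, so larger values fall outside every interval. A minimal base-R sketch of this behaviour of cut(), using a few uptake values from the rows shown above:

```r
# Uptake values taken from the first rows of CO2 shown above
x <- c(16.0, 30.4, 37.2, 39.2)

# Same breaks as in the example: the last interval ends at 37.12
f <- cut(x, c(-Inf, 17.9, 28.3, 37.12))
print(f)

# Values above the largest break match no interval and become NA
print(is.na(f))
```

If such values should be kept, adding Inf as a final break would give them their own interval; the example here keeps the breaks as they are.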
Then every column can be dichotomized, i.e., dummy logical columns may be created for each factor level:
d <- dichotomize(d)
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
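To illustrate what dichotomization means, the following base-R sketch builds one logical column per factor level by hand. It only illustrates the idea and is not the implementation of dichotomize(); the helper name dichotomize_factor is hypothetical, and treating missing values as FALSE is an assumption of this sketch.

```r
# Hypothetical helper: one logical column per level of a factor,
# named "<name>=<level>" as in the output above
dichotomize_factor <- function(f, name) {
  # Assumption of this sketch: NA values yield FALSE in every column
  cols <- lapply(levels(f), function(l) !is.na(f) & f == l)
  out <- as.data.frame(do.call(cbind, cols))
  names(out) <- paste0(name, "=", levels(f))
  out
}

dummy <- dichotomize_factor(CO2$Treatment, "Treatment")
head(dummy, 3)
```

Each row has exactly one TRUE among the columns derived from the same factor, which is why such columns are never worth combining in a single condition.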
Before starting to search for the rules, it is a good idea to create the vector of disjoints. Columns with equal values in the disjoint vector will not be combined together. This speeds up the search, as it makes no sense, e.g., to combine Plant=Qn1 and Plant=Qn2 in a single condition.
disj <- sub("=.*", "", colnames(d))
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
Once the data are prepared, the dig_implications function may be invoked. It takes the dataset as its first parameter, followed by a pair of “tidyselect” expressions that select the columns allowed to appear on the left- and right-hand sides of the rule (antecedent and consequent).
result <- dig_implications(d,
                           antecedent = !starts_with("Treatment"),
                           consequent = starts_with("Treatment"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8)
result <- arrange(result, desc(support))
print(result)
#> # A tibble: 225 × 7
#> antecedent consequent support confidence coverage lift count
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=… {Treatmen… 0.155 0.813 0.190 1.63 16
#> 2 {Type=Mississippi,uptake=… {Treatmen… 0.119 1 0.119 2 10
#> 3 {Plant=Qn1} {Treatmen… 0.0833 1 0.0833 2 7
#> 4 {Plant=Qn2} {Treatmen… 0.0833 1 0.0833 2 7
#> 5 {Plant=Qn3} {Treatmen… 0.0833 1 0.0833 2 7
#> 6 {Plant=Qc1} {Treatmen… 0.0833 1 0.0833 2 7
#> 7 {Plant=Qc3} {Treatmen… 0.0833 1 0.0833 2 7
#> 8 {Plant=Qc2} {Treatmen… 0.0833 1 0.0833 2 7
#> 9 {Plant=Mn3} {Treatmen… 0.0833 1 0.0833 2 7
#> 10 {Plant=Mn2} {Treatmen… 0.0833 1 0.0833 2 7
#> # ℹ 215 more rows
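As a sanity check, the interest measures of a single rule can be recomputed by hand in base R. The sketch below does so for the rule {Plant=Qn1} => {Treatment=nonchilled}, assuming the usual definitions: support is the relative frequency of rows satisfying both sides, coverage is the relative frequency of the antecedent, confidence is support divided by coverage, and lift is confidence divided by the relative frequency of the consequent. The variable names are illustrative only.

```r
# Antecedent and consequent as logical vectors over the rows of CO2
ante <- CO2$Plant == "Qn1"
cons <- CO2$Treatment == "nonchilled"

coverage   <- mean(ante)          # share of rows matching the antecedent
support    <- mean(ante & cons)   # share of rows matching both sides
confidence <- support / coverage
lift       <- confidence / mean(cons)

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 4)
```

The values should agree with the row for {Plant=Qn1} in the table above (support and coverage 0.0833, confidence 1, lift 2).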
Custom Pattern Search
The nuggets package allows a user-defined callback function to be executed on each generated frequent condition. That way, custom types of patterns may be searched for. The following example replicates the search for implicative rules using a custom callback function. For that, the dataset has to be dichotomized and the disjoint vector created as in the previous example:
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
As we want to search for implicative rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules with the obtained condition as an antecedent:
min_support <- 0.02
min_confidence <- 0.8
f <- function(condition, support, foci_supports) {
    # confidence of each candidate rule with the condition as antecedent
    conf <- foci_supports / support
    # keep only the foci that meet both thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]
    # return one list per rule
    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = format_condition(names(conf)[[i]]),
             support = supp[[i]],
             confidence = conf[[i]])
    })
}
The callback function f() defines three arguments: condition, support and foci_supports. The names of the arguments are not arbitrary: based on the argument names of the callback function, the searching algorithm decides which information to pass to it. Here, condition is a named vector of indices representing the conjunction of predicates in a condition. By a predicate we mean a column of the source dataset. The support argument receives the relative frequency of the condition in the dataset. foci_supports is a vector of supports of special predicates, which we call “foci” (plural of “focus”), within the rows satisfying the condition. For implicative rules, foci are potential rule consequents.
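To make the roles of support and foci_supports concrete, the following base-R sketch applies the same filtering logic as f() to hand-written inputs. The numbers are illustrative: they are chosen to be consistent with the first rule in the earlier result (an antecedent covering 16 of 84 rows, 13 of which are chilled), and nothing here calls the nuggets package.

```r
min_support <- 0.02
min_confidence <- 0.8

# Hypothetical inputs, shaped like the callback arguments described above
support <- 16 / 84
foci_supports <- c("Treatment=nonchilled" = 3 / 84,
                   "Treatment=chilled" = 13 / 84)

# Confidence of each candidate rule: support of the rule / support of the condition
conf <- foci_supports / support

# Keep only the foci that meet both thresholds
keep <- !is.na(conf) & conf >= min_confidence & foci_supports >= min_support
conf[keep]
```

Only the chilled focus survives, with confidence 13/16 = 0.8125; the nonchilled focus passes the support threshold but fails the confidence threshold.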
Now we can run the search for rules:
result <- dig(d,
              f = f,
              condition = !starts_with("Treatment"),
              focus = starts_with("Treatment"),
              disjoint = disj,
              min_length = 1,
              min_support = min_support)
As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and bind it into a data frame:
result <- result %>%
    unlist(recursive = FALSE) %>%
    map(as_tibble) %>%
    do.call(rbind, .) %>%
    arrange(desc(support))
print(result)
#> # A tibble: 225 × 4
#> antecedent consequent support confidence
#> <chr> <chr> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=(-Inf,17.9]} {Treatment=chilled} 0.155 0.813
#> 2 {Type=Mississippi,uptake=(28.3,37.1]} {Treatment=nonchill… 0.119 1
#> 3 {Plant=Qn1} {Treatment=nonchill… 0.0833 1
#> 4 {Plant=Qn2} {Treatment=nonchill… 0.0833 1
#> 5 {Plant=Qn3} {Treatment=nonchill… 0.0833 1
#> 6 {Plant=Qc1} {Treatment=chilled} 0.0833 1
#> 7 {Plant=Qc3} {Treatment=chilled} 0.0833 1
#> 8 {Plant=Qc2} {Treatment=chilled} 0.0833 1
#> 9 {Plant=Mn3} {Treatment=nonchill… 0.0833 1
#> 10 {Plant=Mn2} {Treatment=nonchill… 0.0833 1
#> # ℹ 215 more rows
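The flattening step can be tried in isolation on a small hand-made list of lists. This base-R sketch assumes, for illustration, that the result has one list of rules per condition, as described above; the values are made up and the nuggets package is not called.

```r
# Made-up result: one inner list of rules per condition
res <- list(
  list(list(antecedent = "{a}", consequent = "{x}",
            support = 0.3, confidence = 0.9)),
  list(list(antecedent = "{b}", consequent = "{x}",
            support = 0.2, confidence = 1.0),
       list(antecedent = "{b}", consequent = "{y}",
            support = 0.1, confidence = 0.8))
)

# Drop the per-condition level, leaving a flat list of rules
flat <- unlist(res, recursive = FALSE)

# Turn each rule into a one-row data frame and stack the rows
df <- do.call(rbind, lapply(flat, as.data.frame))

# Sort by decreasing support, as in the example above
df <- df[order(-df$support), ]
print(df)
```

Using recursive = FALSE matters here: a fully recursive unlist() would collapse the rules into a single character vector instead of a list of rules.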