Filter Based Feature Selection for 'mlr3'.
mlr3filters
Package website: release | dev
{mlr3filters} adds feature selection filters to mlr3. The implemented filters can be used stand-alone, or as part of a machine learning pipeline in combination with mlr3pipelines and the filter operator.
Wrapper methods for feature selection are implemented in mlr3fselect. Learners which support the extraction of feature importance scores can be combined with a filter from this package for embedded feature selection.
Installation
CRAN version
install.packages("mlr3filters")
Development version
remotes::install_github("mlr-org/mlr3filters")
Filters
Filter Example
set.seed(1)
library("mlr3")
library("mlr3filters")
task = tsk("sonar")
filter = flt("auc")
head(as.data.table(filter$calculate(task)))
## feature score
## 1: V11 0.2811368
## 2: V12 0.2429182
## 3: V10 0.2327018
## 4: V49 0.2312622
## 5: V9 0.2308442
## 6: V48 0.2062784
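For stand-alone use, the ranking can be applied directly to a task. The following is a minimal sketch, assuming the five best-scoring features are wanted (the cutoff of 5 is an arbitrary choice for illustration). It relies on `filter$scores`, a named numeric vector sorted in decreasing order, and on `task$select()`.

```r
library("mlr3")
library("mlr3filters")

task = tsk("sonar")
filter = flt("auc")
filter$calculate(task)

# names of the five best-scoring features (cutoff chosen for illustration)
top_features = names(head(filter$scores, 5))

# restrict the task to these features; this modifies `task` in place
task$select(top_features)
print(task$feature_names)
```

Because `task$select()` changes the task by reference, clone the task first (`task$clone()`) if the full feature set is still needed afterwards.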
Implemented Filters
Name | Label | Task Types | Feature Types | Package |
---|---|---|---|---|
anova | ANOVA F-Test | Classif | Integer, Numeric | stats |
auc | Area Under the ROC Curve Score | Classif | Integer, Numeric | mlr3measures |
carscore | Correlation-Adjusted coRrelation Score | Regr | Logical, Integer, Numeric | care |
carsurvscore | Correlation-Adjusted coRrelation Survival Score | Surv | Integer, Numeric | carSurv, mlr3proba |
cmim | Minimal Conditional Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
correlation | Correlation | Regr | Integer, Numeric | stats |
disr | Double Input Symmetrical Relevance | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
find_correlation | Correlation-based Score | Universal | Integer, Numeric | stats |
importance | Importance Score | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
information_gain | Information Gain | Classif & Regr | Integer, Numeric, Factor, Ordered | FSelectorRcpp |
jmi | Joint Mutual Information | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
jmim | Minimal Joint Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
kruskal_test | Kruskal-Wallis Test | Classif | Integer, Numeric | stats |
mim | Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
mrmr | Minimum Redundancy Maximal Relevancy | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
njmim | Minimal Normalised Joint Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
performance | Predictive Performance | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
permutation | Permutation Score | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
relief | RELIEF | Classif & Regr | Integer, Numeric, Factor, Ordered | FSelectorRcpp |
selected_features | Embedded Feature Selection | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
univariate_cox | Univariate Cox Survival Score | Surv | Integer, Numeric, Logical | survival |
variance | Variance | Universal | Integer, Numeric | stats |
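The table above can also be inspected from within R. This is a small sketch, assuming the installed version of mlr3filters exposes the `mlr_filters` dictionary with an `as.data.table()` method; the exact columns of the resulting table may differ between versions.

```r
library("mlr3")
library("mlr3filters")

# keys that can be passed to flt(), e.g. "auc" or "variance"
filter_keys = mlr_filters$keys()
print(filter_keys)

# one row per filter with its metadata
filter_table = as.data.table(mlr_filters)
print(filter_table)
```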
Variable Importance Filters
The following learners allow the extraction of variable importance and therefore are supported by FilterImportance:
## [1] "classif.featureless" "classif.ranger" "classif.rpart"
## [4] "classif.xgboost" "regr.featureless" "regr.ranger"
## [7] "regr.rpart" "regr.xgboost"
If your learner is not listed here but can extract variable importance from the fitted model, it is most likely not yet integrated into the package mlr3learners or the extra learner extension. Please open an issue so we can add your package.
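A list like the one above can be derived from the learner dictionary. The sketch below assumes that mlr3learners is installed and that the table returned by `as.data.table(mlr_learners)` has a list column named `properties`. The result depends on which learner packages are loaded, so it may not match the output shown here exactly.

```r
library("mlr3")
library("mlr3learners")

# one row per registered learner; `properties` is a list column
learner_table = as.data.table(mlr_learners)

# keep the learners that advertise the "importance" property
has_importance = vapply(
  learner_table$properties,
  function(props) "importance" %in% props,
  logical(1)
)
importance_learners = learner_table$key[has_importance]
print(importance_learners)
```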
Some learners need to have their variable importance measure “activated” during learner creation. For example, to use the “impurity” measure of Random Forest via the {ranger} package:
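library("mlr3learners") # provides classif.ranger
task = tsk("iris")
# set the importance mode at construction; assigning a new list to
# `param_set$values` afterwards would discard the seed
learner = lrn("classif.ranger", seed = 42, importance = "impurity")
filter = flt("importance", learner = learner)
filter$calculate(task)
head(as.data.table(filter), 3)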
## feature score
## 1: Petal.Length 44.682462
## 2: Petal.Width 43.113031
## 3: Sepal.Length 9.039099
Performance Filter
FilterPerformance is a univariate filter method which calls resample() once for every predictor variable in the dataset, using that variable as the only feature, and ranks the features by the resulting performance under the supplied measure. Any learner can be passed to this filter, with classif.rpart being the default. Regression learners can also be passed if the task is of type “regr”.
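A minimal sketch of the performance filter follows. The learner, resampling and measure are passed explicitly so the example does not depend on defaults. Holdout resampling and classification accuracy are illustrative choices, not recommendations.

```r
library("mlr3")
library("mlr3filters")

set.seed(1)
task = tsk("iris")

# each feature is evaluated on its own with a decision tree
filter = flt("performance",
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)
filter$calculate(task)

# one row per feature, best-scoring feature first
scores = as.data.table(filter)
print(scores)
```

Since every feature triggers its own resampling run, the cost grows linearly with the number of features; a cheap learner and a simple resampling scheme keep the filter fast.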
Filter-based Feature Selection
In many cases filtering is only one step in the modeling pipeline. To select features based on filter values, one can use PipeOpFilter from mlr3pipelines.
library(mlr3pipelines)
task = tsk("spam")
# the `filter.frac` should be tuned
graph = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = as_learner(graph)
rr = resample(task, learner, rsmp("holdout"))
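The comment in the example notes that `filter.frac` should be tuned. As a rough sketch of what that means, the loop below compares a few hand-picked fractions on identical cross-validation splits. The candidate values, the 3-fold setup and the classification error measure are assumptions made for illustration; a proper tuning run would normally use a dedicated tuning package.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

set.seed(1)
task = tsk("spam")

# instantiate once so every candidate is evaluated on the same splits
resampling = rsmp("cv", folds = 3)
resampling$instantiate(task)

# candidate fractions of features to keep (illustrative values)
fracs = c(0.25, 0.5, 0.75)

errors = vapply(fracs, function(frac) {
  graph = po("filter", filter = flt("auc"), filter.frac = frac) %>>%
    po("learner", lrn("classif.rpart"))
  rr = resample(task, as_learner(graph), resampling)
  unname(rr$aggregate(msr("classif.ce")))
}, numeric(1))

comparison = data.frame(filter.frac = fracs, classif.ce = errors)
print(comparison)
```

Because the filter is part of the graph, its scores are recomputed on the training data of each fold, so the feature selection does not see the test data.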