Description

Comparing Automated Subject Indexing Methods in R.

Perform evaluation of automatic subject indexing methods. The main focus of the package is to enable efficient computation of set retrieval and ranked retrieval metrics across multiple dimensions of a dataset, e.g. document strata or subsets of the label set. The package also provides the possibility of computing bootstrap confidence intervals for all major metrics, with seamless integration of parallel computation and propensity scored variants of standard metrics.

CASIMiR: Comparing Automated Subject Indexing Methods in R

CASIMiR is a toolbox to facilitate comparative analysis of automated subject indexing methods in R.

Why should you use CASIMiR?

You can certainly compute F-score, precision and recall with your favourite metric function in scikit-learn or other ML libraries. But does that really help you understand the quality of your favourite subject indexing method? If method $A$ achieves an F-score of 0.4 and method $B$ achieves 0.41, does that mean $B$ is better than $A$? Maybe. But you are likely to miss many nuances in the results if you look only at overall scores. The quality of subject suggestions may vary considerably among subject groups, and it may also depend strongly on the amount of training material per subject term.

This is where CASIMiR comes in: it helps you drill down into your results in detail. In addition, CASIMiR offers advanced metric functions, such as area under the precision-recall curve, NDCG, graded relevance metrics and propensity scored metrics. Last but not least, CASIMiR can compute metrics with confidence intervals based on bootstrap methods, which helps you estimate the uncertainty in your results due to the possibly limited size of your test sample.
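
To make the bootstrap idea concrete, here is a minimal base R sketch of a percentile bootstrap confidence interval for a document-averaged score. It is not CASIMiR's implementation: the per-document F1 values are simulated, and `bootstrap_ci`, `n_boot` and `level` are hypothetical names chosen for this illustration.

```r
set.seed(42)

# Simulated per-document F1 scores for a small test sample (assumption:
# 50 documents, values drawn at random purely for illustration).
f1_per_doc <- runif(50)

# Percentile bootstrap: resample documents with replacement, recompute the
# mean score each time, and take empirical quantiles of the resampled means.
bootstrap_ci <- function(x, n_boot = 1000, level = 0.95) {
  boot_means <- replicate(
    n_boot,
    mean(sample(x, size = length(x), replace = TRUE))
  )
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci <- bootstrap_ci(f1_per_doc)
c(estimate = mean(f1_per_doc), lower = ci[1], upper = ci[2])
```

With a small test sample the interval is typically wide, which is why a difference such as 0.40 versus 0.41 may not be meaningful on its own.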

Why R?

Mainly due to the authors' love for R, but here are some reasons that might convince other people:

  • R's user-friendly capabilities for data analysis with the tidyverse packages
  • professional visualisation with ggplot2
  • seamless handling of grouped data structures
  • efficient data wrangling libraries, such as collapse and dplyr, which are the backbone of CASIMiR
  • the wonderful and inclusive R community

Installation instructions

Install a stable development version from GitHub (requires compilation)

remotes::install_github("deutsche-nationalbibliothek/casimir")

Getting Started

Most functions expect at least two inputs: gold_standard and predicted. Both are expected to be data.frames with subject suggestions in a long format.

Example table for gold standard or predictions:

doc_id  label_id
1       A
1       B
2       A
3       C
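
As a small sketch, a table in this long format could be built in base R as follows. Storing `doc_id` as character is an assumption made here, and the contents of `predicted` are made up for illustration.

```r
# Gold standard in long format: one row per (document, subject label) pair.
gold_standard <- data.frame(
  doc_id   = c("1", "1", "2", "3"),
  label_id = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical predictions in the same layout (values invented for this example).
predicted <- data.frame(
  doc_id   = c("1", "1", "2", "3"),
  label_id = c("A", "C", "A", "C"),
  stringsAsFactors = FALSE
)

str(gold_standard)
```

A document with several subject terms simply appears in several rows, so no list columns or wide label matrices are needed.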

For ranked retrieval metrics, i.e. metrics that take into account a ranking of the subject suggestions based on some score, the input also needs a score column:

doc_id  label_id  score
1       A         0.73
1       B         0.15
2       A         0.92
3       C         0.34
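
To illustrate how the score column induces a ranking, the following base R sketch orders the suggestions within each document and keeps the top k per document. It only shows the idea and does not use any CASIMiR function; the names `ranked`, `top_k` and `k` are hypothetical.

```r
# Predictions with scores, in the long format shown above.
predicted <- data.frame(
  doc_id   = c("1", "1", "2", "3"),
  label_id = c("A", "B", "A", "C"),
  score    = c(0.73, 0.15, 0.92, 0.34),
  stringsAsFactors = FALSE
)

# Sort suggestions within each document from highest to lowest score.
ranked <- predicted[order(predicted$doc_id, -predicted$score), ]

# Rank position within each document (1 = highest score).
ranked$rank <- ave(
  ranked$score, ranked$doc_id,
  FUN = function(s) rank(-s, ties.method = "first")
)

# Keep only the k best suggestions per document.
k <- 1
top_k <- ranked[ranked$rank <= k, ]
top_k
```

With `k <- 1`, only the highest-scored suggestion of each document remains, so label B of document 1 is dropped.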

library(casimir)

res <- compute_set_retrieval_scores(
  gold_standard = dnb_gold_standard,
  predicted = dnb_test_predictions
)

head(res)
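
For intuition about what set retrieval metrics measure on such long tables, here is a hand-rolled base R sketch that computes precision, recall and F1 per document and then averages over documents. It is an illustration under stated assumptions, not CASIMiR's implementation: `per_doc_scores` is a hypothetical helper, the toy data are invented, and treating an empty denominator as a score of 0 is a convention chosen here.

```r
gold_standard <- data.frame(
  doc_id   = c("1", "1", "2", "3"),
  label_id = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)
predicted <- data.frame(
  doc_id   = c("1", "1", "2", "3"),
  label_id = c("A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Compare the gold and predicted label sets document by document.
per_doc_scores <- function(gold, pred) {
  docs <- union(gold$doc_id, pred$doc_id)
  rows <- lapply(docs, function(d) {
    g  <- unique(gold$label_id[gold$doc_id == d])
    p  <- unique(pred$label_id[pred$doc_id == d])
    tp <- length(intersect(g, p))  # correctly suggested labels
    prec <- if (length(p) > 0) tp / length(p) else 0
    rec  <- if (length(g) > 0) tp / length(g) else 0
    f1   <- if (prec + rec > 0) 2 * prec * rec / (prec + rec) else 0
    data.frame(doc_id = d, precision = prec, recall = rec, f1 = f1)
  })
  do.call(rbind, rows)
}

scores <- per_doc_scores(gold_standard, predicted)

# Document-averaged scores: each document contributes equally.
macro <- colMeans(scores[, c("precision", "recall", "f1")])
macro
```

Document 1 gets one of two labels right, document 2 is perfect and document 3 is entirely wrong, so all three averaged metrics come out at 0.5 on this toy data. Keeping the per-document table around is what makes drill-down by document strata possible.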

Acknowledgments

This work was created within the DNB AI Project. The project was funded by the Federal Government Commissioner for Culture and the Media as part of the national AI strategy.

Metadata

Version

0.3.3

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows