Description

Extract Information from Clinical Reports from 'Oncomine Reporter' and NCBI 'ClinVar'.

Clinical reports generated by 'Oncomine Reporter' software contain critical data in unstructured PDF format, making manual extraction time-consuming and error-prone. 'ORscraper' provides a coherent suite of functions to automate this process, allowing researchers to parse reports, identify key biomarkers, extract genetic variant tables, and filter results. It also integrates with the NCBI 'ClinVar' API <https://www.ncbi.nlm.nih.gov/clinvar/> to enrich extracted data.

ORscraper: An R package for extracting data from Oncomine Reporter’s clinical reports.


Overview

ORscraper is an R package designed to extract relevant medical information from clinical reports generated by the Oncomine Reporter software. This package is intended for healthcare professionals and researchers working with genetic data who need to automate the extraction and processing of information from report files. ORscraper provides tools to identify biopsies, extract genetic variants and pathogenicity classifications, filter relevant data, and query databases such as NCBI ClinVar.

Installation

Install the released version of ORscraper from CRAN:

install.packages("ORscraper")

Alternatively, you can install ORscraper from GitHub using the following R code:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install ORscraper from GitHub
devtools::install_github("SamuelGonzalez0204/ORscraper")

Basic Usage

Below is a basic example of how to use ORscraper to extract information from PDF files:

library(ORscraper)

# Read content from a PDF file
example_pdf <- system.file("extdata", "100.1-example.pdf", package = "ORscraper")
lines <- read_pdf_content(example_pdf)

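# Read the list of genes of interest from the bundled Excel file
# (read_excel() comes from the readxl package, so it is called with its namespace)
genesFile <- system.file("extdata", "Genes.xlsx", package = "ORscraper")
genes <- readxl::read_excel(genesFile)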
mutations <- unique(genes$GEN)

# Extract mutation values from the extracted text
genes_mut <- c()
pathogenicities <- c()
tableValues <- extract_values_from_tables(lines, mutations)
genes_mut <- c(genes_mut, tableValues[1])
pathogenicities <- c(pathogenicities, tableValues[2])

# Filter only pathogenic mutations
pathogenic_mutations <- filter_pathogenic_only(pathogenicities, genes_mut)

print(pathogenic_mutations)
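
The final filtering step can be pictured with plain base R. The sketch below is only an illustration of the idea, not the package's implementation: the vectors `genes` and `pathogenicity` and their values are made-up toy data, and it assumes each gene is paired with one classification string.

```r
# Toy data (hypothetical values, not taken from a real report)
genes <- c("BRAF", "KRAS", "TP53", "EGFR")
pathogenicity <- c("Pathogenic", "Benign", "Pathogenic", "Uncertain significance")

# Keep only the entries whose classification is exactly "Pathogenic"
keep <- pathogenicity == "Pathogenic"
pathogenic_genes <- genes[keep]

print(pathogenic_genes)
```

With this toy input, only BRAF and TP53 remain. A real report may need looser matching (for example, classifications such as "Likely pathogenic"), which this sketch deliberately leaves out.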

Main Functions

The ORscraper package includes several key functions:

  • classify_biopsy(): Analyzes biopsy identifiers and categorizes them based on predefined rules.

  • extract_chip_id(): Extracts chip values from filenames matching specific patterns.

  • extract_fusions(): Identifies and extracts fusion variants from text lines.

  • extract_intermediate_values(): Searches for a specific text pattern and extracts consecutive values.

  • extract_values_from_tables(): Extracts information such as mutations, pathogenicity, and frequencies from tables in reports.

  • extract_values_start_end(): Extracts values based on start and end markers.

  • filter_pathogenic_only(): Filters mutations, retaining only those marked as “Pathogenic.”

  • read_pdf_content(): Extracts the content of a PDF and splits it into individual lines.

  • read_pdf_files(): Scans a directory and retrieves all PDF files.

  • search_ncbi_clinvar(): Queries the NCBI ClinVar database for germline classifications.
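
To make the marker-based extraction idea more concrete, here is a minimal base R sketch. It is not the code behind extract_values_start_end(): the helper name `extract_between`, its arguments, and the section titles in the toy report are all hypothetical, chosen only to show how lines between a start marker and an end marker can be pulled out of a character vector.

```r
# Hypothetical helper: return the lines strictly between the first line
# containing `start` and the next line containing `end`.
extract_between <- function(lines, start, end) {
  start_idx <- grep(start, lines, fixed = TRUE)
  if (length(start_idx) == 0) return(character(0))
  start_idx <- start_idx[1]

  # Only consider end markers that appear after the start marker
  end_idx <- grep(end, lines, fixed = TRUE)
  end_idx <- end_idx[end_idx > start_idx]
  if (length(end_idx) == 0) return(character(0))
  end_idx <- end_idx[1]

  # Nothing to return when the markers are on adjacent lines
  if (end_idx - start_idx < 2) return(character(0))

  trimws(lines[(start_idx + 1):(end_idx - 1)])
}

# Toy report text (made-up section titles and variants)
report_lines <- c(
  "Sample information",
  "Relevant biomarkers",
  "  BRAF p.(V600E)",
  "  KRAS p.(G12D)",
  "Variant details",
  "Footer"
)

between <- extract_between(report_lines, "Relevant biomarkers", "Variant details")
print(between)
```

The helper returns an empty character vector when either marker is missing, so callers can check `length(between) == 0` instead of handling errors.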

Metadata

Version

0.1.1

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows