Core Data Contracts, Parsers, and Scoring Primitives for Clinical Submission Readiness.
r4subcore
r4subcore is the foundational package in the R4SUB ecosystem. It defines the core data contracts, parsers, evidence schema, and scoring primitives needed to quantify clinical submission readiness.
It is intentionally "boring and stable": other R4SUB packages (e.g., r4subtrace, r4subrisk, r4subscore) build on these structures and interfaces.
Why r4subcore?
Clinical submission readiness is rarely a single tool output. It's an evidence graph across:
- SDTM/ADaM datasets and metadata
- Define.xml / ARM / reviewer guides
- Validation results (Pinnacle21, OpenCDISC, internal rule engines)
- Traceability / derivations (ADaM spec <-> code <-> outputs)
- Usability / reviewer experience signals
r4subcore provides:
- A standardized Evidence Table schema
- Common parsers to ingest heterogeneous sources
- A consistent indicator / signal abstraction
- Scoring primitives (normalize, weight, calibrate, aggregate)
- A reproducible run context (run_id, dataset_id, study_id, tool version)
Package scope
In scope
- Evidence schema + validation
- Parsers for common sources (initial focus on P21-style outputs + define.xml scaffolding)
- Common utilities: ID generation, severity mapping, controlled terminology mapping, standard columns
- Indicator interfaces (how other packages implement signals)
- Transparent scoring components (no hidden "magic")
Out of scope
- Full SCI (Submission Confidence Index) calculation (belongs in r4subscore)
- End-to-end dashboards / Shiny apps (belongs in r4subui)
- Full traceability logic (belongs in r4subtrace)
- Domain-specific oncology rules (belongs in extension packages)
Installation
Development install
# install.packages("pak")
pak::pak("R4SUB/r4subcore")
Requirements
- R >= 4.2
- Suggested: arrow, xml2, dplyr, readr, jsonlite, cli
Core concepts
1) Evidence Table (the heart of R4SUB)
All inputs are normalized into a single tabular contract: an evidence dataset. This enables scoring, drilldown, traceability, and reporting.
Minimum columns (v0.1):
| column | type | meaning |
|---|---|---|
| run_id | chr | unique ID for a run |
| study_id | chr | study identifier |
| asset_type | chr | dataset, define, program, validation, spec, etc. |
| asset_id | chr | unique ID of the asset (e.g., ADSL, define.xml) |
| source_name | chr | tool/source name (e.g., pinnacle21) |
| source_version | chr | tool version |
| indicator_id | chr | the signal definition identifier |
| indicator_name | chr | human-readable name |
| indicator_domain | chr | quality, trace, risk, usability |
| severity | chr | info, low, medium, high, critical |
| result | chr | pass, fail, warn, na |
| metric_value | dbl | numeric value (if applicable) |
| metric_unit | chr | unit for metric |
| message | chr | short description |
| location | chr | pointer (dataset/variable/rule line) |
| evidence_payload | json | raw structured payload |
| created_at | POSIXct | ingestion timestamp |
Guarantees:
- Each row is a single unit of evidence
- Evidence is immutable (append-only semantics recommended)
- Score consumers can rely on consistent meaning
Use:
- as_evidence() to coerce raw data
- validate_evidence() to enforce the contract
- bind_evidence() to combine sources safely
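To make the contract concrete, here is a minimal base R sketch of the kind of check a validator could run against the v0.1 columns. It is not the implementation of validate_evidence(): the function name check_evidence_sketch(), the type mapping (json stored as a character string), and the sample values are assumptions made for illustration.

```r
# Hypothetical contract check; NOT r4subcore's validate_evidence().
# Column names and allowed values come from the table above; the mapping of
# "chr"/"dbl"/"json" to R classes is an assumption.
required_cols <- c(
  run_id = "character", study_id = "character", asset_type = "character",
  asset_id = "character", source_name = "character",
  source_version = "character", indicator_id = "character",
  indicator_name = "character", indicator_domain = "character",
  severity = "character", result = "character", metric_value = "numeric",
  metric_unit = "character", message = "character", location = "character",
  evidence_payload = "character", created_at = "POSIXct"
)
allowed <- list(
  indicator_domain = c("quality", "trace", "risk", "usability"),
  severity = c("info", "low", "medium", "high", "critical"),
  result = c("pass", "fail", "warn", "na")
)

check_evidence_sketch <- function(ev) {
  # 1. Every contract column must be present.
  missing_cols <- setdiff(names(required_cols), names(ev))
  problems <- sprintf("missing column: %s", missing_cols)
  # 2. Present columns must have the expected class.
  for (col in intersect(names(required_cols), names(ev))) {
    if (!inherits(ev[[col]], required_cols[[col]])) {
      problems <- c(problems, sprintf("wrong type for %s", col))
    }
  }
  # 3. Controlled columns may only hold documented values.
  for (col in intersect(names(allowed), names(ev))) {
    bad_values <- setdiff(unique(ev[[col]]), allowed[[col]])
    problems <- c(problems, sprintf("invalid %s: %s", col, bad_values))
  }
  problems
}

# One illustrative evidence row (all values invented).
ev <- data.frame(
  run_id = "run-001", study_id = "ABC123", asset_type = "validation",
  asset_id = "ADSL", source_name = "pinnacle21", source_version = "P21-3.0",
  indicator_id = "IND-001", indicator_name = "Required variable missing",
  indicator_domain = "quality", severity = "high", result = "fail",
  metric_value = NA_real_, metric_unit = NA_character_,
  message = "AGE is missing", location = "ADSL.AGE",
  evidence_payload = "{}", created_at = Sys.time(),
  stringsAsFactors = FALSE
)

check_evidence_sketch(ev)  # no problems reported for a conforming row
```

Returning a vector of problems rather than stopping at the first one is a design choice of this sketch; it lets a caller report every contract violation in a single pass.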
2) Indicators (signals)
An indicator is a definition of what to measure, not necessarily how to calculate it.
Indicators have:
- indicator_id (stable)
- domain (quality/trace/risk/usability)
- description
- expected_inputs (evidence sources required)
- default_thresholds
- optional tags (e.g., define, adam, sdtm, spec)
r4subcore provides:
- indicator registry helpers (local registry first, remote later)
- validation to ensure indicator metadata is well-formed
Other packages implement the actual calculations and output evidence rows using these IDs.
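As a rough illustration of indicator metadata and a local registry, the base R sketch below stores an indicator as a plain list and checks that it is well-formed before registering it. The helper names validate_indicator_sketch() and register_indicator_sketch(), the indicator ID, and the threshold values are all hypothetical; the package's own registry helpers may look different.

```r
# Hypothetical indicator definition. Field names follow the list above;
# the ID, description, and thresholds are invented for illustration.
indicator <- list(
  indicator_id = "Q-DEFINE-001",
  domain = "quality",
  description = "Dataset variables that are not described in define.xml",
  expected_inputs = c("define", "dataset"),
  default_thresholds = list(warn = 1, fail = 5),
  tags = c("define", "adam")
)

# Assumed well-formedness rules: required fields present, a single
# non-empty ID, and a domain from the documented set.
validate_indicator_sketch <- function(ind) {
  required <- c("indicator_id", "domain", "description",
                "expected_inputs", "default_thresholds")
  missing_fields <- setdiff(required, names(ind))
  if (length(missing_fields) > 0) {
    stop("missing fields: ", paste(missing_fields, collapse = ", "))
  }
  if (!is.character(ind$indicator_id) || length(ind$indicator_id) != 1 ||
      !nzchar(ind$indicator_id)) {
    stop("indicator_id must be a single non-empty string")
  }
  if (!ind$domain %in% c("quality", "trace", "risk", "usability")) {
    stop("unknown domain: ", ind$domain)
  }
  invisible(TRUE)
}

# A local registry modeled as a named list keyed by indicator_id.
register_indicator_sketch <- function(registry, ind) {
  validate_indicator_sketch(ind)
  if (ind$indicator_id %in% names(registry)) {
    stop("indicator already registered: ", ind$indicator_id)
  }
  registry[[ind$indicator_id]] <- ind
  registry
}

registry <- register_indicator_sketch(list(), indicator)
names(registry)
```

Rejecting duplicate IDs at registration time is one way to keep indicator_id stable, since evidence rows produced by other packages refer to indicators only by that ID.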
3) Scoring primitives (transparent & composable)
r4subcore includes small, auditable functions for:
- mapping severity -> numeric penalty
- normalizing metrics to 0-1
- applying weights
- aggregating evidence into indicator scores
SCI itself is not in this package.
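The four steps listed above (severity mapping, normalization, weighting, aggregation) can be sketched in a few lines of base R. The penalty values, the per-row weight column, the rule that only fail and warn rows carry a penalty, and the function names are assumptions chosen for this example; they are not the package's defaults.

```r
# Assumed severity -> penalty mapping (illustrative values only).
severity_penalty <- c(info = 0, low = 0.25, medium = 0.5,
                      high = 0.75, critical = 1)

# Min-max normalization to the 0-1 range, clamped at both ends.
normalize_01 <- function(x, lo = min(x, na.rm = TRUE),
                         hi = max(x, na.rm = TRUE)) {
  if (hi == lo) return(rep(0, length(x)))
  pmin(pmax((x - lo) / (hi - lo), 0), 1)
}

# Toy evidence rows; `weight` is a hypothetical per-row weight.
ev <- data.frame(
  indicator_id = c("IND-A", "IND-A", "IND-B"),
  severity = c("high", "low", "critical"),
  result = c("fail", "warn", "pass"),
  weight = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Assumption: only failing or warning rows are penalized.
penalized <- ev$result %in% c("fail", "warn")
ev$penalty <- ifelse(penalized, unname(severity_penalty[ev$severity]), 0)

# Indicator score = 1 - weighted mean penalty of its evidence rows.
score_indicator <- function(d) 1 - sum(d$penalty * d$weight) / sum(d$weight)
scores <- vapply(split(ev, ev$indicator_id), score_indicator, numeric(1))

scores
normalize_01(c(0, 5, 10))
```

Each step is a separate, inspectable value (penalty column, weights, per-indicator scores), which is the property the "transparent and composable" wording asks for.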
Quick start
Create a run context
library(r4subcore)
ctx <- r4sub_run_context(
  study_id = "ABC123",
  environment = "DEV",
  user = Sys.info()[["user"]]
)
ctx$run_id
Ingest validation results (example)
raw <- read.csv("p21_report.csv")
ev <- p21_to_evidence(
  raw,
  ctx = ctx,
  asset_type = "validation",
  source_version = "P21-3.0"
)
validate_evidence(ev)
Summarize evidence quickly
evidence_summary(ev)
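The output of evidence_summary() is not shown here. As a rough idea of what a quick summary of an evidence table can look like, the base R sketch below cross-tabulates severity against result on invented rows; it does not reproduce the function's actual output format.

```r
# Toy evidence rows (invented); only the columns needed for the summary.
ev_demo <- data.frame(
  severity = c("high", "low", "high", "critical"),
  result = c("fail", "warn", "pass", "fail"),
  stringsAsFactors = FALSE
)

# Fix the level order so empty categories still appear in the table.
ev_demo$severity <- factor(
  ev_demo$severity,
  levels = c("info", "low", "medium", "high", "critical")
)
ev_demo$result <- factor(
  ev_demo$result,
  levels = c("pass", "warn", "fail", "na")
)

# Severity x result counts.
counts <- table(severity = ev_demo$severity, result = ev_demo$result)
counts
```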
Architecture
Main modules
- R/evidence_schema.R -- schema + validators
- R/run_context.R -- run metadata
- R/parsers_p21.R -- Pinnacle21 ingestion (first parser)
- R/indicators.R -- indicator metadata + registry
- R/scoring_primitives.R -- severity mapping, normalization, aggregation
- R/utils_ids.R -- ID helpers, hashing
- R/utils_json.R -- JSON payload helpers
Extensibility
- New parsers should output evidence via as_evidence()
- New indicators should register IDs and domain metadata
- Consumers should never depend on tool-specific raw formats
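To show what the first and last rules mean in practice, here is a sketch of a parser for a made-up internal rule engine. The raw column names (rule_id, dataset, variable, level, text), the level-to-severity mapping, and the function name internal_to_evidence_sketch() are all hypothetical. In a real parser the resulting data frame would be handed to as_evidence(), whose arguments are not documented in this README, so that call is left out.

```r
# Hypothetical raw output of an internal rule engine.
raw <- data.frame(
  rule_id = c("R001", "R002"),
  dataset = c("ADSL", "ADAE"),
  variable = c("AGE", "AESEV"),
  level = c("Error", "Warning"),
  text = c("Missing value for required variable", "Value not in codelist"),
  stringsAsFactors = FALSE
)

# Assumed mappings from tool-specific levels to contract values.
severity_map <- c(Error = "high", Warning = "medium")
result_map <- c(Error = "fail", Warning = "warn")

# Translate tool-specific columns into the v0.1 evidence columns so that
# downstream consumers never see the raw format.
internal_to_evidence_sketch <- function(raw, run_id, study_id,
                                        source_version) {
  data.frame(
    run_id = run_id,
    study_id = study_id,
    asset_type = "validation",
    asset_id = raw$dataset,
    source_name = "internal_engine",
    source_version = source_version,
    indicator_id = raw$rule_id,
    indicator_name = raw$rule_id,
    indicator_domain = "quality",
    severity = unname(severity_map[raw$level]),
    result = unname(result_map[raw$level]),
    metric_value = NA_real_,
    metric_unit = NA_character_,
    message = raw$text,
    location = paste(raw$dataset, raw$variable, sep = "."),
    # Keep the original fields as a small JSON string for drilldown.
    evidence_payload = sprintf('{"rule_id":"%s","level":"%s"}',
                               raw$rule_id, raw$level),
    created_at = Sys.time(),
    stringsAsFactors = FALSE
  )
}

ev_internal <- internal_to_evidence_sketch(
  raw, run_id = "run-001", study_id = "ABC123", source_version = "1.0"
)
ev_internal[, c("asset_id", "indicator_id", "severity", "result", "location")]
```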
Design principles
- Contract-first: normalize everything into evidence rows
- Transparent scoring: no black-box weights; everything configurable
- Tool-agnostic: support P21 now, but leave room for OpenCDISC and internal rule engines
- Reproducible: run_id + source_version captured everywhere
- Composable: small functions, no tight coupling
Roadmap
v0.1
- Evidence schema + validator
- Run context
- P21-style parser (CSV/XLSX minimal)
- Indicator metadata helpers
- Severity -> numeric mappings
- Minimal summarizers
v0.2
- define.xml ingestion (structure-level metadata)
- Arrow/parquet IO
- Evidence "joins" (dataset <-> variable <-> rule)
- Config profiles (e.g., FDA_ADaM_basic, EMA_SDTM_basic)
Contributing
- Run devtools::check() before opening a PR
- Add tests for each parser and scoring function
- Do not break evidence schema without a version bump + migration note
License
MIT -- see LICENSE file.