Description
Traceability Engine for Clinical Submission Readiness.
Description
Quantifies and explains end-to-end traceability between clinical submission artifacts: ADaM (Analysis Data Model) outputs, derivations, SDTM (Study Data Tabulation Model) sources, specs, and code. Builds trace models from metadata and mapping sheets, computes trace levels, and emits standardized R4SUB (R for Regulatory Submission) evidence table rows via 'r4subcore'.
README.md
r4subtrace
r4subtrace is the traceability engine in the R4SUB ecosystem. It quantifies and explains end-to-end traceability between clinical submission artifacts -- primarily ADaM outputs <-> derivations <-> SDTM sources <-> specs <-> code -- and converts trace evidence into standardized R4SUB Evidence Table rows (from r4subcore).
It focuses on answering one question:
Can we prove where each analysis variable/value came from, and can a reviewer follow it?
Why r4subtrace?
In real submissions, issues are rarely "a single failed rule." Many are trace failures:
- Missing or ambiguous derivation documentation
- ADaM variable not linkable to SDTM sources
- Mismatch between spec and what code produces
- Inconsistent naming across specs, define.xml, and datasets
- Reviewer cannot reproduce or validate lineage
r4subtrace formalizes traceability as evidence + measurable indicators.
What r4subtrace measures
Traceability levels
- L0 -- None: no linkage available
- L1 -- Spec-only: ADaM spec defines the derivation, but no source or code mapping exists
- L2 -- Spec + source mapping: ADaM var mapped to SDTM vars/domains
- L3 -- Spec + code mapping: mapping exists with high confidence or derivation text
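To make the ladder concrete, here is a minimal base-R sketch that assigns a level to each ADaM variable from three flags. The helper `assign_trace_level()`, its arguments, and the example variables are hypothetical; they only illustrate the definitions above and are not r4subtrace's actual API or scoring logic. Treating L3 as "source mapping plus high confidence or derivation text" is an assumption about how the levels stack.

```r
# Hypothetical helper: illustrates the L0-L3 ladder, not the package internals.
assign_trace_level <- function(has_spec, has_source_map, has_strong_support) {
  if (!has_spec) return(0L)            # L0: no linkage available
  if (!has_source_map) return(1L)      # L1: spec only
  if (!has_strong_support) return(2L)  # L2: spec + source mapping
  3L                                   # L3: mapping backed by high confidence or derivation text
}

# Illustrative inputs, one row per ADaM variable.
vars <- data.frame(
  adam_var = c("AVAL", "TRTEMFL", "AGEGR1", "XYZ"),
  has_spec = c(TRUE, TRUE, TRUE, FALSE),
  has_source_map = c(TRUE, TRUE, FALSE, FALSE),
  has_strong_support = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Apply the helper row by row.
vars$trace_level <- mapply(
  assign_trace_level,
  vars$has_spec, vars$has_source_map, vars$has_strong_support
)
print(vars[, c("adam_var", "trace_level")])
```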
Installation
pak::pak(c("R4SUB/r4subcore", "R4SUB/r4subtrace"))
Quick start
1) Create run context
library(r4subcore)
library(r4subtrace)
ctx <- r4sub_run_context(study_id = "ABC123", environment = "DEV")
2) Load metadata
adam_meta <- read.csv("adam_metadata.csv") # columns: dataset, variable, label, type
sdtm_meta <- read.csv("sdtm_metadata.csv") # same structure
map <- read.csv("trace_map.csv")
# recommended columns:
# adam_dataset, adam_var, sdtm_domain, sdtm_var, derivation_text (optional), confidence (optional)
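Before passing a mapping sheet on, it can help to check its columns. The sketch below is a hypothetical pre-flight check in base R: `check_mapping_columns()` is not part of r4subtrace, and treating the four non-optional columns as strictly required is an assumption (the README only calls them recommended). The example rows are illustrative.

```r
# Hypothetical pre-flight check; column names follow the recommendation above.
required_cols <- c("adam_dataset", "adam_var", "sdtm_domain", "sdtm_var")
optional_cols <- c("derivation_text", "confidence")

check_mapping_columns <- function(mapping) {
  missing_required <- setdiff(required_cols, names(mapping))
  if (length(missing_required) > 0) {
    stop("Mapping is missing columns: ", paste(missing_required, collapse = ", "))
  }
  # Return the optional columns that are absent, so the caller knows what was not supplied.
  setdiff(optional_cols, names(mapping))
}

# Illustrative mapping with no confidence column.
example_map <- data.frame(
  adam_dataset = c("ADSL", "ADSL"),
  adam_var = c("AGE", "TRT01P"),
  sdtm_domain = c("DM", "DM"),
  sdtm_var = c("AGE", "ARM"),
  derivation_text = c("Copied from DM.AGE", "Planned arm from DM.ARM"),
  stringsAsFactors = FALSE
)

absent_optional <- check_mapping_columns(example_map)
print(absent_optional)
```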
3) Build trace model and evidence
tm <- build_trace_model(
  adam_meta = adam_meta,
  sdtm_meta = sdtm_meta,
  mapping = map
)
ev <- trace_model_to_evidence(tm, ctx = ctx, source_name = "r4subtrace", source_version = "0.1.0")
validate_evidence(ev)
evidence_summary(ev)
4) Compute trace coverage score
ind <- trace_indicator_scores(ev)
ind
Core objects
Trace Model
A list with:
- nodes: tidy table of assets (dataset/variable/spec/program)
- edges: tidy table of relationships + confidence
- diagnostics: issues found (orphans, ambiguities, conflicts)
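As a rough picture of that shape, the sketch below builds a toy list with the same three components by hand. The column names (`id`, `type`, `from`, `to`, `issue`) and the orphan rule are assumptions for illustration; the object returned by `build_trace_model()` may use a different schema.

```r
# Illustrative only: column names are assumptions, not the package schema.
nodes <- data.frame(
  id = c("ADSL.AGE", "ADSL.TRT01P", "ADSL.SAFFL", "DM.AGE", "DM.ARM"),
  type = c("adam_var", "adam_var", "adam_var", "sdtm_var", "sdtm_var"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("ADSL.AGE", "ADSL.TRT01P"),
  to = c("DM.AGE", "DM.ARM"),
  confidence = c(1.0, 0.8),
  stringsAsFactors = FALSE
)

# Here an orphan is an ADaM variable node with no outgoing edge to a source.
adam_ids <- nodes$id[nodes$type == "adam_var"]
orphans <- setdiff(adam_ids, edges$from)
diagnostics <- data.frame(
  id = orphans,
  issue = rep("orphan", length(orphans)),
  stringsAsFactors = FALSE
)

toy_model <- list(nodes = nodes, edges = edges, diagnostics = diagnostics)
str(toy_model, max.level = 1)
```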
Trace Evidence
Evidence rows are emitted for:
- each ADaM variable trace level
- each orphan/ambiguity/conflict
- aggregate coverage metrics
Indicators
- TRACE_VAR_COVERAGE_L2PLUS: proportion of ADaM variables with L2+ trace
- TRACE_VAR_COVERAGE_L3PLUS: proportion with L3+ trace
- TRACE_ORPHAN_VAR_COUNT: orphan ADaM vars with no SDTM mapping
- TRACE_AMBIGUOUS_MAPPING_COUNT: vars mapped to multiple SDTM sources
- TRACE_MEAN_TRACE_LEVEL: mean trace level across all ADaM variables
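The arithmetic behind these indicators can be sketched in a few lines of base R. The input vectors below are hypothetical (one trace level and one SDTM source count per ADaM variable), and the formulas are a plain reading of the definitions above rather than the code inside `trace_indicator_scores()`.

```r
# Hypothetical inputs: trace level (0-3) and number of mapped SDTM sources per ADaM variable.
trace_level <- c(AGE = 3L, TRT01P = 2L, SAFFL = 0L, AVAL = 2L)
n_sdtm_sources <- c(AGE = 1L, TRT01P = 1L, SAFFL = 0L, AVAL = 2L)

indicators <- c(
  TRACE_VAR_COVERAGE_L2PLUS = mean(trace_level >= 2),        # share of variables at L2 or higher
  TRACE_VAR_COVERAGE_L3PLUS = mean(trace_level >= 3),        # share of variables at L3
  TRACE_ORPHAN_VAR_COUNT = sum(n_sdtm_sources == 0),         # variables with no SDTM mapping
  TRACE_AMBIGUOUS_MAPPING_COUNT = sum(n_sdtm_sources > 1),   # variables with multiple sources
  TRACE_MEAN_TRACE_LEVEL = mean(trace_level)                 # average level across variables
)
print(indicators)
```

With these four variables, three are at L2 or higher, one is an orphan, and one maps to more than one source.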
Design principles
- Graph-first: traceability is a graph problem
- Evidence-first: all conclusions are backed by explicit evidence rows
- Tool-agnostic: can ingest mapping from any source format
- Reviewer-centric: emphasize explainability, not just metrics
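To illustrate the graph-first principle, the sketch below walks a small edge list breadth-first to collect everything an ADaM variable depends on. The `upstream()` function, the edge list layout, and the variable names are hypothetical; r4subtrace may represent and traverse lineage differently.

```r
# Hypothetical edge list: each row says `from` is derived from `to`.
edges <- data.frame(
  from = c("ADAE.TRTEMFL", "ADAE.TRTEMFL", "ADAE.ASTDT", "ADSL.TRTSDT"),
  to = c("ADAE.ASTDT", "ADSL.TRTSDT", "AE.AESTDTC", "EX.EXSTDTC"),
  stringsAsFactors = FALSE
)

# Breadth-first walk that returns every node reachable upstream of `start`.
upstream <- function(edges, start) {
  visited <- character(0)
  queue <- start
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    parents <- edges$to[edges$from == current]
    # Skip nodes already seen or queued so cycles cannot loop forever.
    new_nodes <- setdiff(parents, c(visited, queue, start))
    visited <- c(visited, new_nodes)
    queue <- c(queue, new_nodes)
  }
  visited
}

lineage <- upstream(edges, "ADAE.TRTEMFL")
print(lineage)
```

A reviewer-facing report could list this lineage next to each variable, which is the kind of explanation that a bare coverage metric does not give.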
License
MIT.