Streamlined Workflow for UK Biobank Data Extraction, Analysis, and Visualization.

ukbflow
RAP-Native R Workflow for UK Biobank Analysis
[!NOTE] 2026-04: ukbflow is now available on CRAN! Install with install.packages("ukbflow").
Overview
ukbflow provides a streamlined, RAP-native R workflow for UK Biobank analysis, from phenotype extraction and disease derivation to association analysis and publication-quality figures.
UK Biobank Data Policy (2024+): Individual-level data must remain within the RAP environment. Only summary-level outputs may be downloaded locally. All ukbflow functions are designed with this constraint in mind.
library(ukbflow)

# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()

# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name = "lung",
                  event_col = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date = as.Date("2022-10-31"),
                  death_col = "p40000_i0")

# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

# Cox regression: smoking -> lung cancer (3-model adjustment)
res <- assoc_coxph(data,
                   outcome_col = "lung_icd10",
                   time_col = "lung_followup_years",
                   exposure_col = "smoking_ever",
                   covariates = c("p21022", "p31", "p22189"))

# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data = res_df,
  est = res_df$HR,
  lower = res_df$CI_lower,
  upper = res_df$CI_upper,
  ci_column = 2L
)
Installation
# From CRAN (recommended)
install.packages("ukbflow")
# Latest development version from GitHub
pak::pkg_install("evanbio/ukbflow")
# or
remotes::install_github("evanbio/ukbflow")
Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)
pip install dxpy
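Before the first run it can help to confirm both requirements from within R. The following is a minimal base-R sketch, not the package's own ops_setup() check; it only assumes that installing dxpy puts the `dx` command on the PATH.

```r
# Minimal pre-flight check using base R only (not ukbflow's ops_setup()).
r_ok  <- getRversion() >= "4.1.0"   # the README requires R >= 4.1
dx_ok <- nzchar(Sys.which("dx"))    # TRUE if the dx CLI is found on the PATH

if (!r_ok)  message("R ", getRversion(), " is older than 4.1; please upgrade.")
if (!dx_ok) message("dx CLI not found; install it with: pip install dxpy")
```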
Core Features
| Layer | Key Functions | Description |
|---|---|---|
| Connection | auth_login, auth_select_project | Authenticate to RAP via dx-toolkit |
| Data Access | fetch_metadata, extract_batch, job_wait | Retrieve phenotype data from UKB dataset on RAP |
| Data Processing | decode_names, decode_values, derive_icd10, derive_followup, derive_case | Harmonize multi-source records; derive analysis-ready cohort |
| Association Analysis | assoc_coxph, assoc_logistic, assoc_subgroup | Three-model adjustment; subgroup & trend analysis |
| Genomic Scoring | grs_bgen2pgen, grs_score, grs_standardize | Distributed plink2 scoring on RAP worker nodes |
| Visualization | plot_forest, plot_tableone | Publication-ready figures & tables |
| Utilities | ops_setup, ops_toy, ops_na, ops_snapshot, ops_withdraw | Environment check, synthetic data, pipeline diagnostics, and cohort management |
Function Reference
Auth & Fetch
- auth_login(), auth_status(), auth_logout(), auth_list_projects(), auth_select_project(): RAP authentication
- fetch_ls(), fetch_tree(), fetch_url(), fetch_file(): RAP file system
- fetch_metadata(), fetch_field(): UKB metadata shortcuts
Extract & Decode
- extract_ls(), extract_pheno(), extract_batch(): phenotype extraction
- decode_values(): integer codes → human-readable labels
- decode_names(): field IDs → snake_case column names
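To make the two decoding steps concrete, here is a small base-R sketch of the idea. It does not call ukbflow; the lookup tables `name_map` and `value_map` and the output column names are hypothetical stand-ins for what the UKB data dictionary would supply.

```r
# Toy extract with UKB-style field-ID column names and integer codes.
raw <- data.frame(
  eid       = 1:4,
  p31       = c(0L, 1L, 1L, 0L),
  p20116_i0 = c(0L, 1L, 2L, -3L)
)

# Hypothetical lookup tables (illustration only, not ukbflow objects).
name_map  <- c(p31 = "sex", p20116_i0 = "smoking_status_i0")
value_map <- list(
  p31       = c("0" = "Female", "1" = "Male"),
  p20116_i0 = c("-3" = "Prefer not to answer", "0" = "Never",
                "1" = "Previous", "2" = "Current")
)

# Step 1: integer codes -> labels, column by column.
decoded <- raw
for (col in names(value_map)) {
  decoded[[col]] <- unname(value_map[[col]][as.character(raw[[col]])])
}

# Step 2: field IDs -> snake_case column names.
hit <- names(decoded) %in% names(name_map)
names(decoded)[hit] <- unname(name_map[names(decoded)[hit]])
```

Decoding values before renaming keeps the value lookup keyed by the stable field ID rather than by a label that may change.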
Job Monitoring
- job_status(): query job status by ID
- job_wait(): block until job completes (with timeout)
- job_path(): get output path of a completed job
- job_result(): retrieve job result object
- job_ls(): list recent jobs
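The "block until complete, with timeout" pattern can be sketched in a few lines of base R. This is a generic illustration rather than the implementation of job_wait(): `wait_for()`, `get_status`, and the state strings are hypothetical.

```r
# Generic poll-until-done helper. `get_status` is a hypothetical function
# that returns a state string; it is not part of ukbflow.
wait_for <- function(get_status, timeout = 60, interval = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    state <- get_status()
    if (state %in% c("done", "failed")) return(state)
    if (Sys.time() >= deadline) {
      stop("timed out waiting for job (last state: ", state, ")")
    }
    Sys.sleep(interval)
  }
}

# Fake job that reports "running" twice and then "done".
calls <- 0
fake_status <- function() {
  calls <<- calls + 1
  if (calls < 3) "running" else "done"
}

final <- wait_for(fake_status, timeout = 5, interval = 0.01)
```

Checking the terminal states before the deadline means a job that finishes on the last poll is still reported as finished rather than timed out.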
Derive: Phenotypes
- derive_missing(): handle "Do not know" / "Prefer not to answer"
- derive_covariate(): type conversion + summary
- derive_cut(): bin continuous variables
- derive_selfreport(): self-reported disease status + date
- derive_hes(): HES inpatient ICD-10
- derive_first_occurrence(): First Occurrence fields
- derive_cancer_registry(): cancer registry
- derive_death_registry(): death registry
- derive_icd10(): combine sources (wrapper)
- derive_case(): merge self-report + ICD-10
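Combining sources for an ICD-10 outcome usually comes down to a prefix match on the code followed by taking the earliest matching date per participant. The base-R sketch below illustrates that logic on toy data; the long-format layout of `dx` is an assumption for illustration and may differ from what derive_icd10() works with internally.

```r
# Toy long-format diagnosis records from two sources (hypothetical layout).
dx <- data.frame(
  eid    = c(1, 1, 2, 3, 3),
  source = c("hes", "cancer_registry", "hes", "hes", "cancer_registry"),
  icd10  = c("C341", "C349", "I210", "J449", "C340"),
  date   = as.Date(c("2015-03-01", "2014-11-20", "2012-06-05",
                     "2010-01-15", "2018-09-30"))
)

# Prefix match on the ICD-10 code, then keep the earliest date per person.
hits  <- dx[startsWith(dx$icd10, "C34"), ]
hits  <- hits[order(hits$eid, hits$date), ]
first <- hits[!duplicated(hits$eid), c("eid", "date")]

# Attach status and first-diagnosis date to the cohort.
cohort <- data.frame(eid = 1:3)
cohort$lung_icd10_date <- first$date[match(cohort$eid, first$eid)]
cohort$lung_icd10      <- !is.na(cohort$lung_icd10_date)
```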
Derive: Survival
- derive_timing(): prevalent vs. incident classification
- derive_age(): age at event
- derive_followup(): follow-up end date and duration
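The survival derivations can be illustrated with plain date arithmetic. The sketch below is an assumption about the usual convention (follow-up ends at the earliest of event, death, or administrative censoring; years are days divided by 365.25; prevalent cases get no follow-up time), not a description of how derive_timing() or derive_followup() are implemented.

```r
baseline <- as.Date(c("2008-05-01", "2009-02-10", "2010-07-15", "2009-09-09"))
event    <- as.Date(c("2015-06-01", NA, "2007-03-20", NA))
death    <- as.Date(c(NA, "2019-12-24", NA, NA))
censor   <- as.Date("2022-10-31")

# Prevalent = diagnosed on or before baseline; incident = diagnosed after.
timing <- ifelse(is.na(event), "none",
                 ifelse(event <= baseline, "prevalent", "incident"))

# Follow-up ends at the earliest of event, death, or censoring date.
end_date <- pmin(event, death, censor, na.rm = TRUE)

# Duration in years; prevalent cases are set to NA (assumed convention).
followup_years <- as.numeric(end_date - baseline) / 365.25
followup_years[timing == "prevalent"] <- NA
```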
Association Analysis
- assoc_coxph() / assoc_cox(): Cox proportional hazards (HR)
- assoc_logistic() / assoc_logit(): logistic regression (OR)
- assoc_linear() / assoc_lm(): linear regression (β)
- assoc_coxph_zph(): proportional hazards assumption test
- assoc_subgroup(): stratified analysis + interaction LRT
- assoc_trend(): dose-response trend + p_trend
- assoc_competing(): Fine-Gray competing risks (SHR)
- assoc_lag(): lagged exposure sensitivity analysis
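As a rough illustration of what a three-model adjustment produces, the sketch below fits the same exposure under three covariate sets on simulated data. It uses base-R logistic regression (glm) rather than a Cox model to stay dependency-free, and the ladder itself (unadjusted, age + sex, fully adjusted) is an assumption; the README does not spell out which covariates each ukbflow model includes.

```r
set.seed(2026)
n   <- 2000
age <- rnorm(n, 57, 8)
sex <- rbinom(n, 1, 0.46)
tdi <- rnorm(n)  # stand-in for a deprivation index
smoking_ever <- rbinom(n, 1, plogis(-0.5 + 0.02 * (age - 57) + 0.3 * sex))
case <- rbinom(n, 1, plogis(-3 + 0.9 * smoking_ever +
                              0.04 * (age - 57) + 0.2 * tdi))
d <- data.frame(case, smoking_ever, age, sex, tdi)

# Assumed model ladder (illustration only).
covariate_sets <- list(
  unadjusted = character(0),
  age_sex    = c("age", "sex"),
  full       = c("age", "sex", "tdi")
)

# Fit one model and return the exposure OR with a Wald 95% CI.
fit_one <- function(covs) {
  f   <- reformulate(c("smoking_ever", covs), response = "case")
  fit <- glm(f, data = d, family = binomial())
  est <- coef(summary(fit))["smoking_ever", ]
  b   <- est[["Estimate"]]
  se  <- est[["Std. Error"]]
  c(OR = exp(b), CI_lower = exp(b - 1.96 * se), CI_upper = exp(b + 1.96 * se))
}

res <- do.call(rbind, lapply(covariate_sets, fit_one))
```

The result has one row per model, which is the shape a forest plot typically expects: an estimate column plus lower and upper bounds.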
Visualisation
- plot_forest(): forest plot (PNG / PDF / JPG / TIFF, 300 dpi)
- plot_tableone(): Table 1 (DOCX / HTML / PDF / PNG)
Utilities & Diagnostics
- ops_setup(): environment health check (dx CLI, RAP auth, R packages)
- ops_toy(): generate synthetic UKB-like data for development and testing
- ops_na(): summarise missing values (NA and "") across all columns
- ops_snapshot(): record pipeline checkpoints and track dataset changes
- ops_snapshot_cols(): retrieve column list from a saved snapshot
- ops_snapshot_diff(): compare columns between two snapshots
- ops_snapshot_remove(): remove columns added after a given snapshot
- ops_set_safe_cols(): define protected columns that ops_snapshot_remove will not drop
- ops_withdraw(): exclude UKB withdrawn participants from a cohort
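Two of these diagnostics are simple to picture in base R: counting values that are either NA or an empty string, and comparing column names before and after a pipeline step. The sketch below shows the idea only and does not reproduce the output format of ops_na() or ops_snapshot_diff().

```r
df <- data.frame(
  eid = 1:4,
  sex = c("Female", "", "Male", NA),
  bmi = c(24.1, NA, 30.2, 27.5),
  stringsAsFactors = FALSE
)

# Count values that are NA or the empty string, per column.
n_missing <- vapply(
  df,
  function(x) sum(is.na(x) | (is.character(x) & x == "")),
  integer(1)
)

# Column-level diff between two points in a pipeline.
cols_before <- names(df)
df$bmi_cat  <- cut(df$bmi, breaks = c(0, 25, 30, Inf), right = FALSE)
added_cols  <- setdiff(names(df), cols_before)
```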
GRS Pipeline
- grs_check(): validate SNP weights file
- grs_bgen2pgen(): convert BGEN → PGEN on RAP (submits cloud jobs)
- grs_score(): score GRS across chromosomes with plink2
- grs_standardize() / grs_zscore(): Z-score standardisation
- grs_validate(): OR/HR per SD, high vs low, trend, AUC/C-index
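Conceptually, a genetic risk score is a weighted sum of effect-allele dosages, which is then Z-scored so that effects can be reported per standard deviation. The sketch below shows that arithmetic on simulated dosages; the SNP names and weights are hypothetical, and on RAP the scoring step is carried out by plink2 rather than in R.

```r
set.seed(1)
n_people <- 200
weights  <- c(rs1 = 0.12, rs2 = -0.05, rs3 = 0.30)  # hypothetical per-allele betas

# Dosage matrix: rows are participants, columns are SNPs (0, 1, or 2 alleles).
dosage <- matrix(
  rbinom(n_people * length(weights), size = 2, prob = 0.3),
  nrow = n_people,
  dimnames = list(NULL, names(weights))
)

grs   <- as.vector(dosage %*% weights)   # weighted allele sum per participant
grs_z <- (grs - mean(grs)) / sd(grs)     # Z-score: mean 0, SD 1
```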
Documentation
Full vignettes and function reference:
https://evanbio.github.io/ukbflow/
Contributing
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md.
License
MIT License © 2026 Yibin Zhou
Made with ❤️ by Yibin Zhou