Description

Streamlined Workflow for UK Biobank Data Extraction, Analysis, and Visualization.

Provides a streamlined workflow for UK Biobank cloud-based analysis on the Research Analysis Platform (RAP). Includes tools for phenotype extraction and decoding, variable derivation, survival and association analysis, genetic risk score computation, and publication-quality visualization. For details on the UK Biobank resource, see Bycroft et al. (2018) <doi:10.1038/s41586-018-0579-z>.

ukbflow

RAP-Native R Workflow for UK Biobank Analysis


[!NOTE] 🎉 2026-04: ukbflow is now available on CRAN! Install with install.packages("ukbflow").

Overview

ukbflow provides a streamlined, RAP-native R workflow for UK Biobank analysis, from phenotype extraction and disease derivation to association analysis and publication-quality figures.

UK Biobank Data Policy (2024+): Individual-level data must remain within the RAP environment. Only summary-level outputs may be downloaded locally. All ukbflow functions are designed with this constraint in mind.

library(ukbflow)

# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()

# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name         = "lung",
                  event_col    = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date  = as.Date("2022-10-31"),
                  death_col    = "p40000_i0")

# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

# Cox regression: smoking → lung cancer (3-model adjustment)
res <- assoc_coxph(data,
  outcome_col  = "lung_icd10",
  time_col     = "lung_followup_years",
  exposure_col = "smoking_ever",
  covariates   = c("p21022", "p31", "p22189"))

# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data      = res_df,
  est       = res_df$HR,
  lower     = res_df$CI_lower,
  upper     = res_df$CI_upper,
  ci_column = 2L
)

Installation

# From CRAN (recommended)
install.packages("ukbflow")

# Latest development version from GitHub
pak::pkg_install("evanbio/ukbflow")

# or
remotes::install_github("evanbio/ukbflow")

Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)

pip install dxpy

Core Features

| Layer | Key Functions | Description |
| --- | --- | --- |
| Connection | auth_login, auth_select_project | Authenticate to RAP via dx-toolkit |
| Data Access | fetch_metadata, extract_batch, job_wait | Retrieve phenotype data from UKB dataset on RAP |
| Data Processing | decode_names, decode_values, derive_icd10, derive_followup, derive_case | Harmonize multi-source records; derive analysis-ready cohort |
| Association Analysis | assoc_coxph, assoc_logistic, assoc_subgroup | Three-model adjustment; subgroup & trend analysis |
| Genomic Scoring | grs_bgen2pgen, grs_score, grs_standardize | Distributed plink2 scoring on RAP worker nodes |
| Visualization | plot_forest, plot_tableone | Publication-ready figures & tables |
| Utilities | ops_setup, ops_toy, ops_na, ops_snapshot, ops_withdraw | Environment check, synthetic data, pipeline diagnostics, and cohort management |

Function Reference

Auth & Fetch
  • auth_login(), auth_status(), auth_logout(), auth_list_projects(), auth_select_project(): RAP authentication
  • fetch_ls(), fetch_tree(), fetch_url(), fetch_file(): RAP file system
  • fetch_metadata(), fetch_field(): UKB metadata shortcuts
Extract & Decode
  • extract_ls(), extract_pheno(), extract_batch(): phenotype extraction
  • decode_values(): integer codes → human-readable labels
  • decode_names(): field IDs → snake_case column names
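
The idea behind decoding integer codes can be sketched in base R. This is only an illustration of the lookup step, not ukbflow's implementation: the helper decode_codes() and the small coding table are hypothetical, using the smoking-status categories (field 20116) as example values.

```r
# Illustrative coding table for smoking status (field 20116).
# The table and the helper below are assumptions made for this sketch.
coding <- data.frame(
  code  = c(-3L, 0L, 1L, 2L),
  label = c("Prefer not to answer", "Never", "Previous", "Current"),
  stringsAsFactors = FALSE
)

# Map each integer code to its label; codes absent from the table become NA.
decode_codes <- function(x, coding) {
  coding$label[match(x, coding$code)]
}

raw     <- c(0L, 2L, 1L, -3L, NA)
decoded <- decode_codes(raw, coding)
print(decoded)
```

Using match() keeps the input order and length, so the decoded vector can be assigned straight back into the column it came from.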
Job Monitoring
  • job_status(): query job status by ID
  • job_wait(): block until job completes (with timeout)
  • job_path(): get output path of a completed job
  • job_result(): retrieve job result object
  • job_ls(): list recent jobs
Derive: Phenotypes
  • derive_missing(): handle "Do not know" / "Prefer not to answer"
  • derive_covariate(): type conversion + summary
  • derive_cut(): bin continuous variables
  • derive_selfreport(): self-reported disease status + date
  • derive_hes(): HES inpatient ICD-10
  • derive_first_occurrence(): First Occurrence fields
  • derive_cancer_registry(): cancer registry
  • derive_death_registry(): death registry
  • derive_icd10(): combine sources (wrapper)
  • derive_case(): merge self-report + ICD-10
Derive: Survival
  • derive_timing(): prevalent vs. incident classification
  • derive_age(): age at event
  • derive_followup(): follow-up end date and duration
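
A common way to define follow-up is to end it at the earliest of the event date, the death date, and an administrative censoring date. The sketch below shows that logic in base R, under the assumption that this matches the intended definition; followup_years() and its column names are hypothetical and are not the package API.

```r
# Follow-up ends at the earliest of event, death, or administrative censoring.
followup_years <- function(baseline, event, death, censor_date) {
  # na.rm = TRUE ignores missing event or death dates for a participant.
  end <- pmin(event, death, censor_date, na.rm = TRUE)
  data.frame(
    end_date = end,
    years    = as.numeric(difftime(end, baseline, units = "days")) / 365.25,
    # Event indicator: 1 only if the event happened on or before the end date.
    status   = as.integer(!is.na(event) & event <= end)
  )
}

baseline <- as.Date(c("2008-01-01", "2009-03-15", "2010-07-01"))
event    <- as.Date(c("2015-06-30", NA, NA))
death    <- as.Date(c(NA, "2018-03-15", NA))

fu <- followup_years(baseline, event, death, as.Date("2022-10-31"))
print(fu)
```

In this toy input the first participant has the event, the second is censored at death, and the third is censored at the administrative date.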
Association Analysis
  • assoc_coxph() / assoc_cox(): Cox proportional hazards (HR)
  • assoc_logistic() / assoc_logit(): logistic regression (OR)
  • assoc_linear() / assoc_lm(): linear regression (β)
  • assoc_coxph_zph(): proportional hazards assumption test
  • assoc_subgroup(): stratified analysis + interaction LRT
  • assoc_trend(): dose-response trend + p_trend
  • assoc_competing(): Fine-Gray competing risks (SHR)
  • assoc_lag(): lagged exposure sensitivity analysis
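
The README mentions a three-model adjustment without spelling it out. One common reading, assumed here, is a sequence of nested models: unadjusted, age- and sex-adjusted, and fully adjusted. The sketch uses base R glm() on simulated data; the variable names, the covariate sets, and the Wald confidence intervals are illustrative choices, not the package's behavior.

```r
set.seed(1)
n <- 2000

# Simulated cohort; all variables are synthetic.
d <- data.frame(
  age         = rnorm(n, 57, 8),
  sex         = rbinom(n, 1, 0.5),
  deprivation = rnorm(n),
  smoker      = rbinom(n, 1, 0.3)
)
lp     <- -3 + 0.7 * d$smoker + 0.03 * (d$age - 57) + 0.2 * d$sex
d$case <- rbinom(n, 1, plogis(lp))

# Assumed nesting of covariate sets for the three models.
covariate_sets <- list(
  unadjusted = character(0),
  age_sex    = c("age", "sex"),
  full       = c("age", "sex", "deprivation")
)

# Fit one logistic model and return the exposure OR with a Wald 95% CI.
fit_or <- function(covs) {
  f   <- reformulate(c("smoker", covs), response = "case")
  fit <- glm(f, data = d, family = binomial())
  est <- coef(summary(fit))["smoker", ]
  c(OR       = exp(est[["Estimate"]]),
    CI_lower = exp(est[["Estimate"]] - 1.96 * est[["Std. Error"]]),
    CI_upper = exp(est[["Estimate"]] + 1.96 * est[["Std. Error"]]))
}

res <- as.data.frame(t(sapply(covariate_sets, fit_or)))
print(res)
```

Comparing the exposure estimate across the three rows shows how much of the crude association is explained by the added covariates.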
Visualisation
  • plot_forest(): forest plot (PNG / PDF / JPG / TIFF, 300 dpi)
  • plot_tableone(): Table 1 (DOCX / HTML / PDF / PNG)
Utilities & Diagnostics
  • ops_setup(): environment health check (dx CLI, RAP auth, R packages)
  • ops_toy(): generate synthetic UKB-like data for development and testing
  • ops_na(): summarise missing values (NA and "") across all columns
  • ops_snapshot(): record pipeline checkpoints and track dataset changes
  • ops_snapshot_cols(): retrieve column list from a saved snapshot
  • ops_snapshot_diff(): compare columns between two snapshots
  • ops_snapshot_remove(): remove columns added after a given snapshot
  • ops_set_safe_cols(): define protected columns that ops_snapshot_remove will not drop
  • ops_withdraw(): exclude UKB withdrawn participants from a cohort
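
Counting both NA and empty strings as missing, as described for ops_na(), can be sketched in base R. The helper count_missing() and its output columns are hypothetical; the real function's output format may differ.

```r
# Count NA values and empty strings per column (empty strings only apply
# to character columns).
count_missing <- function(df) {
  n_missing <- vapply(df, function(x) {
    sum(is.na(x) | (is.character(x) & x == ""))
  }, integer(1))
  data.frame(
    column      = names(df),
    n_missing   = unname(n_missing),
    pct_missing = 100 * unname(n_missing) / nrow(df)
  )
}

toy <- data.frame(
  id      = 1:4,
  smoking = c("Never", "", NA, "Current"),
  bmi     = c(22.1, NA, 30.2, 27.5),
  stringsAsFactors = FALSE
)

out <- count_missing(toy)
print(out)
```

Treating "" as missing matters for exported phenotype tables, where absent text values often arrive as empty strings rather than NA.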
GRS Pipeline
  • grs_check(): validate SNP weights file
  • grs_bgen2pgen(): convert BGEN → PGEN on RAP (submits cloud jobs)
  • grs_score(): score GRS across chromosomes with plink2
  • grs_standardize() / grs_zscore(): Z-score standardisation
  • grs_validate(): OR/HR per SD, high vs low, trend, AUC/C-index
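
Z-score standardisation rescales a raw score to mean 0 and standard deviation 1, so that effect estimates can be read per SD of the score. A minimal base R sketch follows, using simulated scores; zscore() is a hypothetical helper and not the package function.

```r
# Standardise a numeric vector to mean 0 and SD 1, ignoring missing values.
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

set.seed(2026)
grs   <- rnorm(1000, mean = 0.8, sd = 0.15)  # simulated raw scores
grs_z <- zscore(grs)

print(round(c(mean = mean(grs_z), sd = sd(grs_z)), 6))
```

After this step, a regression coefficient for grs_z corresponds to a one-SD increase in the score.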

Documentation

Full vignettes and function reference:

https://evanbio.github.io/ukbflow/


Contributing

Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md.


License

MIT License © 2026 Yibin Zhou


Made with ❤️ by Yibin Zhou


Metadata

Version

0.3.4

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows