Description

Clinical Publication.

Accelerate the process from clinical data to medical publication, including clinical data cleaning, significant result screening, and the generation of publish-ready tables and figures.

clinpubr: Clinical Publication

Overview

clinpubr is an R package designed to streamline the workflow from clinical data processing to publication-ready outputs. It provides tools for clinical data cleaning, significant result screening, and generating tables/figures suitable for medical journals.

Key Features

  • Clinical Data Cleaning: Functions to handle missing values, standardize units, convert dates, and clean numerical/categorical variables.
  • Result Screening: Screen regression and interaction analysis results across common variable transformations to identify key findings.
  • Publication-Ready Outputs: Generate baseline characteristic tables, forest plots, RCS curves, and other visualizations formatted for medical publications.

Installation

You can install clinpubr from CRAN with:

install.packages("clinpubr")

Optional Dependencies

Some functions require additional packages for full functionality. The package will automatically prompt you to install missing packages when needed. If you want to install the package with all dependencies, you can use:

install.packages("clinpubr", dependencies = TRUE)
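
If you prefer to check for optional packages yourself before calling a function, a minimal base R sketch could look like the following. The helper `missing_packages()` is hypothetical and not part of clinpubr, and the package names passed to it are placeholders.

```r
# Hypothetical helper (not part of clinpubr): return the names of packages
# that cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Placeholder package names; replace with the packages you actually need
to_install <- missing_packages(c("stats", "utils"))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```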

Basic Usage

Cleaning Tools

Example 1.1: Generate Data Overview and Cleaning Recommendations

library(clinpubr)

# Sample messy data with various quality issues
messy_data <- data.frame(
  id = 1:15,
  # Numeric with outliers
  bmi = c(
    22.5, 23.1, 24.2, 21.8, 25.0, 23.5, 999, 24.1, 22.9, 23.8,
    21.5, 24.3, 23.0, 22.7, 23.9
  ),
  # Character with case inconsistency
  city = c(
    "Beijing", "BEIJING", "beijing", "Shanghai", "SHANGHAI",
    "Guangzhou", "chengdu", "CHENGDU", "Shenzhen", "shenzhen",
    "Beijing", "Shanghai", "Guangzhou", "Chengdu", "Shenzhen"
  ),
  # Numeric with negative values in predominantly positive
  height = c(
    1.75, 1.80, 1.65, 1.70, 1.85, 1.78, 1.68, 1.72, 1.76, 1.82,
    1.60, 1.62, 1.74, 179, -1
  ),
  # Date with suspicious year
  visit_date = as.Date(c(
    "2020-01-15", "2020-02-20", "2020-03-10", "2019-05-18", "2020-06-22",
    "2018-07-30", "2020-08-12", "2020-09-25", "2020-10-08", "2020-11-15",
    "2020-12-20", "1900-01-01", "2030-02-28", "2020-03-15", "2020-04-20"
  )),
  # Numeric stored as character
  age = c(
    "25", "26", "27", "28", "29", "30", "31", "32", "33", "34",
    "35", "unknown", "36", "37", "38"
  ),
  stringsAsFactors = FALSE
)

overview <- data_overview(messy_data)
#> === Data Overview Summary ===
#> Dataset: 15 rows, 6 columns
#> 
#> Variable Types:
#>   numeric   : 3 variables
#>   character : 2 variables
#>   date      : 1 variables
#> 
#> Found 6 potential quality issues:
#>   numeric_as_character     : 1 cases
#>   outliers                 : 2 cases
#>   negative_in_positive     : 1 cases
#>   suspicious_dates         : 1 cases
#>   case_issues              : 1 cases
#> 
#> Recommendations:
#>   - Consider converting these character variables to numeric: age
#>   - Review outliers in these numeric variables: bmi, height
#>   - Numeric variables with mostly positive values but containing negatives: height
#>   - Review suspicious dates (year < 1910 or > current year) in: visit_date
#>   - These character variables have case inconsistency issues: city - consider standardizing to lowercase or uppercase

print(overview$quality_issues$case_issues)
#> $city
#> $city$n_original
#> [1] 11
#> 
#> $city$n_normalized
#> [1] 5
#> 
#> $city$reduction
#> [1] 6
#> 
#> $city$examples
#> $city$examples$beijing
#> [1] "Beijing" "BEIJING" "beijing"
#> 
#> $city$examples$shanghai
#> [1] "Shanghai" "SHANGHAI"
#> 
#> $city$examples$chengdu
#> [1] "chengdu" "CHENGDU" "Chengdu"
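
The recommendations above can be acted on with plain base R. The sketch below uses a small stand-in data frame rather than `messy_data`, and the BMI plausibility range (10 to 80) is an assumption chosen for illustration only.

```r
# Small stand-in data with the same kinds of issues as messy_data
df <- data.frame(
  bmi = c(22.5, 23.1, 999, 24.1),
  city = c("Beijing", "BEIJING", "beijing", "Shanghai"),
  age = c("25", "26", "unknown", "28"),
  stringsAsFactors = FALSE
)

# Standardize case so that spelling variants collapse into one level
df$city <- tolower(df$city)

# Convert numeric-as-character; non-numeric entries become NA
df$age <- suppressWarnings(as.numeric(df$age))

# Set implausible BMI values to NA (assumed plausible range: 10 to 80)
df$bmi[df$bmi < 10 | df$bmi > 80] <- NA

print(df)
```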

Example 1.2: Screen Multi-Table Cohort by Entry and Anchor Rules

patient <- data.frame(pid = 1:4)
admission <- data.frame(
  pid = c(1, 1, 2, 3, 4),
  vid = c(11, 12, 21, 31, 41),
  admit_day = c(1, 5, 2, 3, 4)
)
diagnosis <- data.frame(
  pid = c(1, 2, 3, 4),
  vid = c(11, 21, 31, 41),
  dx_day = c(1, 2, 3, 4),
  icd = c("I10", "I10", "J18", "I11")
)
lab <- data.frame(
  pid = c(1, 1, 2, 2, 3, 4),
  vid = c(11, 12, 21, 21, 31, 41),
  lab_day = c(1, 5, 2, 5, 3, 4),
  Hb = c(9.8, 10.6, 10.7, 5, 8.9, 9.1)
)

# Keep patients with any I10 diagnosis, then keep records from first Hb > 10 onward, and join tables together
res <- screen_data_list(
  data_list = list(patient = patient, admission = admission, diagnosis = diagnosis, lab = lab),
  entry_expr = any(icd == "I10"),
  entry_level = "patient_id",
  anchor_expr = any(Hb > 10),
  anchor_level = "visit_id",
  anchor_window = "from_first_anchor",
  patient_id_map = "pid",
  visit_id_map = "vid",
  date_map = c(admission = "admit_day", diagnosis = "dx_day", lab = "lab_day"),
  output = "joined"
)

knitr::kable(res)
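| patient_id | visit_id | date | icd | Hb   |
|-----------:|---------:|-----:|:----|-----:|
| 1          | 12       | 5    | NA  | 10.6 |
| 2          | 21       | 2    | I10 | 10.7 |
| 2          | 21       | 5    | NA  | 5.0  |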
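
One way to read the result above is as a two-step filter: keep patients who meet the entry rule, then keep each patient's records from the first anchor date onward. The base R sketch below reproduces the lab rows under that reading. It is an illustration of the logic, not the implementation of `screen_data_list()`.

```r
diagnosis <- data.frame(
  pid = c(1, 2, 3, 4),
  icd = c("I10", "I10", "J18", "I11")
)
lab <- data.frame(
  pid = c(1, 1, 2, 2, 3, 4),
  vid = c(11, 12, 21, 21, 31, 41),
  lab_day = c(1, 5, 2, 5, 3, 4),
  Hb = c(9.8, 10.6, 10.7, 5, 8.9, 9.1)
)

# Entry rule: patients with any I10 diagnosis
entered <- unique(diagnosis$pid[diagnosis$icd == "I10"])
lab_in <- lab[lab$pid %in% entered, ]

# Anchor rule: first day on which Hb > 10, per patient
anchors <- lab_in[lab_in$Hb > 10, ]
first_anchor <- tapply(anchors$lab_day, anchors$pid, min)

# Window: keep records from the first anchor day onward
keep <- lab_in$lab_day >= first_anchor[as.character(lab_in$pid)]
kept <- lab_in[keep, ]
print(kept)
```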

Example 1.3: Standardize Values in Medical Records

# Sample messy data
messy_data <- data.frame(values = c("12.3", "0..45", "  67 ", "", "abandon"))
clean_data <- value_initial_cleaning(messy_data$values)
print(clean_data)
#> [1] "12.3"    "0.45"    "67"      NA        "abandon"
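
For intuition, the output above can be approximated with a few base R string operations. This is only a sketch of the kind of cleaning involved; `value_initial_cleaning()` may apply further rules that are not shown here.

```r
x <- c("12.3", "0..45", "  67 ", "", "abandon")

y <- trimws(x)                 # drop leading and trailing whitespace
y <- gsub("\\.{2,}", ".", y)   # collapse repeated decimal points
y[y == ""] <- NA               # treat empty strings as missing

print(y)
```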

Example 1.4: Check Non-numerical Values

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(check_nonnum(x))
#> [1] "1.2(XXX)" "5-8POS"   "NS"       "FULL"

This function returns the values that cannot be read as plain numbers, which helps you choose an appropriate method to handle them.

Example 1.5: Extracting Numerical Values from Text

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(extract_num(x))
#> [1] 1.20 1.50 0.82 5.00   NA   NA

print(extract_num(x,
  res_type = "first", # Extract the first number
  multimatch2na = TRUE, # Convert illegal multiple matches to NA
  zero_regexp = "NEG|NS", # Convert "NEG" and "NS" (matched using regex) to 0
  max_regexp = "FULL", # Convert "FULL" (matched using regex) to some specified quantile
  max_quantile = 0.95
))
#> [1] 1.20 1.50 0.82   NA 0.00 1.47
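
The value 1.47 assigned to "FULL" is consistent with the 0.95 quantile of the three plainly numeric entries (1.2, 1.5, 0.82), computed with R's default quantile method. The check below assumes that the zero substituted for "NS" and the NA from the multiple match are not part of the quantile calculation.

```r
# Plainly numeric entries from the example vector
vals <- c(1.2, 1.5, 0.82)

# 0.95 quantile with R's default (type 7) interpolation
q <- quantile(vals, probs = 0.95, names = FALSE)
print(q)
```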

Other Cleaning Functions

  • to_date(): Convert text to dates; handles mixed formats.
  • unit_view() and unit_standardize(): Provide a pipeline to standardize conflicting units.
  • cut_by(): Split numeric variables into factors, with a variety of splitting options and automatic labeling.
  • And more…
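
For comparison, splitting a numeric variable into a factor in base R looks like the sketch below. It uses `cut()` from base R, not `cut_by()`, and the breakpoints and labels are arbitrary choices for illustration.

```r
age <- c(17, 25, 30, 42, 50, 60)

# Right-closed intervals: (-Inf, 30], (30, 50], (50, Inf)
age_group <- cut(
  age,
  breaks = c(-Inf, 30, 50, Inf),
  labels = c("<=30", "(30,50]", ">50")
)

print(table(age_group))
```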

Screening Results to Identify Potential Findings

data(cancer, package = "survival")

# Screening for potential findings with regression models in the cancer dataset
scan_result <- regression_scan(cancer, y = "status", time = "time", save_table = FALSE)
#> Taking all variables as predictors
knitr::kable(scan_result)
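|   | predictor | nvalid | original.HR | original.pval | original.padj | logarithm.HR | logarithm.pval | logarithm.padj | categorized.HR | categorized.pval | categorized.padj | rcs.overall.pval | rcs.overall.padj | rcs.nonlinear.pval | rcs.nonlinear.padj | best.var.trans |
|:--|:----------|-------:|------------:|--------------:|--------------:|-------------:|---------------:|---------------:|---------------:|-----------------:|-----------------:|-----------------:|-----------------:|-------------------:|-------------------:|:---------------|
| 4 | ph.ecog   | 227 | 1.6095320 | 0.0000269 | 0.0002154 | NA | NA | NA | NA | 0.0001530 | 0.0012237 | NA | NA | NA | NA | original |
| 6 | pat.karno | 225 | 0.9803456 | 0.0002824 | 0.0011296 | 0.2709544 | 0.0003071 | 0.0015356 | 0.5755627 | 0.0006608 | 0.0026431 | 0.0025848 | 0.0155086 | 0.5908952 | 0.8863427 | original |
| 3 | sex       | 228 | 0.5880028 | 0.0014912 | 0.0039766 | NA | NA | NA | 0.5880028 | 0.0014912 | 0.0039766 | NA | NA | NA | NA | categorized |
| 5 | ph.karno  | 227 | 0.9836863 | 0.0049579 | 0.0099157 | 0.3184168 | 0.0079468 | 0.0198669 | 0.6352465 | 0.0077670 | 0.0155339 | 0.0128462 | 0.0385385 | 0.2307961 | 0.6848245 | original |
| 2 | age       | 228 | 1.0188965 | 0.0418531 | 0.0669650 | 3.0256773 | 0.0466926 | 0.0778209 | 1.1440790 | 0.3910647 | 0.3957558 | 0.0825447 | 0.1650894 | 0.3424123 | 0.6848245 | original |
| 1 | inst      | 227 | 0.9903692 | 0.3459838 | 0.4613117 | 0.9292046 | 0.3181432 | 0.3976790 | 0.8384047 | 0.2600040 | 0.3466720 | 0.8175277 | 0.8707131 | 0.9839705 | 0.9839705 | categorized |
| 7 | meal.cal  | 181 | 0.9998762 | 0.5929402 | 0.6776459 | 0.9141580 | 0.6128095 | 0.6128095 | 0.8620604 | 0.3957558 | 0.3957558 | 0.8707131 | 0.8707131 | 0.8227256 | 0.9839705 | categorized |
| 8 | wt.loss   | 214 | 1.0013201 | 0.8281974 | 0.8281974 | NA | NA | NA | 1.3190185 | 0.0909098 | 0.1454557 | 0.1128907 | 0.1693361 | 0.0514936 | 0.3089618 | rcs.nonlinear |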
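
The `padj` columns hold p-values adjusted for multiple testing. The `original.padj` values shown above are consistent with a Benjamini-Hochberg adjustment of `original.pval` across the eight predictors, which can be checked with base R. That `regression_scan()` uses this method by default is an inference from the numbers, not something stated here.

```r
# original.pval column from the table above, in the displayed row order
pval <- c(
  ph.ecog = 0.0000269, pat.karno = 0.0002824, sex = 0.0014912,
  ph.karno = 0.0049579, age = 0.0418531, inst = 0.3459838,
  meal.cal = 0.5929402, wt.loss = 0.8281974
)

# Benjamini-Hochberg adjustment; agrees with original.padj up to
# rounding of the displayed p-values
padj <- p.adjust(pval, method = "BH")
print(round(padj, 7))
```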

Generating Publication-Ready Tables and Figures

Example 3.1: Exclusion Counting for Cohort Selection

cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)
res <- exclusion_count(
  cohort,
  age < 18,
  is.na(value),
  dementia == TRUE,
  .criteria_names = c(
    "Age < 18 years",
    "Missing value",
    "History of dementia"
  )
)
#> Warning in exclusion_count(cohort, age < 18, is.na(value), dementia == TRUE, :
#> Criterion 'Age < 18 years' resulted in NA values. These rows have been excluded
#> by default. Consider adding an explicit check for missing values (e.g.,
#> is.na(variable)) as a preceding criterion.
knitr::kable(res) # Display the table
| Criteria            | N |
|:--------------------|--:|
| Initial N           | 6 |
| Age < 18 years      | 2 |
| Missing value       | 2 |
| History of dementia | 1 |
| Final N             | 1 |
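
The counts above follow from applying the criteria one after another, with rows where a criterion evaluates to NA counted as excluded (as the warning describes). A base R sketch of that sequential logic, using a hypothetical helper `apply_criterion()` that is not part of clinpubr:

```r
cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Hypothetical helper: drop flagged rows, treating NA flags as excluded
apply_criterion <- function(df, flag) {
  flag[is.na(flag)] <- TRUE
  list(kept = df[!flag, , drop = FALSE], n_excluded = sum(flag))
}

s1 <- apply_criterion(cohort, cohort$age < 18)
s2 <- apply_criterion(s1$kept, is.na(s1$kept$value))
s3 <- apply_criterion(s2$kept, s2$kept$dementia)

counts <- c(
  "Initial N" = nrow(cohort),
  "Age < 18 years" = s1$n_excluded,
  "Missing value" = s2$n_excluded,
  "History of dementia" = s3$n_excluded,
  "Final N" = nrow(s3$kept)
)
print(counts)
```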

Example 3.2: Automatic Type Inference and Baseline Table Generation

var_types <- get_var_types(mtcars, strata = "vs") # Automatically infer variable types
print(var_types)
#> $factor_vars
#> [1] "cyl"  "vs"   "am"   "gear"
#> 
#> $exact_vars
#> [1] "cyl"  "gear"
#> 
#> $nonnormal_vars
#> [1] "drat" "carb"
#> 
#> $omit_vars
#> NULL
#> 
#> $strata
#> [1] "vs"
#> 
#> attr(,"class")
#> [1] "var_types"

tables <- baseline_table(mtcars,
  var_types = var_types, contDigits = 1, save_table = FALSE,
  filename = "baseline.csv", seed = 1 # set seed for the simulated Fisher's exact test
)
knitr::kable(tables$baseline) # Display the table
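|                     | Overall       | vs: 0         | vs: 1          | p      | test    |
|:--------------------|:--------------|:--------------|:---------------|:-------|:--------|
| n                   | 32            | 18            | 14             |        |         |
| mpg (mean (SD))     | 20.1 (6.0)    | 16.6 (3.9)    | 24.6 (5.4)     | <0.001 |         |
| cyl (%)             |               |               |                | <0.001 | exact   |
| 4                   | 11 (34.4)     | 1 (5.6)       | 10 (71.4)      |        |         |
| 6                   | 7 (21.9)      | 3 (16.7)      | 4 (28.6)       |        |         |
| 8                   | 14 (43.8)     | 14 (77.8)     | 0 (0.0)        |        |         |
| disp (mean (SD))    | 230.7 (123.9) | 307.1 (106.8) | 132.5 (56.9)   | <0.001 |         |
| hp (mean (SD))      | 146.7 (68.6)  | 189.7 (60.3)  | 91.4 (24.4)    | <0.001 |         |
| drat (median [IQR]) | 3.7 [3.1, 3.9] | 3.2 [3.1, 3.7] | 3.9 [3.7, 4.1] | 0.013  | nonnorm |
| wt (mean (SD))      | 3.2 (1.0)     | 3.7 (0.9)     | 2.6 (0.7)      | 0.001  |         |
| qsec (mean (SD))    | 17.8 (1.8)    | 16.7 (1.1)    | 19.3 (1.4)     | <0.001 |         |
| am = 1 (%)          | 13 (40.6)     | 6 (33.3)      | 7 (50.0)       | 0.556  |         |
| gear (%)            |               |               |                | 0.001  | exact   |
| 3                   | 15 (46.9)     | 12 (66.7)     | 3 (21.4)       |        |         |
| 4                   | 12 (37.5)     | 2 (11.1)      | 10 (71.4)      |        |         |
| 5                   | 5 (15.6)      | 4 (22.2)      | 1 (7.1)        |        |         |
| carb (median [IQR]) | 2.0 [2.0, 4.0] | 4.0 [2.2, 4.0] | 1.5 [1.0, 2.0] | <0.001 | nonnorm |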

Example 3.3: RCS Plot

data(cancer, package = "survival")

# Performing Cox regression, which is inferred from `y` and `time`
p <- rcs_plot(cancer, x = "age", y = "status", time = "time", covars = c("sex", "ph.karno"), save_plot = FALSE)
#> Warning in predictor_effect_plot(data = data, x = x, y = y, time = time, : 1
#> incomplete cases excluded.
plot(p)

Example 3.4: Interaction Plot

data(cancer, package = "survival")

# Generating interaction plot of both linear and RCS models
p <- interaction_plot(cancer,
  y = "status", time = "time", predictor = "age",
  group_var = "sex", save_plot = FALSE
)
plot(p$lin)
plot(p$rcs)

Example 3.5: Regression Forest Plot

data(cancer, package = "survival")
cancer$dead <- cancer$status == 2 # Preparing a binary variable for logistic regression
cancer$`age per 1 sd` <- c(scale(cancer$age)) # Standardizing age

# Performing multivariate logistic regression
p1 <- regression_forest(cancer,
  model_vars = c("age per 1 sd", "sex", "wt.loss"), y = "dead",
  as_univariate = FALSE, save_plot = FALSE
)
plot(p1)

p2 <- regression_forest(
  cancer,
  model_vars = list(
    Crude = c("age per 1 sd"),
    Model1 = c("age per 1 sd", "sex"),
    Model2 = c("age per 1 sd", "sex", "wt.loss")
  ),
  y = "dead",
  save_plot = FALSE
)
plot(p2)

Example 3.6: Subgroup Forest Plot

data(cancer, package = "survival")
# coxph model with time assigned
p <- subgroup_forest(cancer,
  subgroup_vars = c("age", "sex", "wt.loss"), x = "ph.ecog", y = "status",
  time = "time", covars = "ph.karno", ticks_at = c(1, 2), save_plot = FALSE
)
plot(p)

Example 3.7: Classification Model Performance

# Building models with example data
data(cancer, package = "survival")
df <- kidney
df$dead <- ifelse(df$time <= 100 & df$status == 0, NA, df$time <= 100)
df <- na.omit(df[, -c(1:3)])

model0 <- glm(dead ~ age + frail, family = binomial(), data = df)
model1 <- glm(dead ~ ., family = binomial(), data = df)
df$base_pred <- predict(model0, type = "response")
df$full_pred <- predict(model1, type = "response")

# Generating most of the useful plots and metrics for model comparison
results <- classif_model_compare(df, "dead", c("base_pred", "full_pred"), save_output = FALSE)
#> Assuming 'TRUE' is [Event] and 'FALSE' is [non-Event]

knitr::kable(results$metric_table)
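|   | Model     | AUC                  | PRAUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1    | Kappa | Brier | cutoff | Youden | HosLem |
|:--|:----------|:---------------------|------:|---------:|------------:|------------:|---------------:|---------------:|------:|------:|------:|-------:|-------:|-------:|
| 2 | full_pred | 0.915 (0.847, 0.984) | 0.885 | 0.839    | 0.8         | 0.889       | 0.903          | 0.774          | 0.848 | 0.677 | 0.114 | 0.626  | 0.689  | 0.944  |
| 1 | base_pred | 0.822 (0.711, 0.933) | 0.766 | 0.806    | 0.8         | 0.815       | 0.848          | 0.759          | 0.824 | 0.610 | 0.171 | 0.490  | 0.615  | 0.405  |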
plot(results$roc_plot)
plot(results$pr_plot)
plot(results$calibration_plot)
plot(results$dca_plot)
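
Several columns of the metric table are related by standard formulas. As a quick consistency check, the Youden index and F1 score for `full_pred` can be recomputed from the displayed sensitivity, specificity, and positive predictive value. This assumes the package follows the usual definitions (Youden = sensitivity + specificity - 1; F1 = harmonic mean of precision and recall).

```r
# Displayed values for full_pred in the metric table
sens <- 0.8
spec <- 0.889
ppv <- 0.903

youden <- sens + spec - 1
f1 <- 2 * ppv * sens / (ppv + sens)

# Rounds to 0.689 and 0.848, matching the table
print(round(c(Youden = youden, F1 = f1), 3))
```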

Example 3.8: Importance Plot

# Generating a dummy importance vector
set.seed(5)
dummy_importance <- runif(20, 0.2, 0.6)^5
names(dummy_importance) <- paste0("var", 1:20)

# Plotting variable importance, keeping only top 15 and splitting at 10
p <- importance_plot(dummy_importance, top_n = 15, split_at = 10, save_plot = FALSE)
plot(p)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).

Documentation

For detailed usage, refer to the package vignettes (coming soon) or the GitHub repository.

Contributing

Bug reports and feature requests are welcome via the issue tracker.

License

clinpubr is licensed under GPL (>= 3).

Metadata

Version

1.3.0
