Description

Structured Data Science Project Scaffolding.

Project scaffolding and workflow tools for reproducible data science. Manages packages, tracks data integrity, handles database connections, generates notebooks, and publishes to S3-compatible storage. More information at <https://framework.table1.org>.

Framework

An R package for structured, reproducible data analysis projects.

Status: Active development. APIs may change before version 1.0.

Quick Start

# Install from GitHub
remotes::install_github("table1/framework")

# One-time global setup (author info, preferences)
framework::setup()

# Create projects using your saved defaults
framework::new()
framework::new("my-analysis", "~/projects/my-analysis")
framework::new_presentation("quarterly-review", "~/talks/q4")
framework::new_course("stats-101", "~/teaching/stats")

Project Types

  • project (default): Full-featured research projects with notebooks, scripts, organized data management, and documentation
  • project_sensitive: Like project, but with additional privacy protections for sensitive data
  • course: Teaching materials with slides, assignments, and modules
  • presentation: Single talks with one Quarto file and minimal setup

Example project structure:

project/
├── notebooks/              # Exploratory analysis
├── scripts/                # Production pipelines
├── inputs/
│   ├── raw/                # Raw data (gitignored)
│   ├── intermediate/       # Cleaned datasets (gitignored)
│   ├── final/              # Curated analytic datasets (gitignored)
│   └── reference/          # External documentation (gitignored)
├── outputs/
│   ├── private/            # Tables, figures, models, cache (gitignored)
│   └── public/             # Share-ready artifacts
├── functions/              # Custom functions
├── docs/                   # Documentation
├── settings.yml            # Project configuration
├── framework.db            # Metadata tracking database
└── .env                    # Secrets (gitignored)

Why Framework?

Framework reduces boilerplate and enforces best practices:

  • Project scaffolding: Standardized directories, config-driven setup
  • Data management: Declarative data catalog, integrity tracking, encryption
  • Auto-loading: Load packages with one command; no more scattered library() calls
  • Pain-free renv: Reproducible package management without fighting renv
  • Caching: Smart caching for expensive computations
  • Database helpers: PostgreSQL, SQLite, DuckDB, MySQL with credential management
  • File formats: CSV, TSV, RDS, Stata (.dta), SPSS (.sav), SAS (.xpt, .sas7bdat)

Core Workflow

1. Initialize Your Session

library(framework)
scaffold()  # Loads packages, functions, config, standardizes working directory
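
To make the "loads functions" step concrete, here is a minimal base-R sketch of sourcing every script in a functions/ directory. It is an illustration only, not how scaffold() is implemented; source_functions_sketch() and the file names are hypothetical.

```r
# Hypothetical helper: source every .R file in a directory into an environment.
# This is NOT scaffold(); it only illustrates the idea of auto-loading functions.
source_functions_sketch <- function(dir, envir = globalenv()) {
  files <- sort(list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE))
  for (f in files) {
    sys.source(f, envir = envir)  # evaluate the file's code in `envir`
  }
  invisible(files)
}

# Usage with a throwaway directory standing in for functions/
demo_dir <- file.path(tempdir(), "functions_sketch")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("double_it <- function(x) x * 2", file.path(demo_dir, "helpers.R"))

demo_env <- new.env()
source_functions_sketch(demo_dir, envir = demo_env)
demo_env$double_it(21)
```

Sourcing into a dedicated environment keeps the demo from touching the global workspace; a real session would more likely load into the global environment.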

2. Create Notebooks & Scripts

# Quarto notebook (default)
make_notebook("exploration")    # → notebooks/exploration.qmd
make_qmd("analysis")            # Always Quarto
make_rmd("report")              # RMarkdown

# Presentations
make_revealjs("slides")         # reveal.js presentation

# Scripts
make_script("process-data")     # → scripts/process-data.R

# List available templates
stubs_list()

Custom stubs: Create a stubs/ directory with your own templates.

3. Load Data

Via config (recommended):

# settings.yml
data:
  inputs:
    raw:
      survey:
        path: inputs/raw/survey.csv
        type: csv
        locked: true  # Errors if file changes

# R
df <- data_load("inputs.raw.survey")
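
The dot-notation key mirrors the nesting under data: in settings.yml. A rough sketch of how such a key could be resolved against a nested list follows; resolve_entry_sketch() and the in-memory catalog are hypothetical stand-ins, not Framework internals.

```r
# Hypothetical in-memory copy of the `data:` section shown in settings.yml
catalog <- list(
  inputs = list(
    raw = list(
      survey = list(path = "inputs/raw/survey.csv", type = "csv", locked = TRUE)
    )
  )
)

# Hypothetical helper: walk a nested list one key segment at a time.
resolve_entry_sketch <- function(catalog, key) {
  parts <- strsplit(key, ".", fixed = TRUE)[[1]]
  node <- catalog
  for (part in parts) {
    if (!is.list(node) || is.null(node[[part]])) {
      stop("No catalog entry for key: ", key)
    }
    node <- node[[part]]
  }
  node
}

entry <- resolve_entry_sketch(catalog, "inputs.raw.survey")
entry$path
```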

Direct path:

df <- data_load("inputs/raw/my_file.csv")       # CSV
df <- data_load("inputs/raw/stata_file.dta")    # Stata
df <- data_load("inputs/raw/spss_file.sav")     # SPSS

Every read is logged with a SHA-256 hash for integrity tracking.
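
As an illustration of hash-based integrity tracking and the locked: true behavior, the sketch below records a file hash and raises an error once the file changes. The helper names are hypothetical. It uses tools::sha256sum() where the running R provides it (recent releases only) and falls back to tools::md5sum() otherwise, so it is not a drop-in equivalent of Framework's SHA-256 tracking.

```r
# Hypothetical helper: SHA-256 when tools::sha256sum() exists in this R,
# otherwise MD5 as a stand-in so the sketch still runs on older versions.
file_hash_sketch <- function(path) {
  tools_ns <- asNamespace("tools")
  hasher <- if (exists("sha256sum", envir = tools_ns, inherits = FALSE)) {
    get("sha256sum", envir = tools_ns)
  } else {
    tools::md5sum
  }
  unname(hasher(path))
}

# Hypothetical check for a locked file: stop if the hash no longer matches.
check_locked_sketch <- function(path, recorded_hash) {
  current <- file_hash_sketch(path)
  if (!identical(current, recorded_hash)) {
    stop("Locked file has changed: ", path)
  }
  invisible(TRUE)
}

csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10"), csv_path)
recorded <- file_hash_sketch(csv_path)   # hash stored at first read
check_locked_sketch(csv_path, recorded)  # passes: file unchanged

writeLines(c("id,score", "1,99"), csv_path)  # simulate an edit
changed <- tryCatch(
  check_locked_sketch(csv_path, recorded),
  error = function(e) conditionMessage(e)
)
```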

4. Cache Expensive Operations

model <- get_or_cache("model_v1", {
  expensive_model_fit(df)
}, expire_after = 1440)  # 24 hours
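
The pattern behind this call can be sketched in base R. The version below is not get_or_cache() itself: cache_sketch(), the RDS storage, and the cache_dir argument are assumptions, and expire_after is treated as minutes because 1440 is annotated as 24 hours above.

```r
# Hypothetical helper: return a cached value if it is fresh, else compute it.
# `expr` is only evaluated on a cache miss, thanks to R's lazy evaluation.
cache_sketch <- function(name, expr, expire_after = NULL, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    age_min <- as.numeric(difftime(Sys.time(), file.info(path)$mtime, units = "mins"))
    if (is.null(expire_after) || age_min < expire_after) {
      return(readRDS(path))  # cache hit: `expr` is never evaluated
    }
  }
  value <- expr              # cache miss: force the expression now
  saveRDS(value, path)
  value
}

cache_dir <- tempfile("cache_sketch_")
dir.create(cache_dir)
calls <- 0

first <- cache_sketch("model_v1", { calls <- calls + 1; 42 }, cache_dir = cache_dir)
second <- cache_sketch("model_v1", { calls <- calls + 1; 99 }, cache_dir = cache_dir)
```

The second call returns the stored value without running its expression, which is the point of wrapping an expensive model fit this way.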

5. Save Results

Save data files:

data_save(processed_df, "intermediate.cleaned_data")
# → saves to inputs/intermediate/cleaned_data.rds

data_save(final_df, "final.analysis_ready", type = "csv")
# → saves to inputs/final/analysis_ready.csv

Save analysis outputs:

result_save("regression_model", model, type = "model")
result_save("report", file = "report.html", type = "notebook", blind = TRUE)

6. Query Databases

# settings.yml
connections:
  db:
    driver: postgresql
    host: env("DB_HOST")
    database: env("DB_NAME")
    user: env("DB_USER")
    password: env("DB_PASS")

# R
df <- query_get("SELECT * FROM users WHERE active = true", "db")

Enhanced Data Viewing

view_detail() provides rich, browser-based data exploration:

view_detail(mtcars)                    # Interactive table with search/filter/export
view_detail(config)                    # Tabbed YAML + R structure for lists
view_detail(ggplot(mtcars, aes(mpg, hp)) + geom_point())  # Interactive plots

Configuration

Simple:

default:
  packages:
    - dplyr
    - ggplot2
  data:
    example: data/example.csv

Advanced (split files):

default:
  data: settings/data.yml
  packages: settings/packages.yml
  connections: settings/connections.yml

Secrets in .env:

DB_HOST=localhost
DB_PASS=secret

Reference in config:

connections:
  db:
    host: env("DB_HOST")
    password: env("DB_PASS", "default")
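
For readers unfamiliar with the pattern, here is a rough base-R sketch of loading KEY=VALUE pairs from a .env file and resolving a variable with an optional default. Both helpers and the FW_SKETCH_* variable names are hypothetical; Framework's own parsing of env() in YAML may differ.

```r
# Hypothetical helper: read KEY=VALUE lines and export them to the environment.
load_dotenv_sketch <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]   # skip blanks, comments
  lines <- lines[grepl("=", lines, fixed = TRUE)]
  keys <- trimws(sub("=.*$", "", lines))
  values <- trimws(sub("^[^=]*=", "", lines))
  if (length(keys) > 0) {
    do.call(Sys.setenv, as.list(stats::setNames(values, keys)))
  }
  invisible(keys)
}

# Hypothetical helper: look up a variable, falling back to a default if unset.
env_sketch <- function(name, default = NULL) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (!is.na(value)) return(value)
  if (is.null(default)) stop("Environment variable not set: ", name)
  default
}

env_file <- tempfile(fileext = ".env")
writeLines(c("# demo secrets", "FW_SKETCH_HOST=localhost", "FW_SKETCH_PASS=secret"), env_file)
load_dotenv_sketch(env_file)

host <- env_sketch("FW_SKETCH_HOST")
fallback <- env_sketch("FW_SKETCH_UNSET_VAR", "default")
```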

AI Assistant Support

Framework creates instruction files for AI coding assistants:

framework::configure_ai_agents()

Supported: Claude Code (CLAUDE.md), GitHub Copilot, AGENTS.md

Key Functions

Function             Purpose
scaffold()           Initialize session (load packages, functions, config)
data_load()          Load data from path or config
data_save()          Save data with integrity tracking
view_detail()        Browser-based data viewer with search/export
query_get()          Execute SQL query, return data
query_execute()      Execute SQL command
get_or_cache()       Lazy evaluation with caching
result_save()        Save analysis output
result_get()         Retrieve saved result
scratch_capture()    Quick debug/temp file save
renv_enable()        Enable renv for reproducibility
packages_snapshot()  Save package versions to renv.lock
packages_restore()   Restore packages from renv.lock
security_audit()     Scan for data leaks and security issues

Data Integrity & Security

  • Hash tracking: All data files tracked with SHA-256 hashes
  • Locked data: Flag files as read-only; an error is raised if a locked file changes
  • Password-based encryption: Ansible Vault-style encryption for sensitive data
  • Gitignore by default: Private directories auto-ignored
  • Security audits: security_audit() detects data leaks

Encryption

# Save encrypted data
data_save(sensitive_df, "private.data", encrypted = TRUE)

# Load (auto-detects encryption)
data <- data_load("private.data")

The password is read from the ENCRYPTION_PASSWORD environment variable or an interactive prompt.

Security Auditing

audit <- security_audit()              # Full audit
audit <- security_audit(auto_fix = TRUE)  # Auto-fix .gitignore issues
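
One check an audit of this kind might include is whether private paths are covered by .gitignore. The sketch below is a deliberately simplified illustration: unignored_paths_sketch() is hypothetical, it only compares literal entries, and it ignores glob patterns and negation rules that real .gitignore files support.

```r
# Hypothetical helper: list private paths with no literal .gitignore entry.
unignored_paths_sketch <- function(project_dir, private_paths) {
  gitignore <- file.path(project_dir, ".gitignore")
  rules <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  # Compare paths without leading or trailing slashes
  normalize <- function(x) sub("/+$", "", sub("^/+", "", trimws(x)))
  private_paths[!(normalize(private_paths) %in% normalize(rules))]
}

project_dir <- file.path(tempdir(), "audit_sketch")
dir.create(project_dir, showWarnings = FALSE)
writeLines(c("inputs/raw/", ".env"), file.path(project_dir, ".gitignore"))

not_ignored <- unignored_paths_sketch(
  project_dir,
  c("inputs/raw", "outputs/private", ".env")
)
```

Here not_ignored contains only "outputs/private", the one path without a matching rule.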

Reproducibility with renv

Optional renv integration (off by default):

renv_enable()           # Enable for this project
packages_snapshot()     # Save current versions
packages_restore()      # Restore from renv.lock
renv_disable()          # Disable (keeps renv.lock)

Version pinning in settings.yml:

packages:
  - dplyr                    # Latest from CRAN
  - package@x.y.z            # Specific version (name@version)
  - tidyverse/dplyr@main    # GitHub with branch
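
The three forms above (bare name, name@version, owner/repo@ref) can be told apart with simple string handling. The following is a sketch under that assumption; parse_package_spec_sketch() is hypothetical and the version string 1.1.0 is only an example value.

```r
# Hypothetical helper: split a package spec into source, package name, and ref.
parse_package_spec_sketch <- function(spec) {
  pieces <- strsplit(spec, "@", fixed = TRUE)[[1]]
  target <- pieces[1]
  ref <- if (length(pieces) > 1) pieces[2] else NA_character_
  is_github <- grepl("/", target, fixed = TRUE)  # owner/repo implies GitHub
  list(
    source = if (is_github) "github" else "cran",
    package = basename(target),
    ref = ref
  )
}

specs <- c("dplyr", "dplyr@1.1.0", "tidyverse/dplyr@main")
parsed <- lapply(specs, parse_package_spec_sketch)
str(parsed[[3]])
```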

Roadmap

  • Better database support (DuckDB, MySQL, SQL Server)
  • Results publishing to S3
  • Enhanced results tracking with blinding support

Metadata

Version

1.0.2

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows