Description

Seamless AWS Cloud Bursting for Parallel R Workloads.

A 'future' backend that enables seamless execution of parallel R workloads on 'Amazon Web Services' ('AWS', <https://aws.amazon.com>), including 'EC2' and 'Fargate'. 'staRburst' handles environment synchronization, data transfer, quota management, and worker orchestration automatically, allowing users to scale from local execution to 100+ cloud workers with a single line of code change.

staRburst

Seamless AWS cloud bursting for parallel R workloads

staRburst lets you run parallel R code on AWS with zero infrastructure management. Scale from your laptop to 100+ cloud workers with a single function call. Supports both EC2 (recommended for performance and cost) and Fargate (serverless) backends.

Features

  • Simple Setup: One-time configuration (~2 minutes), then seamless operation
  • Simple API: Direct starburst_map() function - no new concepts to learn
  • Flexible Backends: EC2 (recommended - faster, cheaper, spot support) and Fargate (serverless)
  • Detached Sessions: Submit long-running jobs and detach - retrieve results anytime
  • Automatic Environment Sync: Your packages and dependencies automatically available on workers
  • Smart Quota Management: Automatically handles AWS quota limits with wave execution
  • Cost Transparent: See estimated and actual costs for every run
  • Auto Cleanup: Workers shut down automatically when done

Installation

CRAN submission in progress for v0.3.6 (expected within 2-4 weeks).

Once available:

install.packages("starburst")

Development version from GitHub:

remotes::install_github("scttfrdmn/starburst")

Quick Start

library(starburst)

# One-time setup (2 minutes)
starburst_setup()

# Run parallel computation on AWS
results <- starburst_map(
  1:1000,
  function(x) expensive_computation(x),
  workers = 50
)
#> 🚀 Starting starburst cluster with 50 workers
#> 💰 Estimated cost: ~$2.80/hour
#> 📊 Processing 1000 items with 50 workers
#> 📦 Created 50 chunks (avg 20 items per chunk)
#> 🚀 Submitting tasks...
#> ✓ Submitted 50 tasks
#> ⏳ Progress: 50/50 tasks (3.2 minutes elapsed)
#>
#> ✓ Completed in 3.2 minutes
#> 💰 Estimated cost: $0.15

Example: Monte Carlo Simulation

library(starburst)

# Define simulation
simulate_portfolio <- function(seed) {
  set.seed(seed)
  returns <- rnorm(252, mean = 0.0003, sd = 0.02)
  prices <- cumprod(1 + returns)

  list(
    final_value = prices[252],
    sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
  )
}

# Run 10,000 simulations on 100 AWS workers
results <- starburst_map(
  1:10000,
  simulate_portfolio,
  workers = 100
)
#> 🚀 Starting starburst cluster with 100 workers
#> 💰 Estimated cost: ~$5.60/hour
#> 📊 Processing 10000 items with 100 workers
#> ⏳ Progress: 100/100 tasks (3.1 minutes elapsed)
#>
#> ✓ Completed in 3.1 minutes
#> 💰 Estimated cost: $0.29

# Extract results
final_values <- sapply(results, function(x) x$final_value)
sharpe_ratios <- sapply(results, function(x) x$sharpe_ratio)

# Summary
mean(final_values)    # Average portfolio outcome
quantile(final_values, c(0.05, 0.95))  # Risk range

# Comparison:
# Local (single core): ~4 hours
# Cloud (100 workers): 3 minutes, $0.29
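
Before sending 10,000 tasks to the cloud, it can help to check the function locally on a small subset. The following is a minimal sketch in base R only: lapply() stands in for starburst_map(), and the subset size of 100 is an arbitrary choice for illustration.

```r
# Same simulation function as in the example above.
simulate_portfolio <- function(seed) {
  set.seed(seed)
  returns <- rnorm(252, mean = 0.0003, sd = 0.02)
  prices <- cumprod(1 + returns)
  list(
    final_value = prices[252],
    sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
  )
}

# Local dry run on a small subset; lapply() stands in for starburst_map().
local_results <- lapply(1:100, simulate_portfolio)

# vapply() checks that every element yields exactly one numeric value.
final_values <- vapply(local_results, function(x) x$final_value, numeric(1))
sharpe_ratios <- vapply(local_results, function(x) x$sharpe_ratio, numeric(1))

# Seeding inside the function makes each task reproducible on any worker.
stopifnot(identical(simulate_portfolio(42), simulate_portfolio(42)))
```

Because each task sets its own seed, the same input should give the same result whether it runs locally or on a remote worker.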

Advanced Usage

Reuse Cluster for Multiple Operations

# Create cluster once
cluster <- starburst_cluster(workers = 50, cpu = 4, memory = "8GB")

# Run multiple analyses
results1 <- cluster$map(dataset1, analysis_function)
results2 <- cluster$map(dataset2, processing_function)
results3 <- cluster$map(dataset3, modeling_function)

# All use the same Docker image and configuration

Custom Worker Configuration

# For memory-intensive workloads
results <- starburst_map(
  large_datasets,
  memory_intensive_function,
  workers = 20,
  cpu = 8,
  memory = "16GB"
)

# For CPU-intensive workloads
results <- starburst_map(
  cpu_tasks,
  cpu_intensive_function,
  workers = 50,
  cpu = 4,
  memory = "8GB"
)

Detached Sessions

Run long jobs and disconnect - results persist in S3:

# Start detached session
session <- starburst_session(workers = 50, detached = TRUE)

# Submit work and get session ID
session$submit(quote({
  results <- starburst_map(huge_dataset, expensive_function)
  saveRDS(results, "results.rds")
}))
session_id <- session$session_id

# Disconnect - job continues running
# Later (hours/days), reconnect:
session <- starburst_session_attach(session_id)
status <- session$status()  # Check progress
results <- session$collect()  # Get results

# Cleanup when done
session$cleanup(force = TRUE)

How It Works

  1. Environment Snapshot: Captures your R packages using renv
  2. Container Build: Creates Docker image with your environment, cached in ECR
  3. Task Distribution: Splits data into chunks across workers
  4. Task Submission: Launches Fargate tasks (or sequential batches if quota-limited)
  5. Data Transfer: Serializes task data to S3 using fast qs format
  6. Execution: Workers pull data, execute function on chunk items, push results
  7. Result Collection: Downloads and combines results in correct order
  8. Cleanup: Automatically shuts down workers
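
Steps 3, 6 and 7 above can be illustrated locally without any AWS calls. The sketch below is not staRburst's implementation: make_chunks() and chunked_map() are hypothetical helper names, and lapply() stands in for the remote workers. It only shows how contiguous chunking keeps results in input order.

```r
# Local sketch of chunking, per-chunk execution and ordered recombination.
# make_chunks() and chunked_map() are hypothetical names, not package functions.

# Assign each index of `x` to one of at most `n_chunks` contiguous chunks.
make_chunks <- function(x, n_chunks) {
  n <- length(x)
  n_chunks <- min(n_chunks, n)
  chunk_id <- ceiling(seq_along(x) * n_chunks / n)
  split(seq_along(x), chunk_id)
}

chunked_map <- function(x, f, n_chunks) {
  chunks <- make_chunks(x, n_chunks)
  # Each element of `partial` plays the role of one worker's result.
  partial <- lapply(chunks, function(idx) lapply(x[idx], f))
  # Chunks are contiguous and in order, so concatenation restores input order.
  unlist(unname(partial), recursive = FALSE)
}

chunks <- make_chunks(1:1000, 50)
out <- chunked_map(1:1000, function(i) i^2, n_chunks = 50)
```

With 1000 items and 50 chunks, every chunk holds 20 items, which matches the "avg 20 items per chunk" line in the Quick Start output.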

Cost Management

# Set cost limits
starburst_config(
  max_cost_per_job = 10,      # Hard limit
  cost_alert_threshold = 5     # Warning at $5
)

# Costs shown transparently
results <- starburst_map(data, fn, workers = 100)
#> 💰 Estimated cost: ~$3.50/hour
#> ✓ Completed in 23 minutes
#> 💰 Estimated cost: $1.34
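
The cost figures shown in this README are consistent with simple arithmetic: hourly rate times elapsed time. The sketch below reproduces them. estimate_cost() and minutes_until_limit() are hypothetical helpers written for illustration; the package's actual pricing model is not described here and may include other factors.

```r
# Hypothetical helpers: cost as hourly rate multiplied by elapsed time.
estimate_cost <- function(rate_per_hour, minutes_elapsed) {
  round(rate_per_hour * minutes_elapsed / 60, 2)
}

# How long a job could run at a given rate before reaching a cost limit.
minutes_until_limit <- function(rate_per_hour, cost_limit) {
  cost_limit / rate_per_hour * 60
}

estimate_cost(3.50, 23)   # 1.34, as in the output above
estimate_cost(2.80, 3.2)  # 0.15, as in the Quick Start
estimate_cost(5.60, 3.1)  # 0.29, as in the Monte Carlo example

# Minutes a $3.50/hour cluster could run under a limit of 10.
minutes_until_limit(3.50, 10)
```

At $3.50/hour, a max_cost_per_job of 10 would correspond to a little under three hours of runtime under this simple model.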

Quota Management

staRburst automatically handles AWS Fargate quota limitations:

results <- starburst_map(data, fn, workers = 100, cpu = 4)
#> ⚠ Requested 100 workers (400 vCPUs) but quota allows 25 workers (100 vCPUs)
#> ⚠ Using 25 workers instead
#> 💰 Estimated cost: ~$1.40/hour

Your work still completes, just with fewer workers. You can request quota increases through AWS Service Quotas.
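
The worker cap in the output above follows from dividing the vCPU quota by the vCPUs requested per worker. A minimal sketch of that arithmetic, assuming the quota is expressed in vCPUs; cap_workers() and n_waves() are hypothetical names and not part of the package:

```r
# Hypothetical helper: largest worker count that fits inside a vCPU quota.
cap_workers <- function(requested, cpu_per_worker, vcpu_quota) {
  allowed <- vcpu_quota %/% cpu_per_worker
  min(requested, allowed)
}

# Hypothetical helper: sequential waves needed if the task count stays fixed.
n_waves <- function(n_tasks, workers) {
  ceiling(n_tasks / workers)
}

workers <- cap_workers(requested = 100, cpu_per_worker = 4, vcpu_quota = 100)
workers               # 25, matching the warning above
n_waves(100, workers) # 4 waves if 100 tasks run on 25 workers
```

Requesting fewer vCPUs per worker raises the cap: under the same assumed quota, cpu = 2 would allow 50 workers.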

API Reference

Main Functions

  • starburst_map(.x, .f, workers, ...) - Parallel map over data
  • starburst_cluster(workers, cpu, memory) - Create reusable cluster
  • starburst_setup() - Initial AWS configuration
  • starburst_config(...) - Update configuration
  • starburst_status() - Check cluster status

Configuration Options

starburst_config(
  region = "us-east-1",
  max_cost_per_job = 10,
  cost_alert_threshold = 5
)

Documentation

Full documentation is available at https://starburst.ing

Comparison

Feature                   | staRburst | RStudio Server on EC2 | Coiled (Python)
Setup time                | 2 minutes | 30+ minutes           | 5 minutes
Infrastructure management | Zero      | Manual                | Zero
Learning curve            | Minimal   | Medium                | Medium
Auto scaling              | Yes       | No                    | Yes
Cost optimization         | Automatic | Manual                | Automatic
R-native                  | Yes       | Yes                   | No (Python)

Requirements

  • R >= 4.0
  • AWS account with:
    • AWS CLI configured or AWS_PROFILE set
    • IAM permissions for ECS, ECR, S3, VPC
    • Two IAM roles (created during setup):
      • starburstECSExecutionRole - for ECS/ECR access
      • starburstECSTaskRole - for S3 access

For detailed setup instructions, see the Getting Started guide.
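
Some of the requirements above can be checked from R before running starburst_setup(). The sketch below uses base R only and is an assumption about what a useful preflight check might look like; preflight() is a hypothetical helper, and it does not verify IAM permissions or roles.

```r
# Hypothetical preflight check using base R only.
preflight <- function() {
  c(
    # R >= 4.0 is listed as a requirement.
    r_version_ok = getRversion() >= "4.0.0",
    # Sys.which() returns "" when the executable is not on the PATH.
    aws_cli_found = unname(nzchar(Sys.which("aws"))),
    # AWS_PROFILE is the alternative to a configured AWS CLI.
    aws_profile_set = nzchar(Sys.getenv("AWS_PROFILE"))
  )
}

checks <- preflight()
print(checks)
```

A FALSE for both aws_cli_found and aws_profile_set suggests that credentials still need to be configured, although other credential sources may exist that this sketch does not inspect.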

Roadmap

v0.3.6 (Current - CRAN Submission)

  • ✅ Direct API (starburst_map, starburst_cluster)
  • ✅ AWS Fargate integration
  • ✅ EC2 backend support with spot instances
  • ✅ Detached session mode for long-running jobs
  • ✅ Automatic environment management
  • ✅ Cost tracking and quota handling
  • ✅ Full future backend integration
  • ✅ Support for future.apply, furrr, targets
  • ✅ Comprehensive AWS integration testing
  • ✅ CRAN-ready (0 errors, 0 notes)

Future Releases

  • [ ] Performance optimizations
  • [ ] Enhanced error recovery
  • [ ] Interactive progress monitoring
  • [ ] Multi-region support

Contributing

Contributions welcome! See the GitHub repository for contribution guidelines.

License

Apache License 2.0 - see LICENSE

Copyright 2026 Scott Friedman

Citation

@software{starburst,
  title = {staRburst: Seamless AWS Cloud Bursting for R},
  author = {Scott Friedman},
  year = {2026},
  version = {0.3.6},
  url = {https://starburst.ing},
  license = {Apache-2.0}
}

Credits

Built using the paws AWS SDK for R.

Container management with renv and rocker.

Inspired by Coiled for Python/Dask.

Metadata

Version

0.3.8

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows