Description

Seamless AWS Cloud Bursting for Parallel R Workloads.

A 'future' backend that enables seamless execution of parallel R workloads on 'Amazon Web Services' ('AWS', <https://aws.amazon.com>), including 'EC2' and 'Fargate'. 'staRburst' handles environment synchronization, data transfer, quota management, and worker orchestration automatically, allowing users to scale from local execution to 100+ cloud workers by changing a single line of code.

staRburst

Seamless AWS cloud bursting for parallel R workloads

staRburst lets you run parallel R code on AWS with zero infrastructure management. Scale from your laptop to 100+ cloud workers with a single function call. Supports both EC2 (recommended for performance and cost) and Fargate (serverless) backends.

Features

Simple Setup: One-time configuration (~2 minutes), then seamless operation
Simple API: Direct starburst_map() function - no new concepts to learn
Flexible Backends: EC2 (recommended - faster, cheaper, spot support) and Fargate (serverless)
Detached Sessions: Submit long-running jobs and detach - retrieve results anytime
Automatic Environment Sync: Your packages and dependencies automatically available on workers
Smart Quota Management: Automatically handles AWS quota limits with wave execution
Cost Transparent: See estimated and actual costs for every run
Auto Cleanup: Workers shut down automatically when done

Installation

CRAN submission in progress for v0.3.6 (expected within 2-4 weeks).

Once available:

install.packages("starburst")

Development version from GitHub:

remotes::install_github("scttfrdmn/starburst")

Quick Start

library(starburst)

# One-time setup (2 minutes)
starburst_setup()

# Run parallel computation on AWS
results <- starburst_map(
  1:1000,
  function(x) expensive_computation(x),
  workers = 50
)
#> 🚀 Starting starburst cluster with 50 workers
#> 💰 Estimated cost: ~$2.80/hour
#> 📊 Processing 1000 items with 50 workers
#> 📦 Created 50 chunks (avg 20 items per chunk)
#> 🚀 Submitting tasks...
#> ✓ Submitted 50 tasks
#> ⏳ Progress: 50/50 tasks (3.2 minutes elapsed)
#>
#> ✓ Completed in 3.2 minutes
#> 💰 Estimated cost: $0.15

Example: Monte Carlo Simulation

library(starburst)

# Define simulation
simulate_portfolio <- function(seed) {
  set.seed(seed)
  returns <- rnorm(252, mean = 0.0003, sd = 0.02)
  prices <- cumprod(1 + returns)

  list(
    final_value = prices[252],
    sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
  )
}

# Run 10,000 simulations on 100 AWS workers
results <- starburst_map(
  1:10000,
  simulate_portfolio,
  workers = 100
)
#> 🚀 Starting starburst cluster with 100 workers
#> 💰 Estimated cost: ~$5.60/hour
#> 📊 Processing 10000 items with 100 workers
#> ⏳ Progress: 100/100 tasks (3.1 minutes elapsed)
#>
#> ✓ Completed in 3.1 minutes
#> 💰 Estimated cost: $0.29

# Extract results
final_values <- vapply(results, function(x) x$final_value, numeric(1))
sharpe_ratios <- vapply(results, function(x) x$sharpe_ratio, numeric(1))

# Summary
mean(final_values)    # Average portfolio outcome
quantile(final_values, c(0.05, 0.95))  # Risk range

# Comparison:
# Local (single core): ~4 hours
# Cloud (100 workers): 3 minutes, $0.29
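
Before launching 100 workers, it can help to run the same function on a handful of seeds locally. The sketch below uses only base R and repeats the simulate_portfolio definition from above so that it runs on its own. The choice of five seeds and the name local_results are illustrative assumptions, not part of staRburst.

```r
# Same simulation as above, repeated so this block is self-contained.
simulate_portfolio <- function(seed) {
  set.seed(seed)
  returns <- rnorm(252, mean = 0.0003, sd = 0.02)
  prices <- cumprod(1 + returns)

  list(
    final_value = prices[252],
    sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
  )
}

# Local smoke test on a few seeds; lapply is the sequential analogue of a map.
local_results <- lapply(1:5, simulate_portfolio)

# Each result should carry the two expected fields.
has_fields <- vapply(
  local_results,
  function(x) identical(names(x), c("final_value", "sharpe_ratio")),
  logical(1)
)
stopifnot(all(has_fields))

# Seeding inside the function makes each item repeatable for a given seed
# (assuming the same R version and RNG settings on every machine).
stopifnot(identical(simulate_portfolio(42), simulate_portfolio(42)))
```

Because every item sets its own seed, the result for a given seed should not depend on which worker or chunk processes it.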

Advanced Usage

Reuse Cluster for Multiple Operations

# Create cluster once
cluster <- starburst_cluster(workers = 50, cpu = 4, memory = "8GB")

# Run multiple analyses
results1 <- cluster$map(dataset1, analysis_function)
results2 <- cluster$map(dataset2, processing_function)
results3 <- cluster$map(dataset3, modeling_function)

# All use the same Docker image and configuration

Custom Worker Configuration

# For memory-intensive workloads
results <- starburst_map(
  large_datasets,
  memory_intensive_function,
  workers = 20,
  cpu = 8,
  memory = "16GB"
)

# For CPU-intensive workloads
results <- starburst_map(
  cpu_tasks,
  cpu_intensive_function,
  workers = 50,
  cpu = 4,
  memory = "8GB"
)

Detached Sessions

Run long jobs and disconnect - results persist in S3:

# Start detached session
session <- starburst_session(workers = 50, detached = TRUE)

# Submit work and get session ID
session$submit(quote({
  results <- starburst_map(huge_dataset, expensive_function)
  saveRDS(results, "results.rds")
}))
session_id <- session$session_id

# Disconnect - job continues running
# Later (hours/days), reconnect:
session <- starburst_session_attach(session_id)
status <- session$status()  # Check progress
results <- session$collect()  # Get results

# Cleanup when done
session$cleanup(force = TRUE)

How It Works

Environment Snapshot: Captures your R packages using renv
Container Build: Creates Docker image with your environment, cached in ECR
Task Distribution: Splits data into chunks across workers
Task Submission: Launches Fargate tasks (or sequential batches if quota-limited)
Data Transfer: Serializes task data to S3 using fast qs format
Execution: Workers pull data, execute function on chunk items, push results
Result Collection: Downloads and combines results in correct order
Cleanup: Automatically shuts down workers
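
The chunking and result-collection steps above can be sketched in base R. This is a minimal local illustration, not staRburst's implementation: chunk_indices and map_in_chunks are hypothetical helpers, the chunks are processed sequentially in a shuffled order to mimic workers finishing at different times, and no AWS calls are made.

```r
# Hypothetical helper: assign item positions to `n_chunks` contiguous chunks.
chunk_indices <- function(n_items, n_chunks) {
  n_chunks <- min(n_chunks, n_items)
  chunk_ids <- ceiling(seq_len(n_items) * n_chunks / n_items)
  split(seq_len(n_items), chunk_ids)
}

# Hypothetical helper: apply `fn` chunk by chunk, then restore input order.
map_in_chunks <- function(items, fn, n_chunks) {
  chunks <- chunk_indices(length(items), n_chunks)
  out <- vector("list", length(items))

  # Visit chunks in random order to mimic out-of-order completion.
  for (k in sample(seq_along(chunks))) {
    idx <- chunks[[k]]
    chunk_result <- lapply(items[idx], fn)  # what one worker would compute
    out[idx] <- chunk_result                # place results by original position
  }
  out
}

chunks <- chunk_indices(1000, 50)
length(chunks)           # 50 chunks
unique(lengths(chunks))  # 20 items per chunk

squares <- map_in_chunks(1:1000, function(x) x^2, n_chunks = 50)
identical(unlist(squares), (1:1000)^2)  # TRUE: order matches the input
```

Keeping the original positions alongside each chunk is what allows results to be combined in the correct order regardless of which chunk finishes first.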

Cost Management

# Set cost limits
starburst_config(
  max_cost_per_job = 10,      # Hard limit
  cost_alert_threshold = 5     # Warning at $5
)

# Costs shown transparently
results <- starburst_map(data, fn, workers = 100)
#> 💰 Estimated cost: ~$3.50/hour
#> ✓ Completed in 23 minutes
#> 💰 Estimated cost: $1.34
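
The cost figures in the sample outputs are consistent with multiplying the hourly rate by the elapsed time. The sketch below reproduces that arithmetic in base R. The helpers estimate_run_cost and minutes_until_limit are hypothetical, and whether staRburst computes or enforces costs exactly this way is an assumption.

```r
# Hypothetical helper: cost of a run from its hourly rate and duration.
estimate_run_cost <- function(hourly_rate_usd, elapsed_minutes) {
  hourly_rate_usd * elapsed_minutes / 60
}

# Hypothetical helper: minutes a cluster could run before reaching a cost cap.
minutes_until_limit <- function(max_cost_usd, hourly_rate_usd) {
  max_cost_usd / hourly_rate_usd * 60
}

round(estimate_run_cost(2.80, 3.2), 2)  # 0.15 (Quick Start, 50 workers)
round(estimate_run_cost(5.60, 3.1), 2)  # 0.29 (Monte Carlo, 100 workers)
round(estimate_run_cost(3.50, 23), 2)   # 1.34 (example above)

# With max_cost_per_job = 10 at ~$3.50/hour, the cap is reached after
# roughly 171 minutes of runtime.
round(minutes_until_limit(10, 3.50))
```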

Quota Management

staRburst automatically handles AWS Fargate quota limitations:

results <- starburst_map(data, fn, workers = 100, cpu = 4)
#> ⚠ Requested 100 workers (400 vCPUs) but quota allows 25 workers (100 vCPUs)
#> ⚠ Using 25 workers instead
#> 💰 Estimated cost: ~$1.40/hour

Your work still completes, just with fewer workers. You can request quota increases through AWS Service Quotas.
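
The numbers in the warning follow from dividing the vCPU quota by the vCPUs requested per worker. A minimal sketch of that calculation, using a hypothetical plan_workers helper (not part of the staRburst API) and assuming that, under wave execution, the requested worker count is served in batches of the granted size:

```r
# Hypothetical helper: how many workers fit under a vCPU quota.
plan_workers <- function(requested_workers, cpu_per_worker, vcpu_quota) {
  max_workers <- vcpu_quota %/% cpu_per_worker     # whole workers that fit
  granted <- min(requested_workers, max_workers)

  list(
    requested_vcpus = requested_workers * cpu_per_worker,
    granted_workers = granted,
    granted_vcpus = granted * cpu_per_worker,
    # Assumption: batches needed if all requested workers ran in waves.
    waves = if (granted > 0) ceiling(requested_workers / granted) else NA_real_
  )
}

plan <- plan_workers(requested_workers = 100, cpu_per_worker = 4, vcpu_quota = 100)
plan$requested_vcpus  # 400
plan$granted_workers  # 25
plan$granted_vcpus    # 100
plan$waves            # 4
```

Running a calculation like this before a large job shows whether a quota increase is worth requesting.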

API Reference

Main Functions

starburst_map(.x, .f, workers, ...) - Parallel map over data
starburst_cluster(workers, cpu, memory) - Create reusable cluster
starburst_setup() - Initial AWS configuration
starburst_config(...) - Update configuration
starburst_status() - Check cluster status

Configuration Options

starburst_config(
  region = "us-east-1",
  max_cost_per_job = 10,
  cost_alert_threshold = 5
)

Documentation

Full documentation available at starburst.ing

Comparison

Feature	staRburst	RStudio Server on EC2	Coiled (Python)
Setup time	2 minutes	30+ minutes	5 minutes
Infrastructure management	Zero	Manual	Zero
Learning curve	Minimal	Medium	Medium
Auto scaling	Yes	No	Yes
Cost optimization	Automatic	Manual	Automatic
R-native	Yes	Yes	No (Python)

Requirements

R >= 4.0
AWS account with:
- AWS CLI configured or AWS_PROFILE set
- IAM permissions for ECS, ECR, S3, VPC
- Two IAM roles (created during setup):
  - starburstECSExecutionRole - for ECS/ECR access
  - starburstECSTaskRole - for S3 access

For detailed setup instructions, see the Getting Started guide.

Roadmap

v0.3.6 (Current - CRAN Submission)

✅ Direct API (starburst_map, starburst_cluster)
✅ AWS Fargate integration
✅ EC2 backend support with spot instances
✅ Detached session mode for long-running jobs
✅ Automatic environment management
✅ Cost tracking and quota handling
✅ Full future backend integration
✅ Support for future.apply, furrr, targets
✅ Comprehensive AWS integration testing
✅ CRAN-ready (0 errors, 0 notes)

Future Releases

[ ] Performance optimizations
[ ] Enhanced error recovery
[ ] Interactive progress monitoring
[ ] Multi-region support

Contributing

Contributions welcome! See the GitHub repository for contribution guidelines.

License

Apache License 2.0 - see LICENSE

Citation

@software{starburst,
  title = {staRburst: Seamless AWS Cloud Bursting for R},
  author = {Scott Friedman},
  year = {2026},
  version = {0.3.6},
  url = {https://starburst.ing},
  license = {Apache-2.0}
}

Credits

Built using the paws AWS SDK for R.

Container management with renv and rocker.

Inspired by Coiled for Python/Dask.
