Description

Stable Diffusion Image Generation.

Description

Provides Stable Diffusion image generation in R using the 'ggmlR' tensor library. Supports text-to-image and image-to-image generation with multiple model versions (SD 1.x, SD 2.x, 'SDXL', Flux). Implements the full inference pipeline including CLIP text encoding, 'UNet' noise removal, and 'VAE' encoding/decoding. Unified sd_generate() entry point with automatic strategy selection (direct, tiled sampling, high-resolution fix) based on output resolution and available 'VRAM'. High-resolution generation (2K, 4K+) via tiled 'VAE' decoding, tiled diffusion sampling ('MultiDiffusion'), and classic two-pass refinement (text-to-image, then upscale with image-to-image). Multi-GPU parallel generation via sd_generate_multi_gpu(). Multi-GPU model parallelism via 'device_layout' in sd_ctx(): distribute diffusion, text encoders, and 'VAE' across separate 'Vulkan' devices. Built-in profiling (sd_profile_start(), sd_profile_summary()) for per-stage timing of text encoding, sampling, and 'VAE' decode. Interactive Shiny GUI via sd_app() with non-blocking asynchronous generation (C++ std::thread), live progress bar, auto-detection of model architecture, and ETA display. Supports CPU and 'Vulkan' GPU. No 'Python' or external API dependencies required. Cross-platform: Linux, macOS, Windows.

README.md

cran.r-project.org

sd2R

sd2R is an R package that provides a native, GPU-accelerated Stable Diffusion pipeline by wrapping the C++ implementation from stable-diffusion.cpp and using ggmlR as the tensor backend.

Overview

sd2R exposes a high-level R interface for text-to-image and image-to-image generation, while all heavy computation (tokenization, encoders, denoiser, sampler, VAE, model loading) is implemented in C++. Supports SD 1.x, SD 2.x, SDXL, and Flux model families. Targets local inference on Linux with Vulkan-enabled AMD GPUs (with automatic CPU fallback via ggml), without relying on external Python or web APIs.

Architecture

Flux without Python:

R  →  sd2R  →  ggmlR  →  ggml  →  Vulkan  →  GPU

C++ core (src/sd/): tokenizers, text encoders (CLIP, Mistral, Qwen, UMT5), diffusion UNet/MMDiT denoiser, samplers, VAE encoder/decoder, and model loading for .safetensors and .gguf weights.
R layer: user-facing pipeline functions, parameter validation, image helpers, testing, and documentation-friendly API.
Backend: links against ggmlR (headers via LinkingTo) and libggml.a, reusing the same GGML/Vulkan stack that also powers llamaR and other ggmlR-based packages.

Key Features

Unified sd_generate() — single entry point for all generation modes. Automatically selects the optimal strategy (direct, tiled sampling, or highres fix) based on output resolution and available VRAM (vram_gb parameter in sd_ctx()). Users don't need to think about tiling at all.
CRAN-ready defaults: verbose = FALSE by default — no console output unless explicitly enabled. Cross-platform build system with configure/configure.win generating Makevars from templates.
VRAM-aware auto-routing: queries free GPU memory at runtime and routes to direct generation (fits in VRAM), highres fix (txt2img + upscale + tiled img2img, preferred for coherent large images), or tiled sampling (MultiDiffusion fallback). VAE tiling is also VRAM-aware — enabled automatically only when free memory is insufficient for the given resolution. Set vram_gb in sd_ctx() to override auto-detection.
Multi-GPU data parallelism: sd_generate_multi_gpu() distributes prompts across Vulkan GPUs via callr, one process per GPU, with progress reporting.
Multi-GPU model parallelism: device_layout parameter in sd_ctx() distributes sub-models across multiple Vulkan GPUs within a single process. Presets: "mono" (all on one GPU), "split_encoders" (CLIP/T5 on GPU 1, diffusion + VAE on GPU 0), "split_vae" (CLIP/T5 + VAE on GPU 1, diffusion on GPU 0), "encoders_cpu" (text encoders on CPU). Manual override via diffusion_gpu, clip_gpu, vae_gpu.
Profiling: built-in per-stage timing via sd_profile_start() / sd_profile_stop() / sd_profile_summary(). Tracks model loading, text encoding (with CLIP/T5 breakdown), sampling, and VAE decode/encode stages.
Text-to-image generation supporting Stable Diffusion 1.x, 2.x, SDXL, and Flux models with typical generations taking a few seconds on Vulkan-enabled GPUs.
Image-to-image workflows with noise strength control and reuse of the same denoising pipeline as text-to-image. Requires vae_decode_only = FALSE in context.
Optional upscaling using a dedicated upscaler context managed entirely in C++ and exposed to R through external pointers.
VRAM-aware Tiled VAE for high-resolution images (2K, 4K+) with bounded VRAM usage. vae_mode = "auto" (default) queries free GPU memory before VAE decode and enables tiling only when estimated peak usage exceeds available VRAM (with a 50 MB safety reserve). Falls back to a pixel-area threshold (vae_auto_threshold) when Vulkan memory query is unavailable (CPU backend, no GPU). Supports per-axis relative tile sizing (vae_tile_rel_x, vae_tile_rel_y) for non-square aspect ratios.
Tiled diffusion sampling (MultiDiffusion): at each denoising step the latent is split into overlapping tiles, each denoised independently, and merged with Gaussian weighting. VRAM usage scales with tile size, not output resolution.
Highres Fix: classic two-pass pipeline — generates base image at native model resolution, upscales (bilinear or ESRGAN), then refines with tiled img2img at low denoising strength. Produces coherent high-resolution images (2K, 4K+) with global composition preserved.
Image utilities in R: saving generated images to PNG, converting between internal tensors and R raw vectors, and simple inspection of output tensors.
System introspection via sd_system_info(), reporting GGML/Vulkan capabilities as detected by ggmlR at build time.
Pipeline graph API: sd_pipeline() + sd_node() for composable, sequential multi-step workflows (txt2img → upscale → img2img → save). Pipelines are serializable to JSON via sd_save_pipeline() / sd_load_pipeline().
Shiny GUI: sd_app() launches an interactive web interface with auto-detection of model architecture, non-blocking async generation (C++ std::thread), live progress bar with ETA, and automatic role assignment for multi-file models (Flux, SD3).

Shiny GUI

Launch an interactive web interface for image generation:

sd_app(model_dir = "/path/to/models")

Features:

Auto-detects model architecture (Flux, SD3, SDXL, SD1/2) and assigns component roles (diffusion, VAE, CLIP, T5)
Non-blocking generation with live progress bar and ETA
Prevents incompatible model combinations

Pipeline Example

pipe <- sd_pipeline(
  sd_node("txt2img", prompt = "a cat in space", width = 512, height = 512),
  sd_node("upscale", factor = 2),
  sd_node("img2img", strength = 0.3),
  sd_node("save", path = "output.png")
)

# Save / load as JSON
sd_save_pipeline(pipe, "my_pipeline.json")
pipe <- sd_load_pipeline("my_pipeline.json")

# Run
ctx <- sd_ctx("model.safetensors")
sd_run_pipeline(pipe, ctx, upscaler_ctx = upscaler)

Implementation Details

Rcpp bindings: src/sd2R_interface.cpp defines a thin bridge between R and the C API in stable-diffusion.h, returning XPtr objects with custom finalizers for correct lifetime management of sd_ctx_t and upscaler_ctx_t.
Build system: configure / configure.win generate Makevars from .in templates, resolving ggmlR paths, OpenMP, and Vulkan at configure time. Per-target -include r_ggml_compat.h applied only to sd/*.cpp sources to avoid macro conflicts with system headers.
Package metadata: DESCRIPTION declares Rcpp and ggmlR in LinkingTo, and NAMESPACE is generated via roxygen2 with useDynLib and Rcpp imports.
On load: .onLoad() initializes logging and registers constant values that mirror the underlying C++ enums using 0-based indices.

CRAN Readiness

verbose = FALSE by default — no output unless requested.
Per-target compiler flags for cross-platform compatibility (Linux, macOS, Windows).
All C++ warnings fixed (-Winconsistent-missing-override, deprecated codecvt).
Large tokenizer vocabularies (CLIP, Mistral, Qwen, UMT5) downloaded automatically during installation from GitHub Releases, keeping the source tarball small.

Installation

# Install ggmlR first (if not already installed)
remotes::install_github("Zabis13/ggmlR")

# Install sd2R
remotes::install_github("Zabis13/sd2R")

During installation, the configure script automatically downloads tokenizer vocabulary files (~128 MB total) from GitHub Releases. This requires curl or wget.

Offline / Manual Installation

If you don't have internet access during installation, download the vocabulary files manually and place them into src/sd/ before building:

# Download from https://github.com/Zabis13/sd2R/releases/tag/assets
# Files: vocab.hpp, vocab_mistral.hpp, vocab_qwen.hpp, vocab_umt5.hpp

wget https://github.com/Zabis13/sd2R/releases/download/assets/vocab.hpp -P src/sd/
wget https://github.com/Zabis13/sd2R/releases/download/assets/vocab_mistral.hpp -P src/sd/
wget https://github.com/Zabis13/sd2R/releases/download/assets/vocab_qwen.hpp -P src/sd/
wget https://github.com/Zabis13/sd2R/releases/download/assets/vocab_umt5.hpp -P src/sd/

R CMD INSTALL .

System Requirements

R ≥ 4.1.0, C++17 compiler
curl or wget (for downloading vocabulary files during installation)
Optional GPU: libvulkan-dev + glslc (Linux) or Vulkan SDK (Windows)
Platforms: Linux, macOS, Windows (x86-64, ARM64)

Benchmarks

FLUX.1-dev Q4_K_S — 10 steps

CLIP-L + T5-XXL text encoders, VAE. sample_steps = 10.

Test	AMD RX 9070 (16 GB)	Tesla P100 (16 GB)	2x Tesla T4 (16 GB)
1. 768x768 direct	44.2 s	94.0 s	133.1 s
2. 1024x1024 tiled VAE	163.6 s	151.4 s	243.6 s
3. 2048x1024 highres fix	309.7 s	312.5 s	492.2 s
4. img2img 768x768 direct	29.6 s	51.0 s	73.5 s
5. 1024x1024 direct	163.0 s	152.2 s	243.3 s
6. Multi-GPU 4 prompts	--	--	284.9 s (4 img)

FLUX.1-dev Q4_K_S — 25 steps

CLIP-L + T5-XXL (Q5_K_M) text encoders, VAE. sample_steps = 25.

Test	AMD RX 9070 (16 GB)	2x Tesla T4 (16 GB)
768x768 direct	110.8 s	--
1024x1024 direct	--	553.1 s

Model size comparison

	SD 1.5	Flux Q4_K_S
Diffusion params	~860 MB	~6.5 GB
Text encoders	CLIP ~240 MB	CLIP-L + T5-XXL ~3.9 GB
Sampling per step (768x768)	~0.1–0.3 s	~3.9 s
Architecture	UNet	MMDiT (57 blocks)

Examples

For a live, runnable demo see the Kaggle notebook: Stable Diffusion in R (ggmlR + Vulkan GPU).

License

MIT.

r-sd2R

sd2R

Overview

Architecture

Key Features

Shiny GUI

Pipeline Example

Implementation Details

CRAN Readiness

Installation

Offline / Manual Installation

System Requirements

Benchmarks

FLUX.1-dev Q4_K_S — 10 steps

FLUX.1-dev Q4_K_S — 25 steps

Model size comparison

Examples

See Also

License

Version

License

Status

Source

Homepage

Platforms (80)

sd2R

Overview

Architecture

Key Features

Shiny GUI

Pipeline Example

Implementation Details

CRAN Readiness

Installation

Offline / Manual Installation

System Requirements

Benchmarks

FLUX.1-dev Q4_K_S — 10 steps

FLUX.1-dev Q4_K_S — 25 steps

Model size comparison

Examples

See Also

License

Version

License

Status

Source

Homepage

Platforms80 (80)

Platforms (80)