Description

Haskell bindings to WebGPU Dawn for GPU computing and graphics.

This package provides Haskell bindings to Google's Dawn WebGPU implementation, enabling GPU computing and graphics programming from Haskell. It wraps the gpu.cpp library, which provides a high-level C++ interface to Dawn.

webgpu-dawn

High-level, type-safe Haskell bindings to Google's Dawn WebGPU implementation.

This library enables portable GPU computing with a production-ready DSL designed for high-throughput inference (e.g., LLMs), with a performance target of 300 tokens per second (TPS).

⚡ Core Design Principles

To achieve high performance and type safety, this library adheres to the following strict patterns:

Type-Safe Monadic DSL: No raw strings. We use ShaderM for composability and type safety.
Natural Math & HOAS: Standard operators (+, *) and Higher-Order Abstract Syntax (HOAS) for loops (loop ... $ \i -> ...).
Profile-Driven: Performance tuning is based on Roofline Analysis.
Async Execution: Prefer AsyncPipeline to hide CPU latency and maximize GPU occupancy.
Hardware Acceleration: Mandatory use of Subgroup Operations and F16 precision for heavy compute (MatMul/Reduction).

🏎️ Performance & Profiling

We utilize a Profile-Driven Development (PDD) workflow to maximize throughput.

1. Standard Benchmarks & Roofline Analysis

Run the optimized benchmark to determine TFLOPS and check the Roofline classification (Compute vs Memory Bound).

# Run 2D Block-Tiling MatMul Benchmark (FP32)
cabal run bench-optimized-matmul -- --size 4096 --iters 50

Output Example:

[Compute]  137.4 GFLOPs
[Memory]   201.3 MB
[Status]   COMPUTE BOUND (limited by GPU FLOPs)
[Hint]     Use F16 and Subgroup Operations to break the roofline.
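
The figures in the example output can be checked by hand. An N x N FP32 matrix multiply performs 2N^3 floating-point operations and moves at least three N x N matrices of 4-byte elements; for N = 4096 that is about 137.4 GFLOP and 201.3 MB, matching the output above. The sketch below reproduces that arithmetic and applies a basic roofline test. It is not the code in `WGSL.Analyze`: the function names are illustrative, and the peak throughput and bandwidth values are hypothetical placeholders that you would replace with your GPU's numbers.

```haskell
import Text.Printf (printf)

-- FLOPs for an n x n by n x n matmul: one multiply and one add per term.
matmulFlops :: Integer -> Integer
matmulFlops n = 2 * n ^ (3 :: Int)

-- Minimum bytes moved: read A and B, write C (n*n elements each).
matmulBytes :: Integer -> Integer -> Integer
matmulBytes bytesPerElem n = 3 * n * n * bytesPerElem

data Bound = ComputeBound | MemoryBound
  deriving (Eq, Show)

-- Roofline test: a kernel is compute bound when its arithmetic
-- intensity (FLOP/byte) reaches the ridge point peakFlops / peakBandwidth.
classify :: Double -> Double -> Double -> Bound
classify peakFlops peakBandwidth intensity
  | intensity >= peakFlops / peakBandwidth = ComputeBound
  | otherwise                              = MemoryBound

main :: IO ()
main = do
  let n         = 4096
      flops     = matmulFlops n
      bytes     = matmulBytes 4 n        -- 4 bytes per FP32 element
      intensity = fromIntegral flops / fromIntegral bytes :: Double
      -- Hypothetical hardware peaks: 10 TFLOP/s and 400 GB/s.
      bound     = classify 10.0e12 400.0e9 intensity
  printf "[Compute]  %.1f GFLOP\n" (fromIntegral flops / 1.0e9 :: Double)
  printf "[Memory]   %.1f MB\n" (fromIntegral bytes / 1.0e6 :: Double)
  printf "[Intensity] %.1f FLOP/byte\n" intensity
  print bound
```

Switching to F16 halves the bytes per element (pass `2` to `matmulBytes`), which doubles the arithmetic intensity for the same FLOP count.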

2. Visual Profiling (Chrome Tracing)

Generate a trace file to visualize CPU/GPU overlap and kernel duration.

cabal run bench-optimized-matmul -- --size 4096 --trace

Load: Open chrome://tracing or ui.perfetto.dev
Analyze: Import trace.json to identify gaps between kernel executions (CPU overhead).
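
The gaps mentioned above can also be computed directly from span data. The sketch below assumes each kernel is recorded as a name, a start time, and a duration in microseconds; it serializes the spans as "complete" events (`"ph":"X"`) in the Chrome Trace Event JSON array format and lists the idle time between consecutive kernels. The `Span` type and function names are hypothetical and do not reflect how this library writes `trace.json`.

```haskell
import Data.List (intercalate, sortOn)

-- One kernel execution: name, start timestamp, and duration (microseconds).
data Span = Span
  { spanName :: String
  , startUs  :: Int
  , durUs    :: Int
  } deriving (Eq, Show)

-- Render a span as a Chrome Trace "complete" event.
-- `show` on the name is adequate for plain ASCII names only.
toEvent :: Span -> String
toEvent (Span n ts d) =
  "{\"name\":" ++ show n
    ++ ",\"ph\":\"X\",\"ts\":" ++ show ts
    ++ ",\"dur\":" ++ show d
    ++ ",\"pid\":1,\"tid\":1}"

-- A trace in JSON array format: a list of events.
toTrace :: [Span] -> String
toTrace spans = "[" ++ intercalate "," (map toEvent spans) ++ "]"

-- Idle time between the end of one span and the start of the next.
-- Only strictly positive gaps are reported.
gaps :: [Span] -> [(String, String, Int)]
gaps spans =
  [ (spanName a, spanName b, startUs b - endOf a)
  | (a, b) <- zip sorted (drop 1 sorted)
  , startUs b > endOf a
  ]
  where
    sorted  = sortOn startUs spans
    endOf s = startUs s + durUs s

main :: IO ()
main = do
  let spans =
        [ Span "matmul" 0 400
        , Span "softmax" 520 80
        , Span "matmul" 600 400
        ]
  putStrLn (toTrace spans)
  print (gaps spans)
```

A large gap before a kernel usually means the CPU had not finished encoding the next command buffer when the GPU became idle.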

3. Debugging

Use the GPU printf-style debug buffer to inspect values inside kernels.

-- In DSL:
debugPrintF "intermediate_val" val

🚀 Quick Start

1. High-Level API (Data Parallelism)

Zero boilerplate. Ideal for simple map/reduce tasks.

import WGSL.API
import qualified Data.Vector.Storable as V

main :: IO ()
main = withContext $ \ctx -> do
  input  <- toGPU ctx (V.fromList [1..100] :: V.Vector Float)
  result <- gpuMap (\x -> x * 2.0 + 1.0) input
  out    <- fromGPU' result
  print out

2. Core DSL (Explicit Control)

Required for tuning Shared Memory, Subgroups, and F16.

import WGSL.DSL

shader :: ShaderM ()
shader = do
  input  <- declareInputBuffer "in" (TArray 1024 TF16)
  output <- declareOutputBuffer "out" (TArray 1024 TF16)
   
  -- HOAS Loop: Use lambda argument 'i', NOT string "i"
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- f16 literals for 2x throughput
    let res = val * litF16 2.0 + litF16 1.0
    writeBuffer output i res

📚 DSL Syntax Cheatsheet

Types & Literals

| Haskell Type | WGSL Type | Literal Constructor | Note |
| --- | --- | --- | --- |
| `Exp F32` | `f32` | `litF32 1.0` or `1.0` | Standard float |
| `Exp F16` | `f16` | `litF16 1.0` | Half precision (Fast!) |
| `Exp I32` | `i32` | `litI32 1` or `1` | Signed int |
| `Exp U32` | `u32` | `litU32 1` | Unsigned int |
| `Exp Bool_` | `bool` | `litBool True` | Boolean |

Casting Helpers: i32(e), u32(e), f32(e), f16(e)

Control Flow (HOAS)

-- For Loop
loop start end step $ \i -> do ...

-- If Statement
if_ (val > 10.0) 
    (do ... {- then block -} ...) 
    (do ... {- else block -} ...)

-- Barrier
barrier  -- workgroupBarrier()
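
To illustrate why HOAS loops take a lambda instead of a variable name, here is a toy code generator written from scratch. It is not the `ShaderM` implementation: `Gen`, `emit`, `fresh`, and this `loop` are hypothetical stand-ins. The point it shows is that the loop itself allocates a fresh WGSL variable and passes it to the body, so nested loops cannot collide on a name and the body cannot refer to a variable that does not exist.

```haskell
-- A rendered WGSL expression.
newtype Exp = Exp String

-- Minimal generator monad: threads a fresh-name counter and
-- accumulates emitted lines of WGSL.
newtype Gen a = Gen { runGen :: Int -> (a, Int, [String]) }

instance Functor Gen where
  fmap f (Gen g) = Gen $ \n -> let (a, n', w) = g n in (f a, n', w)

instance Applicative Gen where
  pure a = Gen $ \n -> (a, n, [])
  Gen f <*> Gen g = Gen $ \n ->
    let (h, n1, w1) = f n
        (a, n2, w2) = g n1
    in (h a, n2, w1 ++ w2)

instance Monad Gen where
  Gen g >>= k = Gen $ \n ->
    let (a, n1, w1) = g n
        (b, n2, w2) = runGen (k a) n1
    in (b, n2, w1 ++ w2)

emit :: String -> Gen ()
emit line = Gen $ \n -> ((), n, [line])

-- Allocate a loop variable name that has not been used yet.
fresh :: Gen String
fresh = Gen $ \n -> ("i" ++ show n, n + 1, [])

-- Indent every line produced by the inner action.
indented :: Gen a -> Gen a
indented (Gen g) = Gen $ \n ->
  let (a, n', w) = g n in (a, n', map ("  " ++) w)

-- HOAS loop: the body receives the loop variable as a Haskell value.
loop :: Int -> Int -> Int -> (Exp -> Gen ()) -> Gen ()
loop start end step body = do
  v <- fresh
  emit ("for (var " ++ v ++ ": i32 = " ++ show start ++ "; "
        ++ v ++ " < " ++ show end ++ "; "
        ++ v ++ " += " ++ show step ++ ") {")
  indented (body (Exp v))
  emit "}"

render :: Gen a -> String
render g = let (_, _, w) = runGen g 0 in unlines w

example :: Gen ()
example =
  loop 0 4 1 $ \(Exp i) ->
    loop 0 2 1 $ \(Exp j) ->
      emit ("out[" ++ i ++ " * 2 + " ++ j ++ "] = 0.0;")

main :: IO ()
main = putStr (render example)
```

Running this prints two nested `for` loops over `i0` and `i1`; neither name appears anywhere in the Haskell source of `example`.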

🧩 Kernel Fusion

For maximum performance, fuse multiple operations (Load -> Calc -> Store) into a single kernel to reduce global memory traffic.

import WGSL.Kernel

-- Fuse: Load -> Process -> Store
let pipeline = loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf

-- Execute inside shader
unKernel pipeline i
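
The idea behind `>>>` fusion can be sketched on the CPU with a plain function wrapper. The `Kernel` type below is a simplified assumption (a pure per-element function with a `Category` instance), not the definition in `WGSL.Kernel`, which runs inside a shader. Composing stages first and applying the result once per element means no intermediate array is materialized between stages, which is the same reason a fused GPU kernel avoids extra round trips to global memory.

```haskell
import Prelude hiding (id, (.))
import Control.Category (Category (..), (>>>))

-- Simplified stand-in: a kernel stage is a per-element function.
newtype Kernel a b = Kernel { unKernel :: a -> b }

instance Category Kernel where
  id = Kernel (\x -> x)
  Kernel g . Kernel f = Kernel (\x -> g (f x))

-- Lift an element-wise function into a stage.
mapK :: (a -> b) -> Kernel a b
mapK = Kernel

relu :: Float -> Float
relu = max 0

-- Two stages fused into one per-element computation.
fused :: Kernel Float Float
fused = mapK (* 2.0) >>> mapK relu

main :: IO ()
main = print (map (unKernel fused) [-1.5, 0.0, 2.0])
-- prints [0.0,0.0,4.0]
```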

📚 Architecture & Modules

Execution Model (Latency Hiding)

To maximize GPU occupancy, encoding is separated from submission.

WGSL.Async.Pipeline: Use for main loops. Lets the CPU encode token N+1 while the GPU processes token N.
WGSL.Execute: Low-level synchronous execution (primarily for debugging).
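
The overlap described above is a producer/consumer pattern. The sketch below simulates it with only `base`: one thread encodes commands and hands them over through a single-slot `MVar`, while the main thread plays the role of the GPU and executes them. `encode`, `execute`, and `runPipelined` are hypothetical stand-ins and say nothing about the actual `WGSL.Async.Pipeline` API.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Stand-in for CPU-side command encoding.
encode :: Int -> String
encode n = "cmd-" ++ show n

-- Stand-in for GPU-side execution of an encoded command.
execute :: String -> String
execute cmd = cmd ++ ":done"

-- The encoder thread fills a one-element slot; the consumer drains it.
-- While the consumer works on command N, the encoder is free to
-- prepare command N+1 and then blocks until the slot is empty again.
runPipelined :: [Int] -> IO [String]
runPipelined tokens = do
  slot <- newEmptyMVar
  _ <- forkIO $ do
    mapM_
      (\t ->
         let cmd = encode t
         in length cmd `seq` putMVar slot (Just cmd)) -- force encoding here
      tokens
    putMVar slot Nothing                              -- end-of-stream marker
  let drain acc = do
        next <- takeMVar slot
        case next of
          Nothing  -> pure (reverse acc)
          Just cmd -> drain (execute cmd : acc)
  drain []

main :: IO ()
main = runPipelined [0, 1, 2] >>= mapM_ putStrLn
```

Results arrive in submission order because there is a single producer and a single consumer.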

Module Guide

| Feature | Module | Description |
| --- | --- | --- |
| Subgroup Ops | `WGSL.DSL` | `subgroupMatrixLoad`, `mma`, `subgroupMatrixStore` |
| F16 Math | `WGSL.DSL` | `litF16`, `vec4<f16>` for 2x throughput |
| Structs | `WGSL.Struct` | `Generic` derivation for `std430` layout compliance |
| Analysis | `WGSL.Analyze` | Roofline analysis logic |
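
As background for the `std430` entry: each struct member is placed at the next offset that is a multiple of its alignment, and the struct size is rounded up to the largest member alignment. The sketch below works through that rule for a few 32-bit scalar and vector types. The `Field` type and `layout` function are hypothetical and are not part of `WGSL.Struct`; they only show the padding that a layout-compliant derivation has to account for.

```haskell
-- A small subset of field types, enough to show padding.
data Field = F32 | I32 | U32 | Vec2F32 | Vec3F32 | Vec4F32
  deriving (Eq, Show)

-- Required alignment in bytes.
alignOf :: Field -> Int
alignOf f = case f of
  Vec2F32 -> 8
  Vec3F32 -> 16   -- vec3 aligns like vec4
  Vec4F32 -> 16
  _       -> 4

-- Size in bytes.
sizeOf :: Field -> Int
sizeOf f = case f of
  Vec2F32 -> 8
  Vec3F32 -> 12   -- but occupies only 12 bytes
  Vec4F32 -> 16
  _       -> 4

-- Round n up to the next multiple of a.
roundUp :: Int -> Int -> Int
roundUp a n = ((n + a - 1) `div` a) * a

-- Byte offset of every field, plus the padded struct size.
layout :: [Field] -> ([Int], Int)
layout fields = (reverse offsets, roundUp structAlign end)
  where
    structAlign    = maximum (1 : map alignOf fields)
    (offsets, end) = foldl place ([], 0) fields
    place (offs, cursor) f =
      let off = roundUp (alignOf f) cursor
      in (off : offs, off + sizeOf f)

main :: IO ()
main = print (layout [F32, Vec3F32, F32])
-- prints ([0,16,28],32)
```

Note the 12 bytes of padding after the first `F32`: a Haskell record with the same fields in the same order would not match the GPU layout if it were serialized without that gap.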

📦 Installation

Pre-built Dawn binaries are downloaded automatically during installation.

cabal install webgpu-dawn

License

MIT License - see LICENSE file for details.

Acknowledgments

Dawn (Google): Core WebGPU runtime.
gpu.cpp (Answer.AI): High-level C++ API wrapper inspiration.
GLFW: Window management.

Contact

Maintainer: Junji Hashimoto [email protected].
