MyNixOS website logo
Description

Haskell bindings to WebGPU Dawn for GPU computing and graphics.

This package provides Haskell bindings to Google's Dawn WebGPU implementation, enabling GPU computing and graphics programming from Haskell. It wraps the gpu.cpp library which provides a high-level C++ interface to Dawn.

webgpu-dawn

High-level, type-safe Haskell bindings to Google's Dawn WebGPU implementation.

This library enables portable GPU computing with a Production-Ready DSL designed for high-throughput inference (e.g., LLMs), targeting 300 TPS (Tokens Per Second) performance.


⚡ Core Design Principles

To achieve high performance and type safety, this library adheres to the following strict patterns:

  1. Type-Safe Monadic DSL: No raw strings. We use ShaderM for composability and type safety.
  2. Natural Math & HOAS: Standard operators (+, *) and Higher-Order Abstract Syntax (HOAS) for loops (loop ... $ \i -> ...).
  3. Profile-Driven: Performance tuning is based on Roofline Analysis.
  4. Async Execution: Prefer AsyncPipeline to hide CPU latency and maximize GPU occupancy.
  5. Hardware Acceleration: Mandatory use of Subgroup Operations and F16 precision for heavy compute (MatMul/Reduction).

🏎️ Performance & Profiling

We utilize a Profile-Driven Development (PDD) workflow to maximize throughput.

1. Standard Benchmarks & Roofline Analysis

Run the optimized benchmark to determine TFLOPS and check the Roofline classification (Compute vs Memory Bound).

# Run 2D Block-Tiling MatMul Benchmark (FP32)
cabal run bench-optimized-matmul -- --size 4096 --iters 50

Output Example:

[Compute]  137.4 GFLOPs
[Memory]   201.3 MB
[Status]   COMPUTE BOUND (limited by GPU FLOPs)
[Hint]     Use F16 and Subgroup Operations to break the roofline.

2. Visual Profiling (Chrome Tracing)

Generate a trace file to visualize CPU/GPU overlap and kernel duration.

cabal run bench-optimized-matmul -- --size 4096 --trace

  • Load: Open chrome://tracing or ui.perfetto.dev
  • Analyze: Import trace.json to identify gaps between kernel executions (CPU overhead).

3. Debugging

Use the GPU printf-style debug buffer to inspect values inside kernels.

-- In DSL:
debugPrintF "intermediate_val" val


🚀 Quick Start

1. High-Level API (Data Parallelism)

Zero boilerplate. Ideal for simple map/reduce tasks.

import WGSL.API
import qualified Data.Vector.Storable as V

main :: IO ()
main = withContext $ \ctx -> do
  input  <- toGPU ctx (V.fromList [1..100] :: V.Vector Float)
  result <- gpuMap (\x -> x * 2.0 + 1.0) input
  out    <- fromGPU' result
  print out

2. Core DSL (Explicit Control)

Required for tuning Shared Memory, Subgroups, and F16.

import WGSL.DSL

shader :: ShaderM ()
shader = do
  input  <- declareInputBuffer "in" (TArray 1024 TF16)
  output <- declareOutputBuffer "out" (TArray 1024 TF16)
   
  -- HOAS Loop: Use lambda argument 'i', NOT string "i"
  loop 0 1024 1 $ \i -> do
    val <- readBuffer input i
    -- f16 literals for 2x throughput
    let res = val * litF16 2.0 + litF16 1.0
    writeBuffer output i res


📚 DSL Syntax Cheatsheet

Types & Literals

Haskell TypeWGSL TypeLiteral ConstructorNote
Exp F32f32litF32 1.0 or 1.0Standard float
Exp F16f16litF16 1.0Half precision (Fast!)
Exp I32i32litI32 1 or 1Signed int
Exp U32u32litU32 1Unsigned int
Exp Bool_boollitBool TrueBoolean

Casting Helpers:i32(e), u32(e), f32(e), f16(e)

Control Flow (HOAS)

-- For Loop
loop start end step $ \i -> do ...

-- If Statement
if_ (val > 10.0) 
    (do ... {- then block -} ...) 
    (do ... {- else block -} ...)

-- Barrier
barrier  -- workgroupBarrier()


🧩 Kernel Fusion

For maximum performance, fuse multiple operations (Load -> Calc -> Store) into a single kernel to reduce global memory traffic.

import WGSL.Kernel

-- Fuse: Load -> Process -> Store
let pipeline = loadK inBuf >>> mapK (* 2.0) >>> mapK relu >>> storeK outBuf

-- Execute inside shader
unKernel pipeline i


📚 Architecture & Modules

Execution Model (Latency Hiding)

To maximize GPU occupancy, encoding is separated from submission.

  • WGSL.Async.Pipeline: Use for main loops. Allows CPU to encode Token N+1 while GPU processes Token N.
  • WGSL.Execute: Low-level synchronous execution (primarily for debugging).

Module Guide

FeatureModuleDescription
Subgroup OpsWGSL.DSLsubgroupMatrixLoad, mma, subgroupMatrixStore
F16 MathWGSL.DSLlitF16, vec4<f16> for 2x throughput
StructsWGSL.StructGeneric derivation for std430 layout compliance
AnalysisWGSL.AnalyzeRoofline analysis logic

📦 Installation

Pre-built Dawn binaries are downloaded automatically during installation.

cabal install webgpu-dawn


License

MIT License - see LICENSE file for details.

Acknowledgments

  • Dawn (Google): Core WebGPU runtime.
  • gpu.cpp (Answer.AI): High-level C++ API wrapper inspiration.
  • GLFW: Window management.

Contact

Maintainer: Junji Hashimoto [email protected].

Metadata

Version

0.1.1.0

License

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows