JIT Optimization Log

Experiments with Cranelift JIT for audio graph compilation.

Current State (resin-jit)

The resin-jit crate provides generic JIT compilation with explicit SIMD support.

Performance Summary (44100 samples = 1 second of audio)

| Mode | Time | vs Scalar | vs Native |
|------|------|-----------|-----------|
| Scalar JIT | 209 µs | 1x | 6.6x slower |
| SIMD JIT (f32x4) | 5.1 µs | 41x faster | 6.2x faster |
| Native Rust loop | 31.4 µs | 6.6x faster | 1x |

Key insight: SIMD JIT is faster than native Rust because:

  1. Zero bounds checking (direct pointer arithmetic)
  2. Explicit f32x4 vectorization (LLVM didn't auto-vectorize the native loop)
  3. 4x fewer loop iterations

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      resin-jit                              │
├─────────────────────────────────────────────────────────────┤
│  Traits:                                                    │
│  ├── JitCompilable      emit_ir() for single values         │
│  ├── SimdCompilable     emit_simd_ir() for f32x4 vectors    │
│  └── JitGraph           graph traversal for compilation     │
│                                                             │
│  Classification (JitCategory):                              │
│  ├── PureMath           inline, SIMD-able (gain, clip)      │
│  ├── Stateful           callbacks to Rust (delay, filter)   │
│  └── External           call external fn (noise, sin/cos)   │
│                                                             │
│  Compilation:                                               │
│  ├── compile_affine()       scalar: fn(f32) -> f32          │
│  └── compile_affine_simd()  block: fn(*f32, *f32, len)      │
└─────────────────────────────────────────────────────────────┘
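
For orientation, here is a rough sketch of what the two compilation traits imply. The signatures are illustrative, not copied from resin-jit:

```rust
use cranelift_codegen::ir::{InstBuilder, Value};
use cranelift_frontend::FunctionBuilder;

/// Nodes that can be compiled to straight-line IR (illustrative signatures).
trait JitCompilable {
    /// Emit IR computing this node's output from a single f32 `Value`.
    fn emit_ir(&self, b: &mut FunctionBuilder, input: Value) -> Value;
}

trait SimdCompilable: JitCompilable {
    /// Same, but `input` holds an f32x4 vector value.
    fn emit_simd_ir(&self, b: &mut FunctionBuilder, input: Value) -> Value;
}

/// Example: a pure-math gain node is a single fmul in scalar mode.
struct Gain(f32);

impl JitCompilable for Gain {
    fn emit_ir(&self, b: &mut FunctionBuilder, input: Value) -> Value {
        let g = b.ins().f32const(self.0);
        b.ins().fmul(input, g)
    }
}
```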

SIMD Implementation Details

The compile_affine_simd() method generates a loop structure:

┌─────────────────────────────────────────────────────────────┐
│  SIMD Loop (processes 4 samples per iteration)              │
│  ├── Load f32x4 from input[i*4]                             │
│  ├── fmul with gain vector (splatted)                       │
│  ├── fadd with offset vector (splatted)                     │
│  └── Store f32x4 to output[i*4]                             │
├─────────────────────────────────────────────────────────────┤
│  Scalar Tail (handles remainder: len % 4)                   │
│  ├── Load single f32                                        │
│  ├── fmul, fadd                                             │
│  └── Store single f32                                       │
└─────────────────────────────────────────────────────────────┘
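
Expressed as Rust, the generated function computes the following. This is a sketch using x86_64 SSE intrinsics; the JIT emits the analogous Cranelift IR directly rather than calling intrinsics:

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn affine_simd_equivalent(
    input: *const f32, output: *mut f32, len: usize, gain: f32, offset: f32,
) {
    use core::arch::x86_64::*;
    let g = _mm_set1_ps(gain);   // splat gain into all four lanes
    let o = _mm_set1_ps(offset); // splat offset
    let mut i = 0;
    // SIMD loop: 4 samples per iteration
    while i + 4 <= len {
        let v = _mm_loadu_ps(input.add(i));
        _mm_storeu_ps(output.add(i), _mm_add_ps(_mm_mul_ps(v, g), o));
        i += 4;
    }
    // Scalar tail: the len % 4 remainder
    while i < len {
        *output.add(i) = *input.add(i) * gain + offset;
        i += 1;
    }
}
```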

When to Use

| Scenario | Recommendation |
|----------|----------------|
| Compile-time known graph | Tier 4 codegen (zero overhead) |
| Simple effects | BlockProcessor trait (LLVM optimizes) |
| Dynamic graphs, pure math | SIMD JIT (5 µs for 44100 samples) |
| Dynamic graphs, stateful | Scalar JIT or interpret (callbacks needed) |

Parity Verification

Extensive tests verify scalar JIT == SIMD JIT == native Rust (a self-contained analogue is sketched after this list):

  • Typical audio data (sine waves, -1 to 1 range)
  • Random data (multiple seeds)
  • 25 different buffer sizes (alignment edge cases)
  • 88 gain/offset parameter combinations
  • Edge values (tiny, huge, special floats)
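
The sketch below mirrors the structure of these tests: the chunked path (standing in here for the SIMD JIT, which the real tests invoke) must match the scalar path bit-for-bit.

```rust
fn affine_scalar(input: &[f32], gain: f32, offset: f32) -> Vec<f32> {
    input.iter().map(|&s| s * gain + offset).collect()
}

/// Mirrors the SIMD code path: 4-wide body plus scalar tail.
fn affine_chunked(input: &[f32], gain: f32, offset: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len());
    let mut chunks = input.chunks_exact(4);
    for c in &mut chunks {
        out.extend(c.iter().map(|&s| s * gain + offset)); // f32x4 body
    }
    out.extend(chunks.remainder().iter().map(|&s| s * gain + offset)); // tail
    out
}

#[test]
fn parity_across_buffer_sizes() {
    for len in 0..25 {
        let input: Vec<f32> = (0..len).map(|i| (i as f32 * 0.1).sin()).collect();
        // Exact equality: both paths apply the same multiply/add per sample.
        assert_eq!(
            affine_scalar(&input, 0.5, 0.25),
            affine_chunked(&input, 0.5, 0.25)
        );
    }
}
```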

Baseline Measurements (44100 samples = 1 second of audio)

| Benchmark | Time | Description |
|-----------|------|-------------|
| gain_rust_1sec | 15 µs | Pure Rust loop: sample * gain |
| gain_block_1sec | 2.6-3.1 µs | BlockProcessor trait on Gain node |
| gain_jit_1sec | 61-66 µs | Per-sample JIT (compile_gain) |

The native Rust BlockProcessor is fastest because LLVM can vectorize the simple multiply loop.

Experiment 1: Block-based JIT with external context advancement

Goal: Amortize JIT function call overhead by processing blocks instead of per-sample.

Implementation:

  • compile_graph() generates a loop that processes all samples
  • External advance_context() function called per sample to update ctx.time and ctx.sample_index (its shape is sketched below)
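
A sketch of that callback; the field names and the dt-in-context layout are assumptions, not the actual resin definitions:

```rust
#[repr(C)]
pub struct AudioContext {
    pub time: f64,
    pub sample_index: u64,
    pub dt: f64, // seconds per sample (1.0 / sample_rate); assumed field
}

/// Called once per sample from the JIT-generated loop.
pub extern "C" fn advance_context(ctx: *mut AudioContext) {
    unsafe {
        (*ctx).time += (*ctx).dt;
        (*ctx).sample_index += 1;
    }
}
```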

Result: gain_jit_block_1sec = 71 µs

Analysis: Still slower than per-sample JIT (61 µs). The external function call per sample adds overhead, but at least it's in the same ballpark.

Experiment 2: Inline context advancement

Goal: Eliminate external function call overhead by inlining the context update.

Implementation:

  • Added #[repr(C)] to AudioContext and GraphState for predictable layout
  • Moved ctx to first field in GraphState (offset 0)
  • Generated inline load/add/store instructions for time += dt and sample_index += 1 (equivalent shown below)
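
Sketched, reusing the AudioContext sketch above: the assumed layout and what the emitted instructions amount to. Only ctx, time, sample_index, and dt are named in this log; everything else is illustrative.

```rust
#[repr(C)]
pub struct GraphState {
    pub ctx: AudioContext, // first field, so the JIT can address it at offset 0
    // ... per-node state follows (delay lines, filter history, etc.)
}

// The emitted inline sequence per sample is equivalent to:
//   t  = load.f64 [state + offset_of(time)]
//   dt = load.f64 [state + offset_of(dt)]     ; reloaded every iteration!
//   store.f64 [state + offset_of(time)], t + dt
//   i  = load.i64 [state + offset_of(sample_index)]
//   store.i64 [state + offset_of(sample_index)], i + 1
```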

Result: gain_jit_block_1sec = 120-121 µs (SLOWER!)

Analysis: Counterintuitively, the inline version is ~70% slower than the external call. Possible causes:

  1. Memory access pattern preventing optimization
  2. Increased code size in the hot loop affecting I-cache
  3. Cranelift not optimizing the loads/stores well
  4. The dt load could be hoisted outside the loop (it's constant) but isn't

Experiment 3: Skip context advancement for pure-math graphs

Goal: Avoid unnecessary work when no nodes need the context.

Implementation:

  • Check has_stateful flag from analyze_graph()
  • Only emit context advancement code if there are stateful nodes

Result: gain_jit_block_1sec = 20 µs (down from 121 µs!)

Analysis: 6x improvement for pure-math graphs! But still 8x slower than native (2.6 µs).

Important caveat: This is not a fair comparison:

  • Native BlockProcessor DOES advance context (ctx.advance()) every sample
  • But LLVM can inline, keep values in registers, and vectorize the loop
  • JIT does scalar operations without SIMD

Why Native WAS Faster (Before SIMD)

Native Rust's 2.6 µs vs scalar JIT's 20 µs:

  • Native: LLVM can vectorize output[i] = input[i] * gain into SIMD (8 samples at once)
  • Scalar JIT: Cranelift generates scalar code (1 sample at a time)
  • Native: Context values stay in registers
  • Scalar JIT: Every access is a memory load/store

Why SIMD JIT is Now Faster Than Native

After implementing explicit SIMD, JIT beats native Rust (5.1 µs vs 31.4 µs):

  1. No bounds checking: JIT uses raw pointer arithmetic

    • Native: output[i] = input[i] * gain has 2 bounds checks per iteration
    • SIMD JIT: Direct pointer offset, no checks
  2. Guaranteed vectorization: We explicitly emit f32x4 operations

    • Native: LLVM auto-vectorization is heuristic-based, may not trigger
    • SIMD JIT: Always uses SIMD regardless of surrounding code
  3. Fewer loop iterations: 4 samples per iteration

    • Native: 44100 iterations with bounds checks
    • SIMD JIT: 11025 SIMD iterations + up to 3 scalar for tail
  4. Simpler code structure: JIT generates minimal loop

    • Native: Rust's iterator machinery, slice methods, potential inlining failures
    • SIMD JIT: Straight-line load/compute/store

Benchmark note: The "native" benchmark uses a simple for loop with indexing, which is idiomatic Rust but doesn't use unsafe optimizations. A hand-optimized unsafe Rust version would be competitive.
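
To make the bounds-check point concrete, here are the two safe-Rust shapes (neither is the actual benchmark source): the indexed loop as benchmarked, and a zip-based variant that lets LLVM elide the checks and vectorize reliably.

```rust
fn gain_indexed(input: &[f32], output: &mut [f32], gain: f32) {
    for i in 0..input.len() {
        output[i] = input[i] * gain; // output[i] is bounds-checked each iteration
    }
}

fn gain_zip(input: &[f32], output: &mut [f32], gain: f32) {
    for (o, &s) in output.iter_mut().zip(input) {
        *o = s * gain; // no indexing, checks elided, vectorizes reliably
    }
}
```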

Decision: Graph Optimization Passes First

Rather than optimizing JIT codegen, focus on graph-level optimization passes that run before any execution/compilation. These benefit both dynamic execution and JIT.

Why This Approach

  1. JIT isn't the bottleneck - Dynamic dispatch at 44.1kHz is fast enough
  2. Optimization passes have broader value - Work for audio, fields, images, any graph
  3. Reduce external function calls - Fusing 10 nodes to 2 = 8 fewer calls/sample
  4. Codegen is mechanical - The hard part is recognizing patterns, not emitting code

Planned Optimization Passes

Algebraic Fusion:

Gain(0.5) -> Offset(1.0) -> Gain(2.0) -> Offset(-0.5)

Becomes:

AffineNode { gain: 1.0, offset: 1.5 }  // output = input * 1.0 + 1.5
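
The fusion rule is just composition of affine maps: applying (g1, o1) and then (g2, o2) yields gain g1*g2 and offset o1*g2 + o2. A sketch, with AffineNode spelled as a standalone struct:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Affine { gain: f32, offset: f32 }

impl Affine {
    /// `self` first, then `next`: x*g1 + o1 fed into x*g2 + o2.
    fn then(self, next: Affine) -> Affine {
        Affine {
            gain: self.gain * next.gain,
            offset: self.offset * next.gain + next.offset,
        }
    }
}

#[test]
fn fuses_the_example_chain() {
    let gain = |g| Affine { gain: g, offset: 0.0 };
    let offset = |o| Affine { gain: 1.0, offset: o };
    let fused = gain(0.5).then(offset(1.0)).then(gain(2.0)).then(offset(-0.5));
    assert_eq!(fused, Affine { gain: 1.0, offset: 1.5 });
}
```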

Simplification:

  • IdentityElim - Remove Gain(1.0), Offset(0.0), PassThrough
  • DeadNodeElim - Remove unreachable nodes
  • ConstantFold - Fold Constant(2.0) -> Gain(0.5) into Constant(1.0)

Implementation Plan

  1. Define GraphOptimizer trait (one possible shape is sketched after this list)
  2. Implement AffineChainFusion pass
  3. Implement IdentityElim pass
  4. Implement DeadNodeElim pass
  5. Test on audio graphs, then generalize
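
One plausible shape for the trait and its driver (illustrative; not the final API):

```rust
/// A rewrite pass over some graph type `G`.
trait GraphOptimizer<G> {
    /// Mutate the graph; return true if anything changed so the driver
    /// knows to keep iterating.
    fn optimize(&self, graph: &mut G) -> bool;
}

/// Run the passes repeatedly until none of them makes progress.
fn run_to_fixpoint<G>(graph: &mut G, passes: &[&dyn GraphOptimizer<G>]) {
    while passes.iter().any(|p| p.optimize(graph)) {}
}
```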

Future Work

Most of these items have since been implemented (see resin-jit and resin-expr-field):

  • ✅ Extract JIT to resin-jit crate (generic over graph type)
  • ✅ Add SIMD codegen for pure-math chains (f32x4, 41x speedup)
  • ✅ Apply to field expressions (Phase 3, see below)
  • [ ] Apply to image pipelines

Appendix: Raw Numbers

Current Benchmarks (resin-jit with SIMD)

All benchmarks on 44100 samples (1 second of audio):

| Implementation | Time | Per-sample | Notes |
|----------------|------|------------|-------|
| SIMD JIT (f32x4) | 5.1 µs | 0.12 ns | Fastest - explicit vectorization |
| Native Rust loop | 31.4 µs | 0.71 ns | Bounds-checking overhead |
| Scalar JIT | 209 µs | 4.7 ns | No vectorization |

Historical Benchmarks (before SIMD)

| Implementation | Time | Per-sample |
|----------------|------|------------|
| Native Rust loop | 15 µs | 0.34 ns |
| BlockProcessor (Gain) | 2.6 µs | 0.06 ns |
| JIT block (pure math) | 20 µs | 0.45 ns |
| JIT block (stateful) | 120 µs | 2.7 ns |
| Per-sample JIT | 60-85 µs | 1.4-1.9 ns |
| External fn call | ~71 µs | 1.6 ns |

All times are comfortably within the real-time budget: 44100 samples at 44.1 kHz represent 1 second of audio (about 22.7 µs per sample).

Observations

Before SIMD (historical)

  • For simple operations like Gain, native Rust with BlockProcessor beat JIT by 20-40x
  • JIT value proposition was limited to complex graph routing

After SIMD (current)

  • SIMD JIT now beats native Rust by 6.2x for pure-math operations
  • The performance inversion is due to:
    • Explicit vectorization (guaranteed, not heuristic)
    • No bounds checking (unsafe pointer math)
    • Minimal loop overhead
  • JIT value proposition is now compelling for any buffer processing:
    • Dynamically-built graphs compile to faster-than-native code
    • Block processing amortizes compilation cost
    • SIMD benefits compound with graph complexity

Remaining Challenges

  • Stateful nodes (delay, filter) still require Rust callbacks
  • Graph compilation (compile_graph()) not yet ported to resin-jit
  • Field expressions (Phase 3) needed recursive AST → IR translation ✅ DONE (see below)

Phase 3: Field Expression JIT (resin-expr-field)

Field expressions (FieldExpr) can now be JIT-compiled to native code.

Implementation

The FieldExprCompiler in resin-expr-field compiles a FieldExpr AST to a function fn(x, y, z, t) -> f32 (typical usage is sketched after the feature list below).

Key features:

  • Pure Cranelift perlin2: Noise is fully inlined, no Rust boundary crossing
  • Polynomial transcendentals: sin, cos, tan, exp, ln use optimized polynomial approximations
  • Other noise: simplex2/3, perlin3, fbm use external calls (future: inline these too)
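
From the caller's side, the compiled artifact behaves like a plain function pointer. The sketch below shows only the consuming side; obtaining the function from FieldExprCompiler is elided, and FieldFn is an illustrative alias, not a crate type:

```rust
type FieldFn = fn(f32, f32, f32, f32) -> f32;

/// Sample a compiled field over a size x size grid (z = 0, t = 0 for 2D).
fn render(field: FieldFn, size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(size * size);
    for j in 0..size {
        for i in 0..size {
            let (x, y) = (i as f32 / size as f32, j as f32 / size as f32);
            out.push(field(x, y, 0.0, 0.0));
        }
    }
    out
}
```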

Pure Cranelift perlin2

The perlin2 noise function is implemented entirely in Cranelift IR:

┌─────────────────────────────────────────────────────────────┐
│  emit_perlin2(x, y)                                         │
├─────────────────────────────────────────────────────────────┤
│  1. Floor + fractional: xi, yi, xf, yf                      │
│  2. Fade curves: u = fade(xf), v = fade(yf)                 │
│  3. Hash corners: emit_perm() × 4                           │
│  4. Gradients: emit_grad2() × 4                             │
│  5. Bilinear interpolation: emit_lerp() × 3                 │
│  6. Scale to [0, 1]                                         │
└─────────────────────────────────────────────────────────────┘
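
For reference, the scalar math behind steps 2 and 5 of the diagram. These are the standard Perlin helpers; the hashing and gradient steps follow rhizome_resin_noise and are not reproduced here:

```rust
/// Perlin's quintic fade: 6t^5 - 15t^4 + 10t^3.
fn fade(t: f32) -> f32 {
    t * t * t * (t * (t * 6.0 - 15.0) + 10.0)
}

/// Linear interpolation used for the bilinear blend of the four corners.
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + t * (b - a)
}
```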

Perm table access: Uses direct pointer to the static PERM array (safe because it's program-lifetime).

Parity: Verified exact match with rhizome_resin_noise::perlin2() across 2500 test points.

Polynomial Transcendentals

| Function | Method | Max Error |
|----------|--------|-----------|
| sin(x) | Range reduction to [-π/2, π/2] + degree-9 minimax | < 0.05 |
| cos(x) | sin(x + π/2) | < 0.05 |
| tan(x) | sin(x) / cos(x) | < 0.05 |
| exp(x) | 2^(x·log2(e)) via bit manipulation + polynomial | < 1% relative |
| ln(x) | Exponent extraction + polynomial for ln(1+t) | < 0.05 |

These approximations are suitable for procedural graphics where ~1% error is imperceptible.
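
A scalar sketch of the sin scheme. The Taylor coefficients below stand in for the actual minimax coefficients (not reproduced here); both sets sit comfortably inside the stated error bound on the reduced range:

```rust
use std::f32::consts::PI;

/// Range-reduce to [-π/2, π/2], then evaluate an odd degree-9 polynomial.
fn poly_sin(x: f32) -> f32 {
    // sin(y + k*π) = (-1)^k * sin(y), with y = x - k*π in [-π/2, π/2]
    let k = (x / PI).round();
    let y = x - k * PI;
    let sign = if (k as i64) & 1 == 0 { 1.0 } else { -1.0 };
    let y2 = y * y;
    // y - y^3/3! + y^5/5! - y^7/7! + y^9/9!, evaluated Horner-style
    let p = y * (1.0 + y2 * (-1.0 / 6.0 + y2 * (1.0 / 120.0
        + y2 * (-1.0 / 5040.0 + y2 * (1.0 / 362_880.0)))));
    sign * p
}
```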

What's NOT Inlined (Yet)

| Function | Status | Reason |
|----------|--------|--------|
| perlin3 | External call | Larger: 8 corners |
| simplex2/3 | External call | Complex geometry, conditionals |
| fbm | External call | Loop with multiple perlin calls |

These could be inlined for additional performance, but the benefit is smaller since:

  • Noise is typically a small part of a complex expression
  • External calls are still fast (~10-20 cycles overhead)

Benchmark Results (10,000 evaluations)

| Expression | Interpreted | JIT | Speedup |
|------------|-------------|-----|---------|
| perlin(x*4, y*4) * 0.5 + 0.5 | 415 µs | 110 µs | 3.8x |
| sin(x*π) * cos(y*π) | 284 µs | 73 µs | 3.9x |
| sdf_circle(x, y, 0.5) | 83 µs | 48 µs | 1.7x |
| Compile time | - | 363 µs | - |

Why the Speedups?

Perlin (3.8x) - Same precision, faster execution:

  • ✅ Exact parity with Rust impl (verified across 2500 points)
  • No Rust function call overhead (~10-20 cycles saved per call)
  • All operations fully inlined (no dynamic dispatch)
  • Cranelift optimizer sees whole computation graph
  • Direct PERM table lookup via pointer (no bounds checking)

Trig (3.9x) - Lower precision approximations:

  • ⚠️ Uses polynomial approximations instead of libm
  • Max error ~0.05 (vs libm's ~1e-7)
  • Acceptable for procedural graphics (imperceptible)
  • Interpreted calls f32::sin()/f32::cos() → libm → ~50-100 cycles each
  • JIT uses ~15-20 instruction polynomial → ~10-15 cycles each

SDF (1.7x) - Same precision, less overhead:

  • ✅ Same arithmetic (sqrt, sub, mul)
  • Speedup from eliminating:
    • AST traversal per eval (~5 enum matches)
    • HashMap lookup for variables
    • Box indirection for child expressions

When to Use JIT

| Evaluations | Interpreted | JIT (incl. compile) | Winner |
|-------------|-------------|---------------------|--------|
| 1 | 0.04 µs | 363 µs | Interpreted |
| 100 | 4 µs | 364 µs | Interpreted |
| 1,000 | 42 µs | 374 µs | Interpreted |
| 10,000 | 415 µs | 473 µs | ~Equal |
| 100,000 | 4.2 ms | 1.5 ms | JIT 2.8x |
| 1,000,000 | 42 ms | 11 ms | JIT 3.8x |

Rule of thumb: JIT pays off above ~10k evaluations; a 100×100 texture sits right at the break-even point, and anything larger wins.
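
The break-even point follows directly from the table: at 1,000,000 evaluations the steady-state costs are ≈42 ns/eval interpreted vs ≈11 ns/eval JIT, so the one-time 363 µs compile amortizes after roughly 363 µs / (42 ns − 11 ns) ≈ 12,000 evaluations, matching the ~Equal row at 10,000.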