# JIT Optimization Log
Experiments with Cranelift JIT for audio graph compilation.
## Current State (resin-jit)
The resin-jit crate provides generic JIT compilation with explicit SIMD support.
### Performance Summary (44100 samples = 1 second of audio)
| Mode | Time | vs Scalar | vs Native |
|---|---|---|---|
| Scalar JIT | 209 µs | 1x | 6.6x slower |
| SIMD JIT (f32x4) | 5.1 µs | 41x faster | 6.6x faster |
| Native Rust loop | 31.4 µs | 6.6x faster | 1x |
Key insight: SIMD JIT is faster than native Rust because:
- Zero bounds checking (direct pointer arithmetic)
- Explicit f32x4 vectorization (LLVM didn't auto-vectorize the native loop)
- 4x fewer loop iterations
### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ resin-jit │
├─────────────────────────────────────────────────────────────┤
│ Traits: │
│ ├── JitCompilable emit_ir() for single values │
│ ├── SimdCompilable emit_simd_ir() for f32x4 vectors │
│ └── JitGraph graph traversal for compilation │
│ │
│ Classification (JitCategory): │
│ ├── PureMath inline, SIMD-able (gain, clip) │
│ ├── Stateful callbacks to Rust (delay, filter) │
│ └── External call external fn (noise, sin/cos) │
│ │
│ Compilation: │
│ ├── compile_affine() scalar: fn(f32) -> f32 │
│ └── compile_affine_simd() block: fn(*f32, *f32, len) │
└─────────────────────────────────────────────────────────────┘
```
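For orientation, a minimal sketch of what the trait surface above could look like. The exact signatures in resin-jit may differ (in particular in how the Cranelift builder and SSA values are threaded through), so treat everything below as illustrative:

```rust
use cranelift_codegen::ir::Value;
use cranelift_frontend::FunctionBuilder;

/// Nodes that can emit scalar Cranelift IR: given the SSA value of one input
/// sample, append instructions and return the SSA value of the output sample.
pub trait JitCompilable {
    fn emit_ir(&self, builder: &mut FunctionBuilder<'_>, input: Value) -> Value;
}

/// Nodes that can additionally emit f32x4 IR; `input` here is a vector value.
pub trait SimdCompilable: JitCompilable {
    fn emit_simd_ir(&self, builder: &mut FunctionBuilder<'_>, input: Value) -> Value;
}
```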
### SIMD Implementation Details

The `compile_affine_simd()` method generates a loop structure:
```
┌─────────────────────────────────────────────────────────────┐
│ SIMD Loop (processes 4 samples per iteration) │
│ ├── Load f32x4 from input[i*4] │
│ ├── fmul with gain vector (splatted) │
│ ├── fadd with offset vector (splatted) │
│ └── Store f32x4 to output[i*4] │
├─────────────────────────────────────────────────────────────┤
│ Scalar Tail (handles remainder: len % 4) │
│ ├── Load single f32 │
│ ├── fmul, fadd │
│ └── Store single f32 │
└─────────────────────────────────────────────────────────────┘
```
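As a reading aid, here is the same structure rendered as plain Rust. This is illustrative only; the JIT emits Cranelift f32x4 IR rather than Rust:

```rust
/// Rust equivalent of the emitted loop shape: a 4-wide body plus a scalar
/// tail for `len % 4` samples. Assumes `input.len() == output.len()`.
fn affine_block(input: &[f32], output: &mut [f32], gain: f32, offset: f32) {
    let chunks = input.len() / 4;
    for c in 0..chunks {
        // In the JIT this is one f32x4 load, an fmul with the splatted gain,
        // an fadd with the splatted offset, and one f32x4 store.
        for lane in 0..4 {
            let i = c * 4 + lane;
            output[i] = input[i] * gain + offset;
        }
    }
    // Scalar tail: the remaining len % 4 samples.
    for i in chunks * 4..input.len() {
        output[i] = input[i] * gain + offset;
    }
}
```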
### When to Use

| Scenario | Recommendation |
|---|---|
| Compile-time known graph | Tier 4 codegen (zero overhead) |
| Simple effects | BlockProcessor trait (LLVM optimizes) |
| Dynamic graphs, pure math | SIMD JIT (5 µs for 44100 samples) |
| Dynamic graphs, stateful | Scalar JIT or interpret (callbacks needed) |
### Parity Verification
Extensive tests verify scalar JIT == SIMD JIT == native Rust:
- Typical audio data (sine waves, -1 to 1 range)
- Random data (multiple seeds)
- 25 different buffer sizes (alignment edge cases)
- 88 gain/offset parameter combinations
- Edge values (tiny, huge, special floats)
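A minimal sketch of what one of these parity checks does, assuming the scalar and SIMD entry points have the signatures listed in the Architecture diagram (the real test harness is more elaborate):

```rust
/// Check one (gain, offset, buffer) case: the scalar JIT, the SIMD JIT, and
/// a native Rust loop must produce bit-identical results.
fn check_parity(
    scalar_jit: fn(f32) -> f32,                       // from compile_affine()
    simd_jit: unsafe fn(*const f32, *mut f32, usize), // from compile_affine_simd()
    gain: f32,
    offset: f32,
    input: &[f32],
) {
    let mut simd_out = vec![0.0f32; input.len()];
    unsafe { simd_jit(input.as_ptr(), simd_out.as_mut_ptr(), input.len()) };

    for (i, &x) in input.iter().enumerate() {
        let native = x * gain + offset; // same fmul + fadd the JIT emits
        assert_eq!(scalar_jit(x).to_bits(), native.to_bits());
        assert_eq!(simd_out[i].to_bits(), native.to_bits());
    }
}
```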
## Baseline Measurements (44100 samples = 1 second of audio)
| Benchmark | Time | Description |
|---|---|---|
| `gain_rust_1sec` | 15 µs | Pure Rust loop: `sample * gain` |
| `gain_block_1sec` | 2.6-3.1 µs | `BlockProcessor` trait on Gain node |
| `gain_jit_1sec` | 61-66 µs | Per-sample JIT (`compile_gain`) |
The native Rust BlockProcessor is fastest because LLVM can vectorize the simple multiply loop.
## Experiment 1: Block-based JIT with external context advancement
Goal: Amortize JIT function call overhead by processing blocks instead of per-sample.
Implementation:

- `compile_graph()` generates a loop that processes all samples
- An external `advance_context()` function is called per sample to update `ctx.time` and `ctx.sample_index` (sketched below)
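A sketch of what such a per-sample callback might look like; the struct fields and names here are illustrative rather than the exact resin types:

```rust
/// Per-sample context updated by the callback. Since only Rust touches it in
/// this experiment, the JIT treats it as an opaque pointer.
pub struct AudioContext {
    pub time: f64,
    pub dt: f64,
    pub sample_index: u64,
}

/// Called once per sample by the JIT-compiled loop through a function pointer.
pub extern "C" fn advance_context(ctx: *mut AudioContext) {
    // Safety: the JIT passes the same valid, exclusive context pointer on every call.
    unsafe {
        (*ctx).time += (*ctx).dt;
        (*ctx).sample_index += 1;
    }
}
```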
Result: gain_jit_block_1sec = 71 µs
Analysis: Still slower than per-sample JIT (61 µs). The external function call per sample adds overhead, but at least it's in the same ballpark.
## Experiment 2: Inline context advancement
Goal: Eliminate external function call overhead by inlining the context update.
Implementation:

- Added `#[repr(C)]` to `AudioContext` and `GraphState` for predictable layout
- Moved `ctx` to the first field of `GraphState` (offset 0)
- Generated inline load/add/store instructions for `time += dt` and `sample_index += 1` (see the sketch below)
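A rough sketch of the inline update in Cranelift terms; the helper name and field offsets are assumptions (they presume `ctx` sits at offset 0 of `GraphState` with the `#[repr(C)]` layout above):

```rust
use cranelift_codegen::ir::{types, InstBuilder, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

/// Emit `time += dt; sample_index += 1` directly into the JIT'd loop body.
/// Field offsets (time = 0, dt = 8, sample_index = 16) are illustrative.
fn emit_advance(builder: &mut FunctionBuilder<'_>, state_ptr: Value) {
    let flags = MemFlags::trusted();

    let time = builder.ins().load(types::F64, flags, state_ptr, 0);
    let dt = builder.ins().load(types::F64, flags, state_ptr, 8);
    let new_time = builder.ins().fadd(time, dt);
    builder.ins().store(flags, new_time, state_ptr, 0);

    let idx = builder.ins().load(types::I64, flags, state_ptr, 16);
    let new_idx = builder.ins().iadd_imm(idx, 1);
    builder.ins().store(flags, new_idx, state_ptr, 16);
}
```

Note that `dt` is reloaded on every iteration here, which is exactly the hoisting opportunity called out in the analysis below.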
Result: gain_jit_block_1sec = 120-121 µs (SLOWER!)
Analysis: Counterintuitively, the inline version is ~70% slower than the external call. Possible causes:
- Memory access pattern preventing optimization
- Increased code size in the hot loop affecting I-cache
- Cranelift not optimizing the loads/stores well
- The `dt` load could be hoisted outside the loop (it's constant) but isn't
## Experiment 3: Skip context advancement for pure-math graphs
Goal: Avoid unnecessary work when no nodes need the context.
Implementation:

- Check the `has_stateful` flag from `analyze_graph()`
- Only emit context-advancement code if there are stateful nodes
Result: gain_jit_block_1sec = 20 µs (down from 121 µs!)
Analysis: 6x improvement for pure-math graphs! But still 8x slower than native (2.6 µs).
Important caveat: This is not a fair comparison:
- Native `BlockProcessor` DOES advance context (`ctx.advance()`) every sample
- But LLVM can inline, keep values in registers, and vectorize the loop
- JIT does scalar operations without SIMD
## Why Native WAS Faster (Before SIMD)
Native Rust's 2.6 µs vs scalar JIT's 20 µs:
- Native: LLVM can vectorize `output[i] = input[i] * gain` into SIMD (8 samples at once)
- Scalar JIT: Cranelift generates scalar code (1 sample at a time)
- Native: Context values stay in registers
- Scalar JIT: Every access is a memory load/store
## Why SIMD JIT is Now Faster Than Native
After implementing explicit SIMD, JIT beats native Rust (5.1 µs vs 31.4 µs):
1. No bounds checking: the JIT uses raw pointer arithmetic
   - Native: `output[i] = input[i] * gain` has 2 bounds checks per iteration
   - SIMD JIT: direct pointer offsets, no checks
2. Guaranteed vectorization: we explicitly emit f32x4 operations
   - Native: LLVM auto-vectorization is heuristic-based and may not trigger
   - SIMD JIT: always uses SIMD regardless of surrounding code
3. Fewer loop iterations: 4 samples per iteration
   - Native: 44100 iterations with bounds checks
   - SIMD JIT: 11025 SIMD iterations + up to 3 scalar iterations for the tail
4. Simpler code structure: the JIT generates a minimal loop
   - Native: Rust's iterator machinery, slice methods, potential inlining failures
   - SIMD JIT: straight-line load/compute/store
Benchmark note: The "native" benchmark uses a simple for loop with indexing, which is idiomatic Rust but doesn't use unsafe optimizations. A hand-optimized unsafe Rust version would be competitive.
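For concreteness, a minimal sketch of the kind of hand-optimized unsafe loop meant here (not part of the benchmark suite):

```rust
/// Gain loop without bounds checks, mirroring what the SIMD JIT does with raw
/// pointers; vectorization is left to LLVM.
///
/// # Safety
/// `input` must be valid for `len` reads and `output` for `len` writes.
unsafe fn gain_unchecked(input: *const f32, output: *mut f32, len: usize, gain: f32) {
    for i in 0..len {
        *output.add(i) = *input.add(i) * gain;
    }
}
```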
## Decision: Graph Optimization Passes First
Rather than optimizing JIT codegen, focus on graph-level optimization passes that run before any execution/compilation. These benefit both dynamic execution and JIT.
### Why This Approach
- JIT isn't the bottleneck - Dynamic dispatch at 44.1kHz is fast enough
- Optimization passes have broader value - Work for audio, fields, images, any graph
- Reduce external function calls - Fusing 10 nodes to 2 = 8 fewer calls/sample
- Codegen is mechanical - The hard part is recognizing patterns, not emitting code
### Planned Optimization Passes

Algebraic Fusion:

```
Gain(0.5) -> Offset(1.0) -> Gain(2.0) -> Offset(-0.5)
```

Becomes:

```
AffineNode { gain: 1.0, offset: 1.5 }  // output = input * 1.0 + 1.5
```
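The fusion rule itself is plain affine composition. A self-contained sketch with illustrative names (not the actual resin node types) that reproduces the example above:

```rust
/// y = x * gain + offset
#[derive(Clone, Copy, Debug, PartialEq)]
struct Affine { gain: f32, offset: f32 }

impl Affine {
    /// Composition: apply `self` first, then `next`.
    fn then(self, next: Affine) -> Affine {
        Affine {
            gain: self.gain * next.gain,
            offset: self.offset * next.gain + next.offset,
        }
    }
}

fn main() {
    let chain = [
        Affine { gain: 0.5, offset: 0.0 },  // Gain(0.5)
        Affine { gain: 1.0, offset: 1.0 },  // Offset(1.0)
        Affine { gain: 2.0, offset: 0.0 },  // Gain(2.0)
        Affine { gain: 1.0, offset: -0.5 }, // Offset(-0.5)
    ];
    let fused = chain.into_iter().reduce(Affine::then).unwrap();
    assert_eq!(fused, Affine { gain: 1.0, offset: 1.5 });
}
```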
Simplification:

- `IdentityElim` - Remove `Gain(1.0)`, `Offset(0.0)`, `PassThrough`
- `DeadNodeElim` - Remove unreachable nodes
- `ConstantFold` - `Constant(2.0) -> Gain(0.5)` → `Constant(1.0)`
### Implementation Plan

- Define a `GraphOptimizer` trait (sketched below)
- Implement the `AffineChainFusion` pass
- Implement the `IdentityElim` pass
- Implement the `DeadNodeElim` pass
- Test on audio graphs, then generalize
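A possible shape for that trait, assuming each pass reports whether it changed anything so passes can be run to a fixed point; this is a sketch, not the actual resin API:

```rust
/// One rewrite pass over a graph of type `G`.
trait GraphOptimizer<G> {
    /// Rewrite `graph` in place; return true if anything changed.
    fn optimize(&self, graph: &mut G) -> bool;
}

/// Run the passes repeatedly until none of them makes further progress.
fn optimize_to_fixed_point<G>(graph: &mut G, passes: &[&dyn GraphOptimizer<G>]) {
    loop {
        let mut changed = false;
        for pass in passes {
            changed |= pass.optimize(graph);
        }
        if !changed {
            break;
        }
    }
}
```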
## Future ✅ COMPLETED

These items have been implemented in resin-jit:

- ✅ Extract JIT to `resin-jit` crate (generic over graph type)
- ✅ Add SIMD codegen for pure-math chains (f32x4, 41x speedup)
- [ ] Apply to field expressions, image pipelines (Phase 3)
## Appendix: Raw Numbers

### Current Benchmarks (resin-jit with SIMD)
All benchmarks on 44100 samples (1 second of audio):
| Implementation | Time | Per-sample | Notes |
|---|---|---|---|
| SIMD JIT (f32x4) | 5.1 µs | 0.12 ns | Fastest - explicit vectorization |
| Native Rust loop | 31.4 µs | 0.71 ns | Bounds checking overhead |
| Scalar JIT | 209 µs | 4.7 ns | No vectorization |
### Historical Benchmarks (before SIMD)
| Implementation | Time | Per-sample |
|---|---|---|
| Native Rust loop | 15 µs | 0.34 ns |
| BlockProcessor (Gain) | 2.6 µs | 0.06 ns |
| JIT block (pure math) | 20 µs | 0.45 ns |
| JIT block (stateful) | 120 µs | 2.7 ns |
| Per-sample JIT | 60-85 µs | 1.4-1.9 ns |
| External fn call | ~71 µs | 1.6 ns |
All times are well within the real-time budget for audio: 44100 samples at 44.1 kHz correspond to 1 second of wall-clock time, i.e. about 22.7 µs of budget per sample.
## Observations

### Before SIMD (historical)
- For simple operations like Gain, native Rust with `BlockProcessor` beat JIT by 20-40x
- The JIT value proposition was limited to complex graph routing
### After SIMD (current)
- SIMD JIT now beats native Rust by 6.6x for pure-math operations
- The performance inversion is due to:
- Explicit vectorization (guaranteed, not heuristic)
- No bounds checking (unsafe pointer math)
- Minimal loop overhead
- JIT value proposition is now compelling for any buffer processing:
- Dynamically-built graphs compile to faster-than-native code
- Block processing amortizes compilation cost
- SIMD benefits compound with graph complexity
## Remaining Challenges
- Stateful nodes (delay, filter) still require Rust callbacks
- Graph compilation (`compile_graph()`) not yet ported to resin-jit
- Field expressions (Phase 3) need recursive AST → IR translation ✅ DONE
## Phase 3: Field Expression JIT (resin-expr-field)

Field expressions (`FieldExpr`) can now be JIT-compiled to native code.

### Implementation

The `FieldExprCompiler` in `resin-expr-field` compiles a `FieldExpr` AST to a function `fn(x, y, z, t) -> f32`.
Key features:
- Pure Cranelift perlin2: Noise is fully inlined, no Rust boundary crossing
- Polynomial transcendentals: sin, cos, tan, exp, ln use optimized polynomial approximations
- Other noise: simplex2/3, perlin3, fbm use external calls (future: inline these too)
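To make the calling contract concrete, a usage sketch: `CompiledField` is a hypothetical alias for whatever callable resin-expr-field actually returns (the real type likely also owns the compiled module), but the `(x, y, z, t) -> f32` shape is as described above:

```rust
/// Hypothetical alias for the callable produced by the field-expression JIT.
type CompiledField = fn(f32, f32, f32, f32) -> f32;

/// Fill a width × height texture by evaluating the compiled field at
/// normalized (x, y) coordinates, fixed z = 0, and time t.
fn fill_texture(field: CompiledField, width: usize, height: usize, t: f32) -> Vec<f32> {
    let mut pixels = vec![0.0f32; width * height];
    for yi in 0..height {
        for xi in 0..width {
            let x = xi as f32 / width as f32;  // [0, 1)
            let y = yi as f32 / height as f32; // [0, 1)
            pixels[yi * width + xi] = field(x, y, 0.0, t);
        }
    }
    pixels
}
```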
### Pure Cranelift perlin2

The perlin2 noise function is implemented entirely in Cranelift IR:

```
┌─────────────────────────────────────────────────────────────┐
│ emit_perlin2(x, y) │
├─────────────────────────────────────────────────────────────┤
│ 1. Floor + fractional: xi, yi, xf, yf │
│ 2. Fade curves: u = fade(xf), v = fade(yf) │
│ 3. Hash corners: emit_perm() × 4 │
│ 4. Gradients: emit_grad2() × 4 │
│ 5. Bilinear interpolation: emit_lerp() × 3 │
│ 6. Scale to [0, 1] │
└─────────────────────────────────────────────────────────────┘
```

Perm table access: Uses a direct pointer to the static `PERM` array (safe because the array has program lifetime).
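For reference, the fade and lerp steps from the diagram written as plain Rust; the JIT emits the same arithmetic as Cranelift IR:

```rust
/// Perlin's quintic fade curve: 6t^5 - 15t^4 + 10t^3.
fn fade(t: f32) -> f32 {
    t * t * t * (t * (t * 6.0 - 15.0) + 10.0)
}

/// Linear interpolation between a and b by t.
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + t * (b - a)
}
```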
Parity: Verified exact match with rhizome_resin_noise::perlin2() across 2500 test points.
### Polynomial Transcendentals
| Function | Method | Max Error |
|---|---|---|
| sin(x) | Range reduction to [-π/2, π/2] + degree-9 minimax | < 0.05 |
| cos(x) | sin(x + π/2) | < 0.05 |
| tan(x) | sin(x) / cos(x) | < 0.05 |
| exp(x) | 2^(x·log2(e)) via bit manipulation + polynomial | < 1% relative |
| ln(x) | Exponent extraction + polynomial for ln(1+t) | < 0.05 |
These approximations are suitable for procedural graphics where ~1% error is imperceptible.
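To illustrate the approach (not the exact coefficients the JIT uses), a degree-9 odd polynomial for sin on the reduced range looks like this; Taylor coefficients are shown for clarity, whereas a minimax fit of the same degree distributes the error more evenly:

```rust
/// Degree-9 odd polynomial for sin(x), assuming x is already range-reduced
/// to roughly [-pi/2, pi/2]: x - x^3/6 + x^5/120 - x^7/5040 + x^9/362880.
fn sin_poly(x: f32) -> f32 {
    let x2 = x * x;
    x * (1.0
        + x2 * (-1.0 / 6.0
            + x2 * (1.0 / 120.0
                + x2 * (-1.0 / 5040.0 + x2 * (1.0 / 362_880.0)))))
}
```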
### What's NOT Inlined (Yet)
| Function | Status | Reason |
|---|---|---|
| perlin3 | External call | Larger, 8 corners |
| simplex2/3 | External call | Complex geometry, conditionals |
| fbm | External call | Loop with multiple perlin calls |
These could be inlined for additional performance, but the benefit is smaller since:
- Noise is typically a small part of a complex expression
- External calls are still fast (~10-20 cycles overhead)
### Benchmark Results (10,000 evaluations)
| Expression | Interpreted | JIT | Speedup |
|---|---|---|---|
| `perlin(x*4, y*4) * 0.5 + 0.5` | 415 µs | 110 µs | 3.8x |
| `sin(x*π) * cos(y*π)` | 284 µs | 73 µs | 3.9x |
| `sdf_circle(x, y, 0.5)` | 83 µs | 48 µs | 1.7x |
| Compile time | - | 363 µs | - |
### Why the Speedups?
Perlin (3.8x) - Same precision, faster execution:
- ✅ Exact parity with Rust impl (verified across 2500 points)
- No Rust function call overhead (~10-20 cycles saved per call)
- All operations fully inlined (no dynamic dispatch)
- Cranelift optimizer sees whole computation graph
- Direct PERM table lookup via pointer (no bounds checking)
Trig (3.9x) - Lower precision approximations:
- ⚠️ Uses polynomial approximations instead of libm
- Max error ~0.05 (vs libm's ~1e-7)
- Acceptable for procedural graphics (imperceptible)
- Interpreted calls `f32::sin()`/`f32::cos()` → libm → ~50-100 cycles each
- JIT uses a ~15-20 instruction polynomial → ~10-15 cycles each
SDF (1.7x) - Same precision, less overhead:
- ✅ Same arithmetic (sqrt, sub, mul)
- Speedup from eliminating:
- AST traversal per eval (~5 enum matches)
- HashMap lookup for variables
- Box indirection for child expressions
### When to Use JIT
| Evaluations | Interpreted | JIT (incl. compile) | Winner |
|---|---|---|---|
| 1 | 0.04 µs | 363 µs | Interpreted |
| 100 | 4 µs | 364 µs | Interpreted |
| 1,000 | 42 µs | 374 µs | Interpreted |
| 10,000 | 415 µs | 473 µs | ~Equal |
| 100,000 | 4.2 ms | 1.5 ms | JIT 2.8x |
| 1,000,000 | 42 ms | 11 ms | JIT 3.8x |
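The break-even point follows directly from the per-evaluation costs and the one-time compile cost in the table above:

```rust
fn main() {
    // From the measurements above: ~42 ns/eval interpreted, ~11 ns/eval JIT,
    // ~363 µs one-time compile cost.
    let interp_ns = 42.0_f64;
    let jit_ns = 11.0_f64;
    let compile_ns = 363_000.0_f64;

    let break_even = compile_ns / (interp_ns - jit_ns);
    println!("break-even ≈ {break_even:.0} evaluations"); // ≈ 11,700
}
```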
Rule of thumb: JIT is worth it for >10k evaluations (e.g., 100×100 texture)