# JIT Optimization Log
Experiments with Cranelift JIT for audio graph compilation.
## Current State (resin-jit)
The resin-jit crate provides generic JIT compilation with explicit SIMD support.
### Performance Summary (44100 samples = 1 second of audio)
| Mode | Time | vs Scalar | vs Native |
|---|---|---|---|
| Scalar JIT | 209 µs | 1x | 6.6x slower |
| SIMD JIT (f32x4) | 5.1 µs | 41x faster | 6.6x faster |
| Native Rust loop | 31.4 µs | 6.6x faster | 1x |
Key insight: SIMD JIT is faster than native Rust because:
- Zero bounds checking (direct pointer arithmetic)
- Explicit f32x4 vectorization (LLVM didn't auto-vectorize the native loop)
- 4x fewer loop iterations
### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ resin-jit │
├─────────────────────────────────────────────────────────────┤
│ Traits: │
│ ├── JitCompilable emit_ir() for single values │
│ ├── SimdCompilable emit_simd_ir() for f32x4 vectors │
│ └── JitGraph graph traversal for compilation │
│ │
│ Classification (JitCategory): │
│ ├── PureMath inline, SIMD-able (gain, clip) │
│ ├── Stateful callbacks to Rust (delay, filter) │
│ └── External call external fn (noise, sin/cos) │
│ │
│ Compilation: │
│ ├── compile_affine() scalar: fn(f32) -> f32 │
│ └── compile_affine_simd() block: fn(*f32, *f32, len) │
└─────────────────────────────────────────────────────────────┘
```
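For orientation, a minimal sketch of what the trait surface above could look like. The exact signatures in resin-jit may differ (in particular in how the Cranelift builder and SSA values are threaded through), so treat everything below as illustrative:

```rust
use cranelift_codegen::ir::Value;
use cranelift_frontend::FunctionBuilder;

/// Nodes that can emit scalar Cranelift IR: given the SSA value of one input
/// sample, append instructions and return the SSA value of the output sample.
pub trait JitCompilable {
    fn emit_ir(&self, builder: &mut FunctionBuilder<'_>, input: Value) -> Value;
}

/// Nodes that can additionally emit f32x4 IR; `input` here is a vector value.
pub trait SimdCompilable: JitCompilable {
    fn emit_simd_ir(&self, builder: &mut FunctionBuilder<'_>, input: Value) -> Value;
}
```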
### SIMD Implementation Details

The `compile_affine_simd()` method generates a loop structure:
```
┌─────────────────────────────────────────────────────────────┐
│ SIMD Loop (processes 4 samples per iteration) │
│ ├── Load f32x4 from input[i*4] │
│ ├── fmul with gain vector (splatted) │
│ ├── fadd with offset vector (splatted) │
│ └── Store f32x4 to output[i*4] │
├─────────────────────────────────────────────────────────────┤
│ Scalar Tail (handles remainder: len % 4) │
│ ├── Load single f32 │
│ ├── fmul, fadd │
│ └── Store single f32 │
└─────────────────────────────────────────────────────────────┘
```
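As a reading aid, here is the same structure rendered as plain Rust. This is illustrative only; the JIT emits Cranelift f32x4 IR rather than Rust:

```rust
/// Rust equivalent of the emitted loop shape: a 4-wide body plus a scalar
/// tail for `len % 4` samples. Assumes `input.len() == output.len()`.
fn affine_block(input: &[f32], output: &mut [f32], gain: f32, offset: f32) {
    let chunks = input.len() / 4;
    for c in 0..chunks {
        // In the JIT this is one f32x4 load, an fmul with the splatted gain,
        // an fadd with the splatted offset, and one f32x4 store.
        for lane in 0..4 {
            let i = c * 4 + lane;
            output[i] = input[i] * gain + offset;
        }
    }
    // Scalar tail: the remaining len % 4 samples.
    for i in chunks * 4..input.len() {
        output[i] = input[i] * gain + offset;
    }
}
```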
### When to Use

| Scenario | Recommendation |
|---|---|
| Compile-time known graph | Tier 4 codegen (zero overhead) |
| Simple effects | BlockProcessor trait (LLVM optimizes) |
| Dynamic graphs, pure math | SIMD JIT (5 µs for 44100 samples) |
| Dynamic graphs, stateful | Scalar JIT or interpret (callbacks needed) |
### Parity Verification
Extensive tests verify scalar JIT == SIMD JIT == native Rust:
- Typical audio data (sine waves, -1 to 1 range)
- Random data (multiple seeds)
- 25 different buffer sizes (alignment edge cases)
- 88 gain/offset parameter combinations
- Edge values (tiny, huge, special floats)
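A minimal sketch of what one of these parity checks does, assuming the scalar and SIMD entry points have the signatures listed in the Architecture diagram (the real test harness is more elaborate):

```rust
/// Check one (gain, offset, buffer) case: the scalar JIT, the SIMD JIT, and
/// a native Rust loop must produce bit-identical results.
fn check_parity(
    scalar_jit: fn(f32) -> f32,                       // from compile_affine()
    simd_jit: unsafe fn(*const f32, *mut f32, usize), // from compile_affine_simd()
    gain: f32,
    offset: f32,
    input: &[f32],
) {
    let mut simd_out = vec![0.0f32; input.len()];
    unsafe { simd_jit(input.as_ptr(), simd_out.as_mut_ptr(), input.len()) };

    for (i, &x) in input.iter().enumerate() {
        let native = x * gain + offset; // same fmul + fadd the JIT emits
        assert_eq!(scalar_jit(x).to_bits(), native.to_bits());
        assert_eq!(simd_out[i].to_bits(), native.to_bits());
    }
}
```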
## Baseline Measurements (44100 samples = 1 second of audio)
| Benchmark | Time | Description |
|---|---|---|
| `gain_rust_1sec` | 15 µs | Pure Rust loop: `sample * gain` |
| `gain_block_1sec` | 2.6-3.1 µs | `BlockProcessor` trait on Gain node |
| `gain_jit_1sec` | 61-66 µs | Per-sample JIT (`compile_gain`) |
The native Rust BlockProcessor is fastest because LLVM can vectorize the simple multiply loop.
## Experiment 1: Block-based JIT with external context advancement
Goal: Amortize JIT function call overhead by processing blocks instead of per-sample.
Implementation:

- `compile_graph()` generates a loop that processes all samples
- An external `advance_context()` function is called per sample to update `ctx.time` and `ctx.sample_index` (sketched below)
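A sketch of what such a per-sample callback might look like; the struct fields and names here are illustrative rather than the exact resin types:

```rust
/// Per-sample context updated by the callback. Since only Rust touches it in
/// this experiment, the JIT treats it as an opaque pointer.
pub struct AudioContext {
    pub time: f64,
    pub dt: f64,
    pub sample_index: u64,
}

/// Called once per sample by the JIT-compiled loop through a function pointer.
pub extern "C" fn advance_context(ctx: *mut AudioContext) {
    // Safety: the JIT passes the same valid, exclusive context pointer on every call.
    unsafe {
        (*ctx).time += (*ctx).dt;
        (*ctx).sample_index += 1;
    }
}
```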
Result: gain_jit_block_1sec = 71 µs
Analysis: Still slower than per-sample JIT (61 µs). The external function call per sample adds overhead, but at least it's in the same ballpark.
## Experiment 2: Inline context advancement
Goal: Eliminate external function call overhead by inlining the context update.
Implementation:

- Added `#[repr(C)]` to `AudioContext` and `GraphState` for predictable layout
- Moved `ctx` to the first field of `GraphState` (offset 0)
- Generated inline load/add/store instructions for `time += dt` and `sample_index += 1` (see the sketch below)
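A rough sketch of the inline update in Cranelift terms; the helper name and field offsets are assumptions (they presume `ctx` sits at offset 0 of `GraphState` with the `#[repr(C)]` layout above):

```rust
use cranelift_codegen::ir::{types, InstBuilder, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

/// Emit `time += dt; sample_index += 1` directly into the JIT'd loop body.
/// Field offsets (time = 0, dt = 8, sample_index = 16) are illustrative.
fn emit_advance(builder: &mut FunctionBuilder<'_>, state_ptr: Value) {
    let flags = MemFlags::trusted();

    let time = builder.ins().load(types::F64, flags, state_ptr, 0);
    let dt = builder.ins().load(types::F64, flags, state_ptr, 8);
    let new_time = builder.ins().fadd(time, dt);
    builder.ins().store(flags, new_time, state_ptr, 0);

    let idx = builder.ins().load(types::I64, flags, state_ptr, 16);
    let new_idx = builder.ins().iadd_imm(idx, 1);
    builder.ins().store(flags, new_idx, state_ptr, 16);
}
```

Note that `dt` is reloaded on every iteration here, which is exactly the hoisting opportunity called out in the analysis below.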
Result: gain_jit_block_1sec = 120-121 µs (SLOWER!)
Analysis: Counterintuitively, the inline version is ~70% slower than the external call. Possible causes:
- Memory access pattern preventing optimization
- Increased code size in the hot loop affecting I-cache
- Cranelift not optimizing the loads/stores well
- The `dt` load could be hoisted outside the loop (it's constant) but isn't
## Experiment 3: Skip context advancement for pure-math graphs
Goal: Avoid unnecessary work when no nodes need the context.
Implementation:

- Check the `has_stateful` flag from `analyze_graph()`
- Only emit context-advancement code if there are stateful nodes
Result: gain_jit_block_1sec = 20 µs (down from 121 µs!)
Analysis: 6x improvement for pure-math graphs! But still 8x slower than native (2.6 µs).
Important caveat: This is not a fair comparison:
- Native `BlockProcessor` DOES advance context (`ctx.advance()`) every sample
- But LLVM can inline, keep values in registers, and vectorize the loop
- JIT does scalar operations without SIMD
## Why Native WAS Faster (Before SIMD)
Native Rust's 2.6 µs vs scalar JIT's 20 µs:
- Native: LLVM can vectorize `output[i] = input[i] * gain` into SIMD (8 samples at once)
- Scalar JIT: Cranelift generates scalar code (1 sample at a time)
- Native: Context values stay in registers
- Scalar JIT: Every access is a memory load/store
## Why SIMD JIT is Now Faster Than Native
After implementing explicit SIMD, JIT beats native Rust (5.1 µs vs 31.4 µs):
1. No bounds checking: the JIT uses raw pointer arithmetic
   - Native: `output[i] = input[i] * gain` has 2 bounds checks per iteration
   - SIMD JIT: direct pointer offsets, no checks
2. Guaranteed vectorization: we explicitly emit f32x4 operations
   - Native: LLVM auto-vectorization is heuristic-based and may not trigger
   - SIMD JIT: always uses SIMD regardless of surrounding code
3. Fewer loop iterations: 4 samples per iteration
   - Native: 44100 iterations with bounds checks
   - SIMD JIT: 11025 SIMD iterations + up to 3 scalar iterations for the tail
4. Simpler code structure: the JIT generates a minimal loop
   - Native: Rust's iterator machinery, slice methods, potential inlining failures
   - SIMD JIT: straight-line load/compute/store
Benchmark note: The "native" benchmark uses a simple for loop with indexing, which is idiomatic Rust but doesn't use unsafe optimizations. A hand-optimized unsafe Rust version would be competitive.
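For concreteness, a minimal sketch of the kind of hand-optimized unsafe loop meant here (not part of the benchmark suite):

```rust
/// Gain loop without bounds checks, mirroring what the SIMD JIT does with raw
/// pointers; vectorization is left to LLVM.
///
/// # Safety
/// `input` must be valid for `len` reads and `output` for `len` writes.
unsafe fn gain_unchecked(input: *const f32, output: *mut f32, len: usize, gain: f32) {
    for i in 0..len {
        *output.add(i) = *input.add(i) * gain;
    }
}
```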
## Decision: Graph Optimization Passes First
Rather than optimizing JIT codegen, focus on graph-level optimization passes that run before any execution/compilation. These benefit both dynamic execution and JIT.
### Why This Approach
- JIT isn't the bottleneck - Dynamic dispatch at 44.1kHz is fast enough
- Optimization passes have broader value - Work for audio, fields, images, any graph
- Reduce external function calls - Fusing 10 nodes to 2 = 8 fewer calls/sample
- Codegen is mechanical - The hard part is recognizing patterns, not emitting code
### Planned Optimization Passes

Algebraic Fusion:

```
Gain(0.5) -> Offset(1.0) -> Gain(2.0) -> Offset(-0.5)
```

Becomes:

```
AffineNode { gain: 1.0, offset: 1.5 }  // output = input * 1.0 + 1.5
```
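The fusion rule itself is plain affine composition. A self-contained sketch with illustrative names (not the actual resin node types) that reproduces the example above:

```rust
/// y = x * gain + offset
#[derive(Clone, Copy, Debug, PartialEq)]
struct Affine { gain: f32, offset: f32 }

impl Affine {
    /// Composition: apply `self` first, then `next`.
    fn then(self, next: Affine) -> Affine {
        Affine {
            gain: self.gain * next.gain,
            offset: self.offset * next.gain + next.offset,
        }
    }
}

fn main() {
    let chain = [
        Affine { gain: 0.5, offset: 0.0 },  // Gain(0.5)
        Affine { gain: 1.0, offset: 1.0 },  // Offset(1.0)
        Affine { gain: 2.0, offset: 0.0 },  // Gain(2.0)
        Affine { gain: 1.0, offset: -0.5 }, // Offset(-0.5)
    ];
    let fused = chain.into_iter().reduce(Affine::then).unwrap();
    assert_eq!(fused, Affine { gain: 1.0, offset: 1.5 });
}
```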
Simplification:

- `IdentityElim` - Remove `Gain(1.0)`, `Offset(0.0)`, `PassThrough`
- `DeadNodeElim` - Remove unreachable nodes
- `ConstantFold` - `Constant(2.0) -> Gain(0.5)` → `Constant(1.0)`
### Implementation Plan

- Define a `GraphOptimizer` trait (sketched below)
- Implement the `AffineChainFusion` pass
- Implement the `IdentityElim` pass
- Implement the `DeadNodeElim` pass
- Test on audio graphs, then generalize
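A possible shape for that trait, assuming each pass reports whether it changed anything so passes can be run to a fixed point; this is a sketch, not the actual resin API:

```rust
/// One rewrite pass over a graph of type `G`.
trait GraphOptimizer<G> {
    /// Rewrite `graph` in place; return true if anything changed.
    fn optimize(&self, graph: &mut G) -> bool;
}

/// Run the passes repeatedly until none of them makes further progress.
fn optimize_to_fixed_point<G>(graph: &mut G, passes: &[&dyn GraphOptimizer<G>]) {
    loop {
        let mut changed = false;
        for pass in passes {
            changed |= pass.optimize(graph);
        }
        if !changed {
            break;
        }
    }
}
```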
## Future ✅ COMPLETED

These items have been implemented in resin-jit:

- ✅ Extract JIT to `resin-jit` crate (generic over graph type)
- ✅ Add SIMD codegen for pure-math chains (f32x4, 41x speedup)
- [ ] Apply to field expressions, image pipelines (Phase 3)
## Appendix: Raw Numbers

### Current Benchmarks (resin-jit with SIMD)
All benchmarks on 44100 samples (1 second of audio):
| Implementation | Time | Per-sample | Notes |
|---|---|---|---|
| SIMD JIT (f32x4) | 5.1 µs | 0.12 ns | Fastest - explicit vectorization |
| Native Rust loop | 31.4 µs | 0.71 ns | Bounds checking overhead |
| Scalar JIT | 209 µs | 4.7 ns | No vectorization |
### Historical Benchmarks (before SIMD)
| Implementation | Time | Per-sample |
|---|---|---|
| Native Rust loop | 15 µs | 0.34 ns |
| BlockProcessor (Gain) | 2.6 µs | 0.06 ns |
| JIT block (pure math) | 20 µs | 0.45 ns |
| JIT block (stateful) | 120 µs | 2.7 ns |
| Per-sample JIT | 60-85 µs | 1.4-1.9 ns |
| External fn call | ~71 µs | 1.6 ns |
All times are well within the real-time budget for audio: 44100 samples at 44.1 kHz correspond to 1 second of wall-clock time, i.e. about 22.7 µs of budget per sample.
## Observations

### Before SIMD (historical)
- For simple operations like Gain, native Rust with `BlockProcessor` beat JIT by 20-40x
- The JIT value proposition was limited to complex graph routing
### After SIMD (current)
- SIMD JIT now beats native Rust by 6.6x for pure-math operations
- The performance inversion is due to:
- Explicit vectorization (guaranteed, not heuristic)
- No bounds checking (unsafe pointer math)
- Minimal loop overhead
- JIT value proposition is now compelling for any buffer processing:
- Dynamically-built graphs compile to faster-than-native code
- Block processing amortizes compilation cost
- SIMD benefits compound with graph complexity
## Remaining Challenges
- Stateful nodes (delay, filter) still require Rust callbacks
- Graph compilation (`compile_graph()`) not yet ported to resin-jit
- Field expressions (Phase 3) need recursive AST → IR translation ✅ DONE
## Phase 3: Field Expression JIT (resin-expr-field)

Field expressions (`FieldExpr`) can now be JIT-compiled to native code.

### Implementation

The `FieldExprCompiler` in `resin-expr-field` compiles a `FieldExpr` AST to a function `fn(x, y, z, t) -> f32`.
Key features:
- Pure Cranelift perlin2: Noise is fully inlined, no Rust boundary crossing
- Polynomial transcendentals: sin, cos, tan, exp, ln use optimized polynomial approximations
- Other noise: simplex2/3, perlin3, fbm use external calls (future: inline these too)
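To make the calling contract concrete, a usage sketch: `CompiledField` is a hypothetical alias for whatever callable resin-expr-field actually returns (the real type likely also owns the compiled module), but the `(x, y, z, t) -> f32` shape is as described above:

```rust
/// Hypothetical alias for the callable produced by the field-expression JIT.
type CompiledField = fn(f32, f32, f32, f32) -> f32;

/// Fill a width × height texture by evaluating the compiled field at
/// normalized (x, y) coordinates, fixed z = 0, and time t.
fn fill_texture(field: CompiledField, width: usize, height: usize, t: f32) -> Vec<f32> {
    let mut pixels = vec![0.0f32; width * height];
    for yi in 0..height {
        for xi in 0..width {
            let x = xi as f32 / width as f32;  // [0, 1)
            let y = yi as f32 / height as f32; // [0, 1)
            pixels[yi * width + xi] = field(x, y, 0.0, t);
        }
    }
    pixels
}
```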
### Pure Cranelift perlin2

The perlin2 noise function is implemented entirely in Cranelift IR:

```
┌─────────────────────────────────────────────────────────────┐
│ emit_perlin2(x, y) │
├─────────────────────────────────────────────────────────────┤
│ 1. Floor + fractional: xi, yi, xf, yf │
│ 2. Fade curves: u = fade(xf), v = fade(yf) │
│ 3. Hash corners: emit_perm() × 4 │
│ 4. Gradients: emit_grad2() × 4 │
│ 5. Bilinear interpolation: emit_lerp() × 3 │
│ 6. Scale to [0, 1] │
└─────────────────────────────────────────────────────────────┘
```

Perm table access: Uses a direct pointer to the static `PERM` array (safe because the array has program lifetime).
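For reference, the fade and lerp steps from the diagram written as plain Rust; the JIT emits the same arithmetic as Cranelift IR:

```rust
/// Perlin's quintic fade curve: 6t^5 - 15t^4 + 10t^3.
fn fade(t: f32) -> f32 {
    t * t * t * (t * (t * 6.0 - 15.0) + 10.0)
}

/// Linear interpolation between a and b by t.
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + t * (b - a)
}
```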
Parity: Verified exact match with rhizome_resin_noise::perlin2() across 2500 test points.
### Polynomial Transcendentals
| Function | Method | Max Error |
|---|---|---|
| sin(x) | Range reduction to [-π/2, π/2] + degree-9 minimax | < 0.05 |
| cos(x) | sin(x + π/2) | < 0.05 |
| tan(x) | sin(x) / cos(x) | < 0.05 |
| exp(x) | 2^(x·log2(e)) via bit manipulation + polynomial | < 1% relative |
| ln(x) | Exponent extraction + polynomial for ln(1+t) | < 0.05 |
These approximations are suitable for procedural graphics where ~1% error is imperceptible.
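To illustrate the approach (not the exact coefficients the JIT uses), a degree-9 odd polynomial for sin on the reduced range looks like this; Taylor coefficients are shown for clarity, whereas a minimax fit of the same degree distributes the error more evenly:

```rust
/// Degree-9 odd polynomial for sin(x), assuming x is already range-reduced
/// to roughly [-pi/2, pi/2]: x - x^3/6 + x^5/120 - x^7/5040 + x^9/362880.
fn sin_poly(x: f32) -> f32 {
    let x2 = x * x;
    x * (1.0
        + x2 * (-1.0 / 6.0
            + x2 * (1.0 / 120.0
                + x2 * (-1.0 / 5040.0 + x2 * (1.0 / 362_880.0)))))
}
```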
### What's NOT Inlined (Yet)
| Function | Status | Reason |
|---|---|---|
| perlin3 | External call | Larger, 8 corners |
| simplex2/3 | External call | Complex geometry, conditionals |
| fbm | External call | Loop with multiple perlin calls |
These could be inlined for additional performance, but the benefit is smaller since:
- Noise is typically a small part of a complex expression
- External calls are still fast (~10-20 cycles overhead)
### Benchmark Results (10,000 evaluations)
| Expression | Interpreted | JIT | Speedup |
|---|---|---|---|
| `perlin(x*4, y*4) * 0.5 + 0.5` | 415 µs | 110 µs | 3.8x |
| `sin(x*π) * cos(y*π)` | 284 µs | 73 µs | 3.9x |
| `sdf_circle(x, y, 0.5)` | 83 µs | 48 µs | 1.7x |
| Compile time | - | 363 µs | - |
### Why the Speedups?
Perlin (3.8x) - Same precision, faster execution:
- ✅ Exact parity with Rust impl (verified across 2500 points)
- No Rust function call overhead (~10-20 cycles saved per call)
- All operations fully inlined (no dynamic dispatch)
- Cranelift optimizer sees whole computation graph
- Direct PERM table lookup via pointer (no bounds checking)
Trig (3.9x) - Lower precision approximations:
- ⚠️ Uses polynomial approximations instead of libm
- Max error ~0.05 (vs libm's ~1e-7)
- Acceptable for procedural graphics (imperceptible)
- Interpreted calls `f32::sin()`/`f32::cos()` → libm → ~50-100 cycles each
- JIT uses a ~15-20 instruction polynomial → ~10-15 cycles each
SDF (1.7x) - Same precision, less overhead:
- ✅ Same arithmetic (sqrt, sub, mul)
- Speedup from eliminating:
- AST traversal per eval (~5 enum matches)
- HashMap lookup for variables
- Box indirection for child expressions
### When to Use JIT
| Evaluations | Interpreted | JIT (incl. compile) | Winner |
|---|---|---|---|
| 1 | 0.04 µs | 363 µs | Interpreted |
| 100 | 4 µs | 364 µs | Interpreted |
| 1,000 | 42 µs | 374 µs | Interpreted |
| 10,000 | 415 µs | 473 µs | ~Equal |
| 100,000 | 4.2 ms | 1.5 ms | JIT 2.8x |
| 1,000,000 | 42 ms | 11 ms | JIT 3.8x |
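The break-even point follows directly from the per-evaluation costs and the one-time compile cost in the table above:

```rust
fn main() {
    // From the measurements above: ~42 ns/eval interpreted, ~11 ns/eval JIT,
    // ~363 µs one-time compile cost.
    let interp_ns = 42.0_f64;
    let jit_ns = 11.0_f64;
    let compile_ns = 363_000.0_f64;

    let break_even = compile_ns / (interp_ns - jit_ns);
    println!("break-even ≈ {break_even:.0} evaluations"); // ≈ 11,700
}
```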
Rule of thumb: JIT is worth it for >10k evaluations (e.g., 100×100 texture)