Recipe 65: Composing Two Fabric-Leased Engines for Speculative Decode

Situation

Single-engine inference (Recipe 61) produces correct tokens but pays a full forward pass per token. For latency-sensitive workloads (chat, code completion), the per-stream token rate matters: how do you serve more tokens per second without scaling out hardware?

Speculative decode is the standard answer. A draft model (smaller, cheaper) proposes K candidate tokens; a target model (larger, more accurate) verifies them in ONE batched forward pass. Proposed tokens that match the target’s argmax are accepted; at the first disagreement, that proposal and everything after it are discarded. When the draft and target agree often (a “high acceptance rate”), throughput approaches K + 1 tokens per target forward, a substantial speedup.

The challenge: speculative decode is the simplest non-trivial composition of two engines on one fabric. Both engines hold broker-owned weight regions, both pay Recipe 63’s revocation cost, and the protocol between them MUST preserve correctness — accepted tokens must equal what a target-only greedy decode would have emitted.

What You Build

A two-engine speculative decode session:

  • Draft engine: Qwen 2.5 0.5B Q4_K_M, fabric-leased, runs K proposal forwards per cycle.
  • Target engine: Qwen 2.5 1.5B Q4_K_M, fabric-leased, runs one batched verify forward per cycle on K+1 token positions.
  • Acceptance protocol: accept the longest prefix where the target’s argmax matches the draft’s proposed token; emit the accepted prefix + one extra “free” token from the target’s logits at the first disagreement position; advance both engines’ KV caches by exactly the accepted count.

The contract: the end-to-end greedy emit sequence equals the emit sequence of a target-only greedy decode. Speculation costs no correctness; it only buys throughput.

Building Blocks

  • Two instances of FabricNativeCudaEngine (one draft, one target), each with its own LeaseRegion-backed weights.
  • SpeculativeDecodeEngine — the wrapper that holds both engines and implements the propose/verify/accept loop.
  • forward_prefill_cuda_paged_all_logits — the target-side batched verify entrypoint (returns logits for K+1 token positions in one forward).
  • paged_flash_attention_multi_head_fused_q_f32_cuda — the FA kernel the batched verify path uses; correct under q_start_pos != 0 (recipe-58 batched-verify-causal-fix).

Design

Resource Model

Each engine independently:

  • Acquires its own GPU lease + memory regions for weights.
  • Runs its own paged KV cache.
  • Pays Recipe 63’s intra-kernel revoke cost.

The speculative engine wires them together at the propose/verify/accept boundary. From the fabric’s point of view they are two tenants, either on different GPUs or on the same GPU sharing memory pools.

Acceptance Rule (Greedy)

For step i of K proposals:

  1. Draft emits proposal[i] = argmax(draft_logits_at_position_i).
  2. Target’s batched verify gives target_argmax[i] = argmax(target_logits_at_position_i).
  3. If proposal[i] == target_argmax[i], accept. Else, reject from position i onward.
  4. Emit accepted_prefix + target_argmax[first_disagreement_or_K] — i.e., accepted_count + 1 tokens per cycle.
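
The four steps above can be sketched as a single pure function. This is a minimal illustration, not the wrapper's real API: the name `accept_greedy` and its slice-based signature are assumptions made for the example. It takes the K draft proposals and the K + 1 target argmax tokens from one batched verify and returns the tokens to emit.

```rust
/// Greedy acceptance for one speculative cycle (hypothetical helper).
/// `proposals` holds the K draft tokens; `target_argmax` holds the K + 1
/// target argmax tokens, one per verified position.
/// Returns the accepted prefix followed by one target token.
fn accept_greedy(proposals: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(
        target_argmax.len(),
        proposals.len() + 1,
        "verify must cover K + 1 positions"
    );
    // Length of the longest prefix on which draft and target agree.
    let accepted = proposals
        .iter()
        .zip(target_argmax)
        .take_while(|(p, t)| p == t)
        .count();
    let mut emitted = proposals[..accepted].to_vec();
    // The target's own token at the first disagreement (or at position K
    // when every proposal was accepted) is emitted as well.
    emitted.push(target_argmax[accepted]);
    emitted
}
```

With made-up token ids, proposals `[1, 2, 3, 4]` against target argmax `[1, 2, 9, 9, 9]` accept two tokens and emit three: the two matches plus the target's token at the disagreement position.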

Isolation and Safety

  • Both engines’ weights are fabric-leased; Recipe 63’s revoke semantics apply to both. A revoke on the draft engine doesn’t kill the target’s session, and vice versa — the wrapper surfaces the revocation as a typed error.
  • The acceptance rule preserves greedy correctness exactly: at every accepted position, target’s argmax was the emitted token. Equivalent to running target-only greedy without speculation.
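
The equivalence claim can be checked on a toy model without any GPU. The sketch below is an assumption-laden illustration: `toy_target` and `toy_draft` are invented deterministic next-token functions standing in for the two engines, and the "batched verify" is simulated by evaluating the target on each of the K + 1 prefixes. It is not the engine code, only a demonstration that the greedy acceptance rule reproduces target-only greedy output.

```rust
/// Toy deterministic "target": next token as a function of the context.
/// Invented for illustration; not a real model.
fn toy_target(ctx: &[u32]) -> u32 {
    let last = *ctx.last().unwrap_or(&0);
    (last * 7 + ctx.len() as u32) % 50
}

/// Toy "draft": agrees with the target except when the context length
/// is a multiple of 3, where it is deliberately off by one.
fn toy_draft(ctx: &[u32]) -> u32 {
    let t = toy_target(ctx);
    if ctx.len() % 3 == 0 { (t + 1) % 50 } else { t }
}

/// Baseline: one target call per emitted token.
fn target_only_greedy(prompt: &[u32], n: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    for _ in 0..n {
        let next = toy_target(&ctx);
        ctx.push(next);
    }
    ctx[prompt.len()..].to_vec()
}

/// Speculative loop: k draft proposals, one simulated k + 1 position verify,
/// then accept the matching prefix plus one target token.
fn speculative_greedy(prompt: &[u32], n: usize, k: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    while ctx.len() - prompt.len() < n {
        // Draft proposes k tokens autoregressively on its own context.
        let mut draft_ctx = ctx.clone();
        let mut proposals = Vec::with_capacity(k);
        for _ in 0..k {
            let p = toy_draft(&draft_ctx);
            proposals.push(p);
            draft_ctx.push(p);
        }
        // Verify: position i sees the context plus proposals[..i].
        let base = ctx.len();
        let mut verify_ctx = ctx.clone();
        verify_ctx.extend_from_slice(&proposals);
        let target_argmax: Vec<u32> = (0..=k)
            .map(|i| toy_target(&verify_ctx[..base + i]))
            .collect();
        let accepted = proposals
            .iter()
            .zip(&target_argmax)
            .take_while(|(p, t)| p == t)
            .count();
        ctx.extend_from_slice(&proposals[..accepted]);
        ctx.push(target_argmax[accepted]);
    }
    // The last cycle may overshoot n; trim to the requested length.
    ctx.truncate(prompt.len() + n);
    ctx[prompt.len()..].to_vec()
}
```

For any prompt, token count, and K, `speculative_greedy` should return exactly what `target_only_greedy` returns, because every emitted token is either a proposal equal to the target's argmax or the target's argmax itself.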

Throughput Model

Each cycle costs: 1 batched target forward + K draft forwards. With acceptance rate α:

  • Tokens emitted per cycle: α·K + 1.
  • Target-only baseline: 1 token per target forward.
  • Speedup: (α·K + 1) / (1 + K·(t_draft / t_target)), where t_draft / t_target is the cost of one draft forward relative to one target forward. For Qwen 0.5B / 1.5B on an L4 this ratio is roughly 0.3.

At α = 0.95, K = 4, and a cost ratio of 0.3, the model gives (0.95·4 + 1) / (1 + 4·0.3) = 4.8 / 2.2 ≈ 2.2× over target-only greedy.
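
The speedup formula is easy to evaluate for other operating points. The helper names below (`modeled_speedup`, `speculation_pays_off`) are illustrative assumptions, not part of the crate; the arithmetic is exactly the formula from the list above.

```rust
/// Modeled speedup over target-only greedy:
/// (alpha * K + 1) / (1 + K * cost_ratio), with cost_ratio = t_draft / t_target.
fn modeled_speedup(alpha: f64, k: u32, cost_ratio: f64) -> f64 {
    let k = f64::from(k);
    (alpha * k + 1.0) / (1.0 + k * cost_ratio)
}

/// True when the model predicts a gain over the target-only baseline.
fn speculation_pays_off(alpha: f64, k: u32, cost_ratio: f64) -> bool {
    modeled_speedup(alpha, k, cost_ratio) > 1.0
}
```

Setting the speedup to 1 and solving gives α = t_draft / t_target, independent of K. Under this model, speculation only pays off when the acceptance rate exceeds the draft-to-target cost ratio, and at α = 0 each cycle pays for K draft forwards and still emits a single token.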

Walkthrough (Implementation Sketch)

1. Acquire Both Engines

let draft = FabricNativeCudaEngine::new_paged(entropy.clone(), 1024);
let target = FabricNativeCudaEngine::new_paged(entropy.clone(), 1024);
let draft_model = draft.load_model(&mut draft_src, cfg.clone()).await?;
let target_model = target.load_model(&mut target_src, cfg).await?;

2. Wire into SpeculativeDecodeEngine

let mut spec = SpeculativeDecodeEngine::new(
    draft, draft_model,
    target, target_model,
    SpecConfig { k: 4, /* ... */ },
);
let mut seq = spec.create_sequence(1024)?;
spec.submit_prompt(&mut seq, &prompt_ids)?;

3. Decode Loop

let sampling = SamplingParams::greedy(1);
for _ in 0..n_tokens {
    let emitted = spec.decode_one(&mut seq, &sampling).await?;
    sink(emitted);
}

4. Internal: One Cycle

// Pseudocode for the wrapper's per-cycle work.
let proposals: [u32; K] = (0..K).map(|_| draft.decode_one(...)).collect();
let target_logits = target.forward_prefill_paged_all_logits(
    &[last_emitted, proposals[0], ..., proposals[K-1]],
).await?;
let target_argmax = target_logits.iter().map(|l| l.argmax()).collect();
let accepted = proposals.iter().zip(&target_argmax)
    .take_while(|(p, t)| p == t).count();
emit_n_tokens(&proposals[..accepted]);
emit_one_token(target_argmax[accepted]); // "free" token at first disagreement
draft.rollback_kv(K - accepted);         // un-advance rejected positions
target.rollback_kv(K - accepted);

Verification (Silicon)

On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):

test speculative_batched_verify_paged_matches_sequential_decode_one ... ok
sequential a_tokens: [2776, 264, 501, 1196, 1588]
batched b_tokens: [2776, 264, 501, 1196, 1588]
test speculative_decode_accept_rate_in_reasonable_range ... ok
DIAG spec step 0: proposals=[11, 323, 279, 3974] accept_count=0
DIAG spec step 1: proposals=[1986, 374, 264, 11416] accept_count=0
DIAG spec step 2: proposals=[3974, 13876, 38835, 34208] accept_count=4
DIAG spec step 3: proposals=[279, 15678, 5562, 11] accept_count=3
DIAG spec step 4: proposals=[785, 3974, 13876, 38835] accept_count=4
recent_accept_rate after 50 steps: 0.955
test speculative_decode_greedy_matches_target_only_greedy ... ok
test speculative_decode_kv_position_advances_by_emitted_count ... ok
test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 237.40s
STEP 7 EXIT: 0

Four pins, all green:

| Pin | Claim |
| --- | --- |
| speculative_batched_verify_paged_matches_sequential_decode_one | Batched K+1 verify forward produces the same tokens as K+1 sequential decode_one calls. |
| speculative_decode_accept_rate_in_reasonable_range | Over 50 steps, draft argmax matches target argmax at 95.5% of proposed positions. |
| speculative_decode_greedy_matches_target_only_greedy | End-to-end emit sequence equals target-only greedy. No quality loss. |
| speculative_decode_kv_position_advances_by_emitted_count | KV caches advance by the accepted-token count, not the proposed-token count. Cache state stays sound under rejection. |

Run the pins yourself (requires BOTH GGUFs):

FABRIC_TEST_MODEL_PATH_DRAFT=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
FABRIC_TEST_MODEL_PATH_TARGET=/opt/grafos/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
cargo test --release \
-p grafos-inference-engine \
--features cuda,speculative-decode,test-helpers \
--test cuda_engine_e2e \
-- --ignored --nocapture --test-threads=1 \
speculative_decode_greedy_matches_target_only_greedy \
speculative_decode_kv_position_advances_by_emitted_count \
speculative_decode_accept_rate_in_reasonable_range \
speculative_batched_verify_paged_matches_sequential_decode_one

--test-threads=1 is mandatory on g6.xlarge (16 GiB RAM): 3 pins × 2 engines × ~1.5 GiB each exhausts host memory.

Failure Modes

  • Causal mask drift between batched and sequential paths. The paged batched-verify FA kernel masks at kg > q_start_pos + q_row_local. If a future kernel change in either the contiguous or the paged path reverts to local-only Q indexing, batched and sequential results will diverge. Recipe 63’s cuda_engine_e2e_intra_kernel_revoke_bounds pin will continue to pass; only this recipe’s first pin will catch the divergence.
  • KV cache rollback bugs. If rollback_kv(K - accepted) un-advances by the wrong count, subsequent decodes operate against stale KV positions. The speculative_decode_kv_position_advances_by_emitted_count pin catches this.
  • Low acceptance rate (≪0.5) makes speculation slower than baseline. A poor draft/target affinity (different architectures, different tokenizers, very different sizes) produces α near 0; speculation pays the draft cost per cycle for no throughput gain. Use the recent_accept_rate event to detect and disable speculation dynamically.
  • Per-engine revoke during a cycle. If the draft revokes mid-cycle, the target’s verify has nothing to verify against. The wrapper bails with the revocation error; the application must re-acquire (Recipe 64).
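
The first failure mode in the list above is easiest to see on the host side. The sketch below models only the mask predicate quoted there, kg > q_start_pos + q_row_local, next to the local-only variant the text warns about. The function names are invented for this illustration, and the real check lives inside the CUDA kernel, not in Rust.

```rust
/// Mask predicate from the text: key position `kg` (global index) is hidden
/// from the query at local row `q_row_local` of a chunk starting at `q_start_pos`.
fn masked_global(kg: usize, q_start_pos: usize, q_row_local: usize) -> bool {
    kg > q_start_pos + q_row_local
}

/// The regression described in the text: local-only Q indexing,
/// which ignores `q_start_pos` entirely.
fn masked_local_only(kg: usize, _q_start_pos: usize, q_row_local: usize) -> bool {
    kg > q_row_local
}

/// Number of key positions a query row may attend to, out of `kv_len` keys.
fn visible_keys(
    mask: fn(usize, usize, usize) -> bool,
    kv_len: usize,
    q_start_pos: usize,
    q_row_local: usize,
) -> usize {
    (0..kv_len)
        .filter(|&kg| !mask(kg, q_start_pos, q_row_local))
        .count()
}
```

With 10 cached positions and a 5-row verify chunk (`q_start_pos = 10`, `kv_len = 15`), row 0 should see 11 keys under the global predicate but only 1 under local-only indexing. The two predicates agree when `q_start_pos = 0`, so a test that only exercises prefill from position zero would not catch the regression.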

Observability

Each cycle emits grafos_observe::Event::SpecCycle with {step, proposals, target_argmax, accepted_count, emitted_count}. Aggregating over many cycles produces the live acceptance_rate signal a scheduler can use to start/stop speculation per tenant.
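
One way a scheduler might turn per-cycle events into a live signal is a fixed-size rolling window. The sketch below is an assumption, not the grafos_observe API: `CycleSample` is a local stand-in carrying only the proposal count and accepted count from each cycle, and `AcceptRateWindow`, `should_speculate`, and the threshold parameter are hypothetical names.

```rust
use std::collections::VecDeque;

/// Local stand-in for one cycle's counts (hypothetical, not the real event type).
struct CycleSample {
    proposed: usize,
    accepted: usize,
}

/// Rolling acceptance rate over the most recent `capacity` cycles.
struct AcceptRateWindow {
    samples: VecDeque<CycleSample>,
    capacity: usize,
}

impl AcceptRateWindow {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record one cycle, evicting the oldest sample once the window is full.
    fn record(&mut self, proposed: usize, accepted: usize) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(CycleSample { proposed, accepted });
    }

    /// Accepted / proposed over the window; `None` until something is recorded.
    fn rate(&self) -> Option<f64> {
        let proposed: usize = self.samples.iter().map(|s| s.proposed).sum();
        let accepted: usize = self.samples.iter().map(|s| s.accepted).sum();
        if proposed == 0 {
            None
        } else {
            Some(accepted as f64 / proposed as f64)
        }
    }

    /// Keep speculating while there is no data yet or the rate clears `min_rate`.
    fn should_speculate(&self, min_rate: f64) -> bool {
        self.rate().map_or(true, |r| r >= min_rate)
    }
}
```

A short window reacts quickly to a shift in prompt distribution but is noisy; a longer one is smoother and slower to disable a draft that has stopped agreeing with the target. Which threshold to use depends on the measured draft-to-target cost ratio.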

Variations

  • Top-K / top-P sampling instead of greedy. Speculative decode with stochastic sampling requires a different acceptance rule (matching the underlying distribution via rejection sampling). Filed as a follow-up; not yet implemented.
  • N-way speculation (N drafts, 1 target). Multiple drafts propose in parallel; target verifies all proposals in one batched forward. Higher acceptance via diversity, higher draft cost.
  • Cross-tenant speculation sharing. Multiple tenants share the same target engine; their drafts run in their own leases. The scheduler routes proposals to the shared target’s batched verify queue.
  • Heterogeneous quant draft/target. Both Qwens above are Q4_K_M. Cross-quant speculation (e.g., Q4_K_M draft + Q8_0 target) works in principle but isn’t pinned here.

Why This Is Recipe 65

Recipes 61–64 establish that a single engine is correct, efficient, and revocable. Recipe 65 establishes that two engines compose into a higher-level inference primitive without losing those properties. Recipe 66 establishes the orthogonal composition (N tenants in ONE engine’s forward). Together they show fabricBIOS’ inference primitive composes across two orthogonal dimensions.