Recipe 66: Batching Four Tenants Into One Decode Forward Pass

Situation

Recipe 65 composes two engines for one tenant’s throughput. Recipe 66 covers the orthogonal axis: N independent tenants — different prompts, different KV caches, no shared state — share ONE GPU forward pass, each emitting one token per cycle.

Without this primitive, fabric-leased inference can serve only one request per GPU per decode forward. With it, the GPU’s per-forward cost amortizes across N tenants, and tail latency stays bounded by the slowest tenant in the batch rather than by queue depth. This is the substrate for production LLM serving on shared hardware.

The contract: each tenant’s emitted token must be byte-identical to the token that tenant would have gotten from a serial per-tenant decode_one call. No cross-tenant leakage in KV state, attention pattern, softmax normalizer, or logits. The fabric does the multiplexing; per-tenant correctness is preserved.

What You Build

A multi-tenant batched-decode pin: load N=4 distinct prompts into 4 sequence handles on one paged engine, run the same fixed number of decode steps (10 in the run shown under Verification) two ways (serially via decode_one per tenant, and batched via decode_step of all 4 handles at once), and assert the two per-tenant emit sequences are token-for-token identical.

Building Blocks

FabricNativeCudaEngine::new_paged(entropy, pool_size) — the paged-cache engine constructor; required because batched decode uses block-table indirection for per-sequence KV.
engine.create_sequence(&model_handle, max_kv) — per-tenant sequence handle, independent KV cache slot.
engine.decode_step(handles, samplings) — the batched decode entrypoint; takes N mutable handles, returns N emitted tokens.
paged_flash_attention_multi_head_fused_q_f32_cuda — the FA kernel with per-sequence block-table indirection. Reads K/V from [n_kv_heads, max_seq_len, head_dim] pools through each sequence’s k_block_indices_dev / v_block_indices_dev.
paged_kv_rope_v_append_batched — the batched K/V append kernel that writes each sequence’s new K/V row to its own block slot.

Design

Resource Model

The paged engine owns ONE K/V pool of shape [n_kv_heads, max_seq_len, head_dim] (per layer). Each tenant’s sequence handle holds:

A per-sequence position counter (kv_position).
A per-sequence k_block_indices_dev / v_block_indices_dev buffer mapping logical position → physical block slot in the shared pool.
A per-sequence Q-bias / RoPE configuration (typically shared across tenants of the same model, but distinct handles).

The pool is allocated once per engine; tenants lease block slots through the pool’s allocator. Slots are released when the sequence handle drops.
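
The lease-and-release lifecycle described above can be sketched on the CPU. This is a toy model, not the engine’s allocator: BlockPool, Sequence, append_token, and the block_tokens granularity are all hypothetical names and assumptions introduced for illustration. It shows only the shape of the idea: a shared free list, a per-sequence block table and position counter, and slots returned to the pool when the sequence drops.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy stand-in for the shared K/V block pool (hypothetical type).
struct BlockPool {
    free: Vec<u32>, // physical block slots not currently leased
}

impl BlockPool {
    fn new(pool_size: u32) -> Self {
        // Reversed so that pop() hands out slot 0 first.
        BlockPool { free: (0..pool_size).rev().collect() }
    }
    fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Toy per-tenant sequence: a position counter plus a block table.
struct Sequence {
    pool: Rc<RefCell<BlockPool>>,
    kv_position: usize,
    block_table: Vec<u32>, // logical block index -> physical slot
    block_tokens: usize,   // tokens per block (assumed granularity)
}

impl Sequence {
    fn new(pool: Rc<RefCell<BlockPool>>, block_tokens: usize) -> Self {
        Sequence { pool, kv_position: 0, block_table: Vec::new(), block_tokens }
    }

    /// Reserve room for one more token, leasing a new block when the last one is full.
    fn append_token(&mut self) -> Result<(), &'static str> {
        if self.kv_position == self.block_table.len() * self.block_tokens {
            let slot = self.pool.borrow_mut().free.pop().ok_or("OutOfBlocks")?;
            self.block_table.push(slot);
        }
        self.kv_position += 1;
        Ok(())
    }
}

impl Drop for Sequence {
    fn drop(&mut self) {
        // Return every leased slot to the shared pool.
        self.pool.borrow_mut().free.extend(self.block_table.drain(..));
    }
}

fn main() {
    let pool = Rc::new(RefCell::new(BlockPool::new(4)));
    {
        let mut a = Sequence::new(pool.clone(), 16);
        let mut b = Sequence::new(pool.clone(), 16);
        for _ in 0..17 { a.append_token().unwrap(); } // spills into a second block
        for _ in 0..3 { b.append_token().unwrap(); }
        println!("free while leased: {}", pool.borrow().free_blocks());
    }
    println!("free after drop: {}", pool.borrow().free_blocks());
}
```

Tying release to Drop mirrors the statement that slots are released when the sequence handle drops; the real engine’s block granularity and error type may differ.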

Batched Forward Geometry

For N=4 sequences each contributing 1 token to the next forward:

Embedding lookup: [N, hidden] (one row per tenant).
Q/K/V matmuls: each projection runs over the same [N, hidden] input, computed once for all N tenants.
Per-sequence FA: ONE multi-head FA kernel launches with n_seqs=N; each thread block handles one (sequence, head) pair, reading K/V through that sequence’s block-table indirection. No cross-tenant memory access.
K/V append: per-sequence write to each sequence’s block slot.
FFN: shared across all N tenants (no per-sequence state).
LM head: [N, vocab]; sample independently per tenant.

The substrate that makes this safe is the block-table indirection in the FA and K/V append kernels: every K/V read and write goes through the tenant’s own indices, so as long as each block table holds only slots leased to that tenant, one tenant’s forward cannot touch another tenant’s K/V rows.
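
A CPU reference can make the per-sequence attention step concrete. The sketch below is a simplified single-head version written for illustration, not the CUDA kernel: paged_attention, batched_attention, and the flat [n_slots, head_dim] pool layout are assumptions. Each sequence computes scores, its own softmax normalizer, and its output using only the pool rows named in its own block table.

```rust
/// Single-head attention for one sequence, reading K/V rows through its block table.
/// `pool_k` and `pool_v` are flat [n_slots, head_dim] pools shared by all sequences.
fn paged_attention(
    q: &[f32],
    pool_k: &[f32],
    pool_v: &[f32],
    slots: &[usize], // logical position -> physical row in the pool
    head_dim: usize,
) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    // Scores over this sequence's own rows only.
    let scores: Vec<f32> = slots
        .iter()
        .map(|&s| {
            let k = &pool_k[s * head_dim..(s + 1) * head_dim];
            q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();
    // Per-sequence softmax normalizer.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let norm: f32 = weights.iter().sum();
    let mut out = vec![0.0f32; head_dim];
    for (&s, w) in slots.iter().zip(&weights) {
        let v = &pool_v[s * head_dim..(s + 1) * head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / norm) * x;
        }
    }
    out
}

/// One batched step: every sequence uses only its own query and block table.
fn batched_attention(
    qs: &[Vec<f32>],
    pool_k: &[f32],
    pool_v: &[f32],
    tables: &[Vec<usize>],
    head_dim: usize,
) -> Vec<Vec<f32>> {
    qs.iter()
        .zip(tables)
        .map(|(q, t)| paged_attention(q, pool_k, pool_v, t, head_dim))
        .collect()
}

fn main() {
    let head_dim = 2;
    // Rows 0 and 2 belong to tenant A, rows 1 and 3 to tenant B (interleaved on purpose).
    let mut pool_k = vec![1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.25, 0.75];
    let pool_v = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let tables = vec![vec![0, 2], vec![1, 3]];
    let qs = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let before = batched_attention(&qs, &pool_k, &pool_v, &tables, head_dim);
    pool_k[3] = 9.0; // overwrite part of one of tenant B's K rows
    let after = batched_attention(&qs, &pool_k, &pool_v, &tables, head_dim);
    println!("tenant A unchanged: {}", before[0] == after[0]);
    println!("tenant B changed:   {}", before[1] != after[1]);
}
```

Interleaving the two tenants’ rows in the pool is deliberate: isolation comes from the indirection, not from contiguous placement.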

Isolation and Safety

Per-sequence block-table indirection in FA prevents cross-tenant K/V reads.
Per-sequence KV positions advance independently — tenant A at position 47 doesn’t pollute tenant B at position 14.
The shared K/V pool is allocated through a quota-aware allocator (see Recipe 26 for the multi-tenant preemption layer); the per-tenant max-blocks quota prevents one tenant from starving the pool.
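
The per-tenant quota can be pictured as a thin check in front of the pool. The following is a minimal sketch under assumptions: QuotaAllocator, LeaseError, and the u32 tenant id are hypothetical and are not taken from Recipe 26 or the engine. It shows one tenant hitting its max-blocks quota while another tenant can still lease.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LeaseError {
    QuotaExceeded,
    OutOfBlocks,
}

/// Hypothetical quota layer in front of a shared block pool.
struct QuotaAllocator {
    free_blocks: usize,
    max_blocks_per_tenant: usize,
    held: HashMap<u32, usize>, // tenant id -> blocks currently held
}

impl QuotaAllocator {
    fn new(pool_size: usize, max_blocks_per_tenant: usize) -> Self {
        QuotaAllocator { free_blocks: pool_size, max_blocks_per_tenant, held: HashMap::new() }
    }

    /// Lease one block to `tenant`, checking the quota before the pool.
    fn lease(&mut self, tenant: u32) -> Result<(), LeaseError> {
        let held = self.held.get(&tenant).copied().unwrap_or(0);
        if held >= self.max_blocks_per_tenant {
            return Err(LeaseError::QuotaExceeded);
        }
        if self.free_blocks == 0 {
            return Err(LeaseError::OutOfBlocks);
        }
        self.held.insert(tenant, held + 1);
        self.free_blocks -= 1;
        Ok(())
    }

    /// Return all of a tenant's blocks, as when its sequence handle drops.
    fn release_all(&mut self, tenant: u32) {
        if let Some(n) = self.held.remove(&tenant) {
            self.free_blocks += n;
        }
    }
}

fn main() {
    let mut alloc = QuotaAllocator::new(8, 3);
    let granted = (0..5).filter(|_| alloc.lease(0).is_ok()).count();
    println!("tenant 0 granted {granted} of 5 requests");
    println!("tenant 1 can still lease: {}", alloc.lease(1).is_ok());
    alloc.release_all(0);
    println!("free after tenant 0 drops: {}", alloc.free_blocks);
}
```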

Walkthrough (Implementation Sketch)

1. Create N Sequences From One Engine

let entropy: Arc<dyn EntropySource> = Arc::new(FixedEntropy(0xDEADBEEF_CAFEBABE));
let mut engine = FabricNativeCudaEngine::new_paged(entropy, 256);
let model = engine.load_model(&mut source, cfg).await?;

let prompts = [
    tokenizer.encode("Hello, ")?,
    tokenizer.encode("The quick ")?,
    tokenizer.encode("In a ")?,
    tokenizer.encode("Once upon ")?,
];
let n_seqs = prompts.len();

// Propagate a failed create_sequence with `?` instead of panicking.
let mut handles: Vec<FabricNativeCudaSequenceHandle> = (0..n_seqs)
    .map(|_| engine.create_sequence(&model, 2048))
    .collect::<Result<_, _>>()?;
for (h, p) in handles.iter_mut().zip(&prompts) {
    engine.submit_prompt(h, p)?;
}

2. Decode All N Tenants Per Step

let samplings: Vec<SamplingParams> =
    (0..n_seqs).map(|_| SamplingParams::greedy(1)).collect();

for step in 0..n_decode_steps {
    let mut handle_refs: Vec<&mut FabricNativeCudaSequenceHandle> =
        handles.iter_mut().collect();
    let toks = engine.decode_step(&mut handle_refs, &samplings).await?;
    // toks[i] is tenant i's emitted token at this step.
}
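
The equivalence pin in the next step compares per-tenant token sequences, so the per-step outputs have to be accumulated per tenant. A minimal sketch of that bookkeeping follows; push_step is a hypothetical helper, and u32 as the token type is an assumption, since the engine’s token type is not shown here.

```rust
/// Append one decode step's outputs (one token per tenant) to the per-tenant streams.
fn push_step(streams: &mut [Vec<u32>], step_tokens: &[u32]) {
    assert_eq!(streams.len(), step_tokens.len(), "expected one token per tenant");
    for (stream, &tok) in streams.iter_mut().zip(step_tokens) {
        stream.push(tok);
    }
}

fn main() {
    let n_seqs = 4;
    let mut streams: Vec<Vec<u32>> = vec![Vec::new(); n_seqs];
    // Made-up token ids standing in for three decode_step results.
    let steps: [[u32; 4]; 3] = [[11, 21, 31, 41], [12, 22, 32, 42], [13, 23, 33, 43]];
    for toks in &steps {
        push_step(&mut streams, toks);
    }
    // streams[i] now holds tenant i's tokens in emission order.
    println!("{:?}", streams[0]);
}
```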

3. Equivalence Pin (Test Harness)

// Run 1: batched.
let batched_tokens = run_batched(&model, &prompts, n_decode_steps).await;

// Run 2: serial (one sequence at a time on a fresh engine).
let serial_tokens = run_serial(&model, &prompts, n_decode_steps).await;

// Assert per-tenant equivalence.
for (seq_i, (batched_seq, serial_seq)) in
    batched_tokens.iter().zip(&serial_tokens).enumerate()
{
    assert_eq!(
        batched_seq, serial_seq,
        "seq {} divergence: batched={:?} serial={:?}",
        seq_i, batched_seq, serial_seq,
    );
}
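
To see what the pin does and does not catch without a GPU, the harness can be modeled with a toy decoder. Everything below is hypothetical: ToySeq, toy_token, and these run_serial / run_batched bodies are stand-ins with different signatures from the real helpers, and the "model" is an arbitrary arithmetic function. The sketch shows a correct batched step agreeing with serial decode, and a deliberately broken step (one shared position for the whole batch) being flagged at the first diverging sequence and step.

```rust
/// Toy stand-in for a sequence: prompt plus emitted tokens.
struct ToySeq {
    history: Vec<u32>,
}

/// Arbitrary deterministic "model": depends on the last token and a position.
fn toy_token(last: u32, pos: u32) -> u32 {
    last.wrapping_mul(31).wrapping_add(pos) % 1000
}

fn decode_one(seq: &mut ToySeq) -> u32 {
    let last = *seq.history.last().unwrap_or(&0);
    let t = toy_token(last, seq.history.len() as u32);
    seq.history.push(t);
    t
}

/// Correct batched step: each sequence advances with its own position.
fn decode_step(seqs: &mut [ToySeq]) -> Vec<u32> {
    seqs.iter_mut().map(decode_one).collect()
}

/// Broken batched step: one shared position for the whole batch.
fn decode_step_shared_position(seqs: &mut [ToySeq]) -> Vec<u32> {
    let shared = seqs.iter().map(|s| s.history.len()).max().unwrap_or(0) as u32;
    seqs.iter_mut()
        .map(|s| {
            let t = toy_token(*s.history.last().unwrap_or(&0), shared);
            s.history.push(t);
            t
        })
        .collect()
}

fn run_serial(prompts: &[Vec<u32>], steps: usize) -> Vec<Vec<u32>> {
    prompts
        .iter()
        .map(|p| {
            let mut s = ToySeq { history: p.clone() };
            (0..steps).map(|_| decode_one(&mut s)).collect::<Vec<u32>>()
        })
        .collect()
}

fn run_batched(
    prompts: &[Vec<u32>],
    steps: usize,
    step_fn: fn(&mut [ToySeq]) -> Vec<u32>,
) -> Vec<Vec<u32>> {
    let mut seqs: Vec<ToySeq> =
        prompts.iter().map(|p| ToySeq { history: p.clone() }).collect();
    let mut out: Vec<Vec<u32>> = vec![Vec::new(); prompts.len()];
    for _ in 0..steps {
        for (stream, tok) in out.iter_mut().zip(step_fn(seqs.as_mut_slice())) {
            stream.push(tok);
        }
    }
    out
}

/// First (sequence, step) at which the two runs disagree, if any.
fn first_divergence(a: &[Vec<u32>], b: &[Vec<u32>]) -> Option<(usize, usize)> {
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        if let Some(step) = x.iter().zip(y).position(|(p, q)| p != q) {
            return Some((i, step));
        }
        if x.len() != y.len() {
            return Some((i, x.len().min(y.len())));
        }
    }
    None
}

fn main() {
    // Prompts of different lengths, so per-sequence positions differ.
    let prompts = vec![vec![1, 2, 3], vec![7], vec![4, 5]];
    let serial = run_serial(&prompts, 10);
    let good = run_batched(&prompts, 10, decode_step);
    let bad = run_batched(&prompts, 10, decode_step_shared_position);
    println!("correct batched: {:?}", first_divergence(&good, &serial));
    println!("shared position: {:?}", first_divergence(&bad, &serial));
}
```

Reporting the first diverging (sequence, step) pair, rather than only a pass or fail, narrows down which tenant’s state went wrong and when.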

Verification

Silicon evidence (AWS L4, 2026-05-23):

running 1 test
test cuda_engine_e2e_batched_decode_matches_serial_decode_at_n_eq_4 ... \
    batched_decode_matches_serial_decode_at_n_eq_4 PASS: 4 sequences x 10 steps all agree
ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 5.37s

The equivalence pin is GREEN. Every emitted token across 4 tenants × 10 steps agrees between the batched-decode path and the per-tenant serial-decode path. Combined with Recipe 61’s silicon evidence (serial path is byte-identical to llama.cpp), the per-tenant emitted-token correctness contract holds in batched mode too.

Adjacent evidence from the same run exercises the underlying batched-paged-forward kernel from a different angle: the speculative_batched_verify_paged_matches_sequential_decode_one pin (the substrate behind Recipe 65) emits identical token streams ([2776, 264, 501, 1196, 1588]) from the batched K+1 verify forward and from K+1 sequential decode_one calls. The shared kernel (paged FA + K/V append) is behaving consistently across both consumers.

Run the equivalence pin and its two companion pins (variable KV position and throughput) yourself; this requires the batched-decode feature:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
cargo test --release \
    -p grafos-inference-engine \
    --features cuda,batched-decode,test-helpers \
    --test cuda_engine_e2e \
    -- --ignored --nocapture --test-threads=1 \
    cuda_engine_e2e_batched_decode_matches_serial_decode_at_n_eq_4 \
    cuda_engine_e2e_batched_decode_variable_kv_position_matches_serial \
    cuda_engine_e2e_batched_decode_throughput_beats_serial_at_n_eq_4

Failure Modes

Cross-tenant K/V indirection bug. A buggy block-table indirection in FA could read tenant B’s K rows while computing tenant A’s attention. The equivalence pin catches this: tenant A’s emit would differ from its serial-path emit.
Per-sequence position counter drift. If batched decode advances all sequences by the same kv_position instead of per-tenant counters, sequences with different prompt lengths desynchronize. The cuda_engine_e2e_batched_decode_variable_kv_position_matches_serial pin (which uses prompts of different lengths) catches this.
Pool starvation. If the K/V pool allocator runs out of free blocks during batched decode, the engine returns OutOfBlocks. The pin sizes the pool generously (pool_size=256 for 4 sequences); production deployments must size based on worst-case concurrent contexts.
Throughput regression. The cuda_engine_e2e_batched_decode_throughput_beats_serial_at_n_eq_4 pin asserts that batched decode takes less wall-clock time than serial decode. A regression in the batched kernel, or the dispatcher routing M=N to a sub-optimal variant, breaks this. Less load-bearing than the equivalence pins, but it catches “we made batched correct but slow.”
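
For the pool-starvation case, worst-case sizing is simple arithmetic once a block granularity is fixed. The sketch below assumes a block holds block_tokens K/V rows and that each context can grow to prompt_len + max_new_tokens; both the helper names and the 16-token block size are illustrative assumptions, and the unit of the engine’s pool_size argument is not specified here.

```rust
/// Blocks needed to hold `tokens` K/V rows when each block stores `block_tokens` rows.
fn blocks_for(tokens: usize, block_tokens: usize) -> usize {
    (tokens + block_tokens - 1) / block_tokens
}

/// Worst-case demand: every concurrent context grows to prompt_len + max_new_tokens.
fn worst_case_blocks(contexts: &[(usize, usize)], block_tokens: usize) -> usize {
    contexts
        .iter()
        .map(|&(prompt_len, max_new)| blocks_for(prompt_len + max_new, block_tokens))
        .sum()
}

fn main() {
    // (prompt_len, max_new_tokens) per concurrent tenant; values are made up.
    let contexts = [(1024, 256), (4, 256), (100, 256), (100, 256)];
    let needed = worst_case_blocks(&contexts, 16);
    println!("worst-case blocks at 16 tokens per block: {needed}");
}
```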

Observability

Each batched decode step emits grafos_observe::Event::BatchedDecodeStep with {step, n_seqs, emitted_tokens, per_seq_kv_positions}. The operator dashboard shows live N (how many tenants are sharing) and per-tenant decoded-token-count.
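
A consumer of this event might keep running counters like the dashboard does. The sketch below uses a local stand-in struct, not the real grafos_observe type: the field names come from the event description above, but the field types, the Dashboard struct, and the assumption that batch index i maps to the same tenant across steps are all illustrative.

```rust
/// Local stand-in mirroring the fields named above; the real field types are assumptions.
struct BatchedDecodeStep {
    step: u64,
    n_seqs: usize,
    emitted_tokens: Vec<u32>,
    per_seq_kv_positions: Vec<usize>,
}

/// Hypothetical dashboard state: live batch width and per-slot decoded-token counts.
#[derive(Default)]
struct Dashboard {
    last_step: u64,
    live_n: usize,
    decoded_per_slot: Vec<u64>,
    kv_positions: Vec<usize>,
}

impl Dashboard {
    fn observe(&mut self, ev: &BatchedDecodeStep) {
        self.last_step = ev.step;
        self.live_n = ev.n_seqs;
        if self.decoded_per_slot.len() < ev.n_seqs {
            self.decoded_per_slot.resize(ev.n_seqs, 0);
        }
        // One emitted token per batch slot per step.
        for count in self.decoded_per_slot.iter_mut().take(ev.emitted_tokens.len()) {
            *count += 1;
        }
        self.kv_positions = ev.per_seq_kv_positions.clone();
    }
}

fn main() {
    let mut dash = Dashboard::default();
    // Made-up events: four tenants with different prompt lengths.
    dash.observe(&BatchedDecodeStep {
        step: 0,
        n_seqs: 4,
        emitted_tokens: vec![11, 21, 31, 41],
        per_seq_kv_positions: vec![48, 15, 9, 130],
    });
    dash.observe(&BatchedDecodeStep {
        step: 1,
        n_seqs: 4,
        emitted_tokens: vec![12, 22, 32, 42],
        per_seq_kv_positions: vec![49, 16, 10, 131],
    });
    println!("live N = {}, decoded per slot = {:?}", dash.live_n, dash.decoded_per_slot);
}
```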

Variations

N > 4. All current pins use N=4. Larger N exercises shared-memory pressure in the FA kernel and the K/V pool allocator’s block packing. Production deployments routinely run N=32 or N=64; the substrate supports it but isn’t pinned here.
Heterogeneous prompts. Mixed-length prompts (one tenant on a 1024-token context, another on a 4-token prompt) exercise the per-sequence position counter. The ..._variable_kv_position_matches_serial pin covers this.
Continuous batching (admission control). Recipe 26 (multi-tenant preemption) layers a continuous-batch scheduler on top of this primitive: it admits prompts dynamically, packs them into ongoing batches, and evicts on quota. The batched decode step here is the kernel-level substrate; continuous batching is the policy layer above.
Cross-tenant prefix sharing. When two tenants share a system prompt prefix, a prefix-cache layer can serve them from the same K/V block slots for the prefix portion and switch to per-tenant slots at the divergence point. Filed as a follow-up recipe.

Why This Is Recipe 66

Recipes 61–64 establish single-engine properties. Recipe 65 composes ACROSS engines (draft + target). Recipe 66 composes ACROSS tenants (N sequences in ONE engine’s forward). Together, 65 and 66 establish two-axis composability — the substrate for production LLM serving on fabric-leased hardware.