Recipe 61: Verifying Engine Output Against a Canonical Reference

Situation

You’re running an LLM inference engine on a fabric-leased GPU. CUDA’s “no error” return code only proves the kernels didn’t crash — they could still produce wrong logits. ML correctness needs an independent oracle. If your engine and a reference implementation diverge on the same GGUF + same prompt, you have a bug; the only question is which side has it.

The pattern: run BOTH engines on identical inputs and assert token-for-token equality on the greedy emit sequence. If the assertion fails, you have a precise reproducer: same inputs, divergent outputs, investigate.

What You Build

A regression pin that loads the same GGUF into your CUDA engine (FabricNativeCudaEngine) and a canonical reference engine (LlamaCppEngine), runs greedy decode for N tokens on the same prompt, and asserts the two emit-sequences are byte-identical. The pin runs on every silicon CI iteration and gates the engine from shipping any change that breaks output correctness.

Building Blocks

grafos_inference_engine::FabricNativeCudaEngine — the engine under test.
grafos_inference_engine::LlamaCppEngine — the canonical reference (calls into llama.cpp through llama-cpp-sys-2).
grafos_inference_engine::InferenceEngine — the trait both engines implement (same load_model / submit_prompt / decode_one surface).
grafos_tokenizer::Tokenizer — built from the GGUF’s own metadata so both engines tokenize identically.
grafos_gguf_reader::GgufReader — parses the GGUF bytes so the tokenizer can be built from the file's metadata.

Design

Resource Model

Both engines load the same Vec<u8> GGUF bytes through the WeightSource trait. Each builds its own internal state (tokenizer, KV cache, weights on device or host). The fabric CUDA engine wraps its device weights in LeaseRegion::CudaDevice (broker-owned); the llama.cpp engine manages host memory directly. Both produce Vec<u32> token IDs through the same trait.

Determinism

Same prompt → same input token IDs (the tokenizer is identical; both engines call Tokenizer::from_gguf_metadata on the same reader).
Greedy sampling (SamplingParams::greedy(1)) → no RNG, no temperature.
Both engines run on f32 internally for the matmul / softmax hotpath.
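
With no RNG and no temperature, greedy decoding reduces to an argmax over the logits, and the only remaining freedom is how ties are broken. The sketch below illustrates one deterministic policy. It assumes logits arrive as a slice of f32; the name greedy_argmax is hypothetical and not part of grafos_inference_engine, and the recipe does not say how either engine breaks ties.

```rust
/// Hypothetical helper: index of the largest logit.
/// Ties resolve to the lowest index and NaN entries are skipped, so the
/// result depends only on the input values.
fn greedy_argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        // Strict `>` keeps the earliest index when two logits are equal.
        let better = match best {
            Some((_, b)) => v > b,
            None => true,
        };
        if better {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i as u32)
}
```

If the two engines used different tie-breaking rules, an exact tie would show up as a divergence even though both computed identical logits.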

Isolation and Safety

Each engine owns its own KV cache. No state crosses between them.
The pin tokenizes ONCE and shares the Vec<u32> between both engines. Tokenization is the one place a divergence could otherwise hide, and sharing the IDs rules it out.

Walkthrough (Implementation Sketch)

1. Tokenize the Prompt Once

use std::path::Path;

// Build the tokenizer from the GGUF's own metadata and encode the prompt once.
fn tokenize_with_gguf(path: &Path, prompt: &str) -> Vec<u32> {
    let bytes = std::fs::read(path).expect("read GGUF for tokenizer");
    let reader = grafos_gguf_reader::GgufReader::parse(&bytes).expect("parse GGUF");
    let tokenizer = grafos_tokenizer::Tokenizer::from_gguf_metadata(&reader)
        .expect("build tokenizer from GGUF metadata");
    tokenizer.encode(prompt).expect("tokenize prompt")
}

let prompt_ids = tokenize_with_gguf(&model_path, "Hello, world!");

2. Drive Each Engine Greedily

// The function uses `?`, so it must return a Result. Box<dyn Error> is an
// assumption here; substitute the crate's own error type.
async fn drive(
    engine: &mut impl InferenceEngine,
    prompt_ids: &[u32],
    n_steps: usize,
) -> Result<Vec<i64>, Box<dyn std::error::Error>> {
    let model = engine.load_model(/* WeightSource */, cfg).await?;
    // Fresh sequence per run so no K/V rows carry over.
    let mut seq = engine.create_sequence(&model, 1024)?;
    engine.submit_prompt(&mut seq, prompt_ids)?;
    let sampling = SamplingParams::greedy(n_steps as u32);
    let mut out = Vec::with_capacity(n_steps);
    // One decode_one call per step keeps both engines on the same code path.
    for _ in 0..n_steps {
        out.push(engine.decode_one(&mut seq, &sampling).await? as i64);
    }
    Ok(out)
}

let fabric_cuda = drive(&mut FabricNativeCudaEngine::new(entropy), &prompt_ids, 10).await?;
let llama_cpp   = drive(&mut LlamaCppEngine::new(entropy)?, &prompt_ids, 10).await?;

3. Assert Byte-Exact Equality

assert_eq!(
    fabric_cuda, llama_cpp,
    "GROUND-TRUTH MISMATCH: FabricNativeCudaEngine output != LlamaCppEngine output"
);
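
A bare assert_eq! prints both vectors but leaves the reader to find the diverging step. A minimal sketch of a helper that names the first mismatching step is shown below; first_divergence and assert_same_tokens are hypothetical names introduced for illustration, and the i64 element type simply mirrors what drive returns.

```rust
/// Hypothetical helper: first step at which two emit sequences differ.
/// Returns None when the sequences are identical (same length, same tokens).
fn first_divergence(a: &[i64], b: &[i64]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        // Equal on the shared prefix: a length mismatch diverges at `common`.
        .or(if a.len() != b.len() { Some(common) } else { None })
}

/// Panics with the diverging step and the token each engine emitted there.
fn assert_same_tokens(fabric_cuda: &[i64], llama_cpp: &[i64]) {
    if let Some(step) = first_divergence(fabric_cuda, llama_cpp) {
        panic!(
            "GROUND-TRUTH MISMATCH at step {}: fabric_cuda={:?} llama_cpp={:?}",
            step,
            fabric_cuda.get(step),
            llama_cpp.get(step)
        );
    }
}
```

Calling assert_same_tokens(&fabric_cuda, &llama_cpp) gives the same pass/fail behavior as the assertion above, with the step index in the failure message.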

Verification (Silicon)

On AWS L4 (i-0b00f3d6f383200d4, 2026-05-22, commit 48791ac1):

[engine-quality] tokenized "Hello, world!" into 4 ids: [9707, 11, 1879, 0]
[engine-quality] fabric_cuda emitted: [358, 2776, 264, 3162, 15754, 323, 358, 2776, 5023, 3238]
[engine-quality] llama_cpp   emitted: [358, 2776, 264, 3162, 15754, 323, 358, 2776, 5023, 3238]
[engine-quality] no divergence over 10 steps

10/10 byte-identical greedy tokens between the two engines. Tokens decode to “I'm a software developer and I'm currently working”.

Run the pin yourself:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
cargo test --release \
    -p grafos-inference-engine \
    --features cuda,llama-cpp \
    --test qwen_05b_cuda_vs_llama_cpp \
    -- --ignored --nocapture
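
The command passes the model location through FABRIC_TEST_MODEL_PATH. How the real test reads that variable is not shown in this recipe; the sketch below is one plausible shape, with hypothetical helper names, that treats an unset or blank value as "no model available" so the test can skip with a clear message.

```rust
use std::path::PathBuf;

/// Hypothetical helper: turn the raw value of FABRIC_TEST_MODEL_PATH into a path.
/// Unset or blank values yield None so the caller can skip cleanly.
fn model_path_from(raw: Option<String>) -> Option<PathBuf> {
    raw.map(|s| s.trim().to_owned())
        .filter(|s| !s.is_empty())
        .map(PathBuf::from)
}

/// Reads the variable that the cargo invocation sets.
fn model_path_from_env() -> Option<PathBuf> {
    model_path_from(std::env::var("FABRIC_TEST_MODEL_PATH").ok())
}
```

Splitting the parsing from the environment read keeps the logic testable without mutating process-wide state.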

Failure Modes

Argmax-flip from f32 implementation drift. Two correct engines computing the same logits in f32 can round differently at the same step, for example because they accumulate sums in a different order. If the top two logits are close enough that the rounding difference straddles the argmax boundary, one engine's argmax flips. This is INHERENT, not a bug. Choose a prompt where the next-token argmax has a comfortable margin for N tokens (e.g., “Hello, world!” stays aligned for ~19 tokens on Qwen 0.5B; “Hello,” diverges at step 2).
Tied-embedding vs explicit LM head mismatch. Some GGUFs omit output.weight and rely on token_embd for the LM head. If only one engine handles this case, the LM head logits differ → divergence from step 0. Both engines must agree on the tied/untied policy.
KV cache state contamination. A retried decode after a failed submit_prompt may inherit stale K/V rows. Always start each pin with a fresh sequence handle.
Different sampling sources. If you pass greedy(N) with N>1 to one engine and greedy(1) in a loop to the other, semantic equivalence holds but the engines may take different code paths. Stay on decode_one per step for symmetry.
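
The argmax-flip failure mode comes down to two facts: f32 addition is not associative, and a flip is only possible when the top two logits are nearly tied. The sketch below shows both. All function names are hypothetical, and the recipe does not state what margin counts as "comfortable", so no threshold is hard-coded.

```rust
/// Gap between the largest and second-largest finite logit. A small gap means
/// a rounding difference between engines can flip the greedy choice.
/// Returns None when fewer than two finite logits are present.
fn argmax_margin(logits: &[f32]) -> Option<f32> {
    let mut top = f32::NEG_INFINITY;
    let mut second = f32::NEG_INFINITY;
    let mut seen = 0usize;
    for &v in logits.iter().filter(|v| v.is_finite()) {
        seen += 1;
        if v > top {
            second = top;
            top = v;
        } else if v > second {
            second = v;
        }
    }
    if seen >= 2 {
        Some(top - second)
    } else {
        None
    }
}

/// Sums left to right.
fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Sums right to left: same inputs, different rounding.
fn sum_backward(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0, |acc, &x| acc + x)
}
```

For the input [1.0e8, -1.0e8, 1.0], sum_forward returns 1.0 while sum_backward returns 0.0, because the 1.0 is absorbed when it is added to -1.0e8 first. Logging argmax_margin per step while picking a prompt is one way to see how close a trajectory runs to a flip.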

Observability

Each decode_one call should emit a grafos_observe::Event::Decoded record with {step, emitted_token, argmax_logit}. Pair the events from both engines by step to surface divergence the moment it appears, not at the end of the loop.
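
Pairing by step can be sketched as a small buffer that holds an event until the other engine reports the same step, then compares immediately. The Decoded struct below is a hypothetical stand-in that mirrors only the three fields named above; it is not the real grafos_observe::Event type, and Side and StepPairer are likewise illustrative names.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of the fields the recipe lists for a decoded event.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Decoded {
    step: usize,
    emitted_token: u32,
    argmax_logit: f32,
}

/// Which engine produced an event.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    FabricCuda,
    LlamaCpp,
}

/// Buffers events until both engines have reported a step, then compares.
#[derive(Default)]
struct StepPairer {
    pending: BTreeMap<usize, (Side, Decoded)>,
}

impl StepPairer {
    /// Returns the two events (in arrival order) as soon as a step's tokens disagree.
    fn observe(&mut self, side: Side, ev: Decoded) -> Option<(Decoded, Decoded)> {
        match self.pending.remove(&ev.step) {
            // The other engine already reported this step: compare now.
            Some((prev_side, prev)) if prev_side != side => {
                (prev.emitted_token != ev.emitted_token).then_some((prev, ev))
            }
            // First report for this step (or a repeat from the same side): buffer it.
            _ => {
                self.pending.insert(ev.step, (side, ev));
                None
            }
        }
    }
}
```

Because the returned pair carries argmax_logit from both sides, a mismatch report can show how close the two logits were at the diverging step.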

Variations

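Different reference engine. Replace LlamaCppEngine with vLLM, HuggingFace transformers, or another implementation. The pin shape doesn’t change; only the reference side does.
Different decode length. Increase n_steps to validate longer trajectories. Bugs that only surface at later steps stay hidden in a short run, but longer runs also raise the chance of an inherent argmax flip, so re-check the prompt's stable trajectory length first.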
Different model. The pin parameterizes over GGUF path — swap Qwen 0.5B for Qwen 1.5B, Llama 3.2 3B, or any compatible GGUF. Each model has its own argmax-stable trajectory length.
Per-layer activation diff. When step-N divergence occurs, layer-bisect by extracting hidden activations from both engines and asserting layer-wise equality. The “per-layer activation comparison” slice is the archetype for this debugging mode.
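
The layer-bisect idea can be sketched as a scan for the first layer whose activations disagree. This sketch assumes each engine's activations have already been copied to host memory as one Vec<f32> per layer; how they are extracted is not covered here. The function names and the tol parameter are hypothetical; pass tol = 0.0 for strict equality.

```rust
/// Largest absolute element-wise difference; None if the lengths differ.
/// A NaN difference is reported as infinity so it is never silently dropped.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let worst = a
        .iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0_f32, |m, d| if d.is_nan() { f32::INFINITY } else { m.max(d) });
    Some(worst)
}

/// Index of the first layer whose activations differ by more than `tol`,
/// or whose shapes do not match. None means every compared layer agrees.
fn first_bad_layer(ours: &[Vec<f32>], reference: &[Vec<f32>], tol: f32) -> Option<usize> {
    ours.iter()
        .zip(reference)
        .position(|(a, b)| match max_abs_diff(a, b) {
            Some(d) => d > tol,
            None => true,
        })
}
```

The first layer reported is where the two engines start to disagree, which narrows the search to that layer's kernels and inputs.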

Why This Is Recipe 61

This is the foundational verification recipe for fabric-leased inference. Recipes 62 (memory reduction), 63 (mid-decode revoke), 64 (FENCED state), 65 (speculative decode), and 66 (multi-tenant batched decode) all assume the engine produces correct output. Without this pin green, every other claim about the engine is empty.