Recipe 61: Verifying Engine Output Against a Canonical Reference

Situation

You’re running an LLM inference engine on a fabric-leased GPU. CUDA’s “no error” return code only proves the kernels didn’t crash — they could still produce wrong logits. ML correctness needs an independent oracle. If your engine and a reference implementation diverge on the same GGUF + same prompt, you have a bug; the only question is which side has it.

The pattern: run both engines on identical inputs and assert token-for-token equality on the greedy emit sequence. If the assertion fails, you have a precise reproducer: same inputs, divergent outputs, ready to investigate.

What You Build

A regression pin that loads the same GGUF into your CUDA engine (FabricNativeCudaEngine) and a canonical reference engine (LlamaCppEngine), runs greedy decode for N tokens on the same prompt, and asserts the two emit-sequences are byte-identical. The pin runs on every silicon CI iteration and gates the engine from shipping any change that breaks output correctness.

Building Blocks

  • grafos_inference_engine::FabricNativeCudaEngine — the engine under test.
  • grafos_inference_engine::LlamaCppEngine — the canonical reference (calls into llama.cpp through llama-cpp-sys-2).
  • grafos_inference_engine::InferenceEngine — the trait both engines implement (same load_model / submit_prompt / decode_one surface).
  • grafos_tokenizer::Tokenizer — built from the GGUF’s own metadata so both engines tokenize identically.
  • grafos_gguf_reader::GgufReader — for the tokenizer.

Design

Resource Model

Both engines load the same Vec<u8> GGUF bytes through the WeightSource trait. Each builds its own internal state (tokenizer, KV cache, weights on device or host). The fabric CUDA engine wraps its device weights in LeaseRegion::CudaDevice (broker-owned); the llama.cpp engine manages host memory directly. Both produce Vec<u32> token IDs through the same trait.

Determinism

  • Same prompt → same input token IDs (the tokenizer is identical; both engines call Tokenizer::from_gguf_metadata on the same reader).
  • Greedy sampling (SamplingParams::greedy(1)) → no RNG, no temperature.
  • Both engines run on f32 internally for the matmul / softmax hotpath.

Isolation and Safety

  • Each engine owns its own KV cache. No state crosses between them.
  • The pin tokenizes once and shares the Vec<u32> between both engines, so a tokenizer difference cannot hide inside the comparison; any divergence must come from the engines themselves.

Walkthrough (Implementation Sketch)

1. Tokenize Once, Share

use std::path::Path;

// Build the tokenizer from the GGUF's own metadata and encode the prompt once.
fn tokenize_with_gguf(path: &Path, prompt: &str) -> Vec<u32> {
    let bytes = std::fs::read(path).expect("read GGUF for tokenizer");
    let reader = grafos_gguf_reader::GgufReader::parse(&bytes).expect("parse GGUF");
    let tokenizer = grafos_tokenizer::Tokenizer::from_gguf_metadata(&reader)
        .expect("build tokenizer from GGUF metadata");
    tokenizer.encode(prompt).expect("tokenize prompt")
}

let prompt_ids = tokenize_with_gguf(&model_path, "Hello, world!");

2. Drive Each Engine Greedily

async fn drive(
    engine: &mut impl InferenceEngine,
    prompt_ids: &[u32],
    n_steps: usize,
) -> Result<Vec<i64>, Box<dyn std::error::Error>> {
    // Assumes the engine's error types convert into Box<dyn Error>.
    // WeightSource and cfg are left elided in this sketch.
    let model = engine.load_model(/* WeightSource */, cfg).await?;
    let mut seq = engine.create_sequence(&model, 1024)?;
    engine.submit_prompt(&mut seq, prompt_ids)?;
    let sampling = SamplingParams::greedy(n_steps as u32);
    let mut out = Vec::with_capacity(n_steps);
    for _ in 0..n_steps {
        // One token per call keeps both engines on the same code path.
        out.push(engine.decode_one(&mut seq, &sampling).await? as i64);
    }
    Ok(out)
}

let fabric_cuda = drive(&mut FabricNativeCudaEngine::new(entropy), &prompt_ids, 10).await?;
let llama_cpp = drive(&mut LlamaCppEngine::new(entropy)?, &prompt_ids, 10).await?;

3. Assert Byte-Exact Equality

assert_eq!(
    fabric_cuda, llama_cpp,
    "GROUND-TRUTH MISMATCH: FabricNativeCudaEngine output != LlamaCppEngine output"
);
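
When the assertion fails, it helps to know the exact step at which the two sequences part ways. The following is a minimal sketch, not part of the pin as described: first_divergence and divergence_report are hypothetical helper names, and the report wording only loosely mirrors the "no divergence over 10 steps" log line shown in the verification output.

```rust
/// Index of the first step at which two emit sequences differ, or `None`
/// if they are identical. A length mismatch counts as a divergence at the
/// end of the shorter sequence.
fn first_divergence(a: &[i64], b: &[i64]) -> Option<usize> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x != y)
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}

/// Human-readable summary suitable for a log line or assertion message.
/// (Hypothetical helper; the message format is an assumption.)
fn divergence_report(fabric_cuda: &[i64], llama_cpp: &[i64]) -> String {
    match first_divergence(fabric_cuda, llama_cpp) {
        None => format!("no divergence over {} steps", fabric_cuda.len()),
        Some(step) => format!(
            "divergence at step {step}: fabric_cuda={:?} llama_cpp={:?}",
            fabric_cuda.get(step),
            llama_cpp.get(step)
        ),
    }
}
```

Passing divergence_report(&fabric_cuda, &llama_cpp) as the assert_eq! message would put the failing step directly in the test output, which is the starting point for a reproducer.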

Verification (Silicon)

On AWS L4 (i-0b00f3d6f383200d4, 2026-05-22, commit 48791ac1):

[engine-quality] tokenized "Hello, world!" into 4 ids: [9707, 11, 1879, 0]
[engine-quality] fabric_cuda emitted: [358, 2776, 264, 3162, 15754, 323, 358, 2776, 5023, 3238]
[engine-quality] llama_cpp emitted: [358, 2776, 264, 3162, 15754, 323, 358, 2776, 5023, 3238]
[engine-quality] no divergence over 10 steps

10/10 byte-identical greedy tokens between the two engines. The tokens decode to “I'm a software developer and I'm currently working”.

Run the pin yourself:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
cargo test --release \
-p grafos-inference-engine \
--features cuda,llama-cpp \
--test qwen_05b_cuda_vs_llama_cpp \
-- --ignored --nocapture

Failure Modes

  • Argmax flip from f32 implementation drift. Two correct engines computing the same logits in f32 can round differently at the same step (for example, because they accumulate a reduction in a different order); if the top two logits are close enough, one engine's argmax flips. This is inherent, not a bug. Choose a prompt whose next-token argmax keeps a comfortable margin for all N tokens (e.g., “Hello, world!” stays aligned for ~19 tokens on Qwen 0.5B; “Hello,” diverges at step 2).
  • Tied-embedding vs explicit LM head mismatch. Some GGUFs omit output.weight and rely on token_embd for the LM head. If only one engine handles this case, the LM head logits differ → divergence from step 0. Both engines must agree on the tied/untied policy.
  • KV cache state contamination. A retried decode after a failed submit_prompt may inherit stale K/V rows. Always start each pin with a fresh sequence handle.
  • Different sampling sources. If you pass greedy(N) with N>1 to one engine and greedy(1) in a loop to the other, semantic equivalence holds but the engines may take different code paths. Stay on decode_one per step for symmetry.
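
One way to pick a prompt with a "comfortable margin" is to measure the gap between the top two logits at each step. The sketch below assumes you can obtain the per-step logits as a slice of f32, which the trait surface described in this recipe does not show; argmax_margin, is_argmax_stable, and the min_margin threshold are hypothetical names chosen for illustration.

```rust
/// Returns (argmax index, top-1 logit minus top-2 logit) for one decode step.
/// Ties resolve to the lowest index. Returns `None` when fewer than two
/// logits are supplied, since no margin can be defined.
fn argmax_margin(logits: &[f32]) -> Option<(usize, f32)> {
    if logits.len() < 2 {
        return None;
    }
    let mut best = 0usize;
    let mut runner_up = f32::NEG_INFINITY;
    for (i, &v) in logits.iter().enumerate().skip(1) {
        if v > logits[best] {
            // New leader: the old leader becomes the runner-up.
            runner_up = logits[best];
            best = i;
        } else if v > runner_up {
            runner_up = v;
        }
    }
    Some((best, logits[best] - runner_up))
}

/// True when the step's argmax beats the runner-up by at least `min_margin`.
/// The threshold is an assumption to tune per model, not a documented value.
fn is_argmax_stable(logits: &[f32], min_margin: f32) -> bool {
    matches!(argmax_margin(logits), Some((_, m)) if m >= min_margin)
}
```

A step whose margin is close to zero is a candidate for an argmax flip between two correct engines, so a prompt whose trajectory contains such a step makes a poor regression pin.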

Observability

Each decode_one call should emit a grafos_observe::Event::Decoded record with {step, emitted_token, argmax_logit}. Pair the events from both engines by step to surface divergence the moment it appears, not at the end of the loop.
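
Pairing events by step can be sketched with a small buffer keyed on the step number. Everything here is an assumption made for illustration: Decoded is a hypothetical stand-in for the payload of grafos_observe::Event::Decoded (whose real shape is not shown in this recipe), and Side and StepPairer are invented names.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `Event::Decoded` payload.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Decoded {
    step: usize,
    emitted_token: u32,
    argmax_logit: f32,
}

/// Which engine produced an event (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side {
    FabricCuda,
    LlamaCpp,
}

/// Buffers the first event seen for each step until the other engine reports.
#[derive(Default)]
struct StepPairer {
    pending: HashMap<usize, (Side, Decoded)>,
}

impl StepPairer {
    /// Feed one event. Returns `Some((fabric, llama))` as soon as both engines
    /// have reported the same step with different tokens, `None` otherwise.
    fn observe(&mut self, side: Side, ev: Decoded) -> Option<(Decoded, Decoded)> {
        match self.pending.remove(&ev.step) {
            Some((other_side, other)) if other_side != side => {
                let (fabric, llama) = if side == Side::FabricCuda {
                    (ev, other)
                } else {
                    (other, ev)
                };
                (fabric.emitted_token != llama.emitted_token).then_some((fabric, llama))
            }
            // First report for this step (or a repeat from the same side):
            // remember the latest event and wait for the other engine.
            _ => {
                self.pending.insert(ev.step, (side, ev));
                None
            }
        }
    }
}
```

Because observe returns the mismatching pair on the call that completes it, the caller can stop decoding at the first divergent step instead of comparing whole sequences afterwards.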

Variations

  • Different reference engine. Replace LlamaCppEngine with vLLM, HuggingFace transformers, or another implementation. The pin shape doesn’t change; only the comparator does.
  • Different decode length. Increase n_steps to validate longer trajectories. Divergences often surface only at later steps.
  • Different model. The pin parameterizes over GGUF path — swap Qwen 0.5B for Qwen 1.5B, Llama 3.2 3B, or any compatible GGUF. Each model has its own argmax-stable trajectory length.
  • Per-layer activation diff. When step-N divergence occurs, layer-bisect by extracting hidden activations from both engines and asserting layer-wise equality. Slice per-layer activation comparison is the archetype for this debugging mode.
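
A layer bisect reduces to finding the first layer whose activations disagree. The sketch below assumes each engine can dump one Vec<f32> of hidden activations per layer, which is not part of the trait surface shown in this recipe; max_abs_diff and first_divergent_layer are hypothetical names. It compares with a tolerance rather than strict equality, since f32 activations from two correct engines may differ in the last bits; pass tol = 0.0 to require exact equality.

```rust
/// Largest absolute element-wise difference between two activation vectors.
/// Returns infinity when the lengths differ or any difference is NaN, so
/// that shape mismatches and NaNs always register as divergence.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return f32::INFINITY;
    }
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0_f32, |acc, d| if d.is_nan() { f32::INFINITY } else { acc.max(d) })
}

/// Index of the first layer whose activations differ by more than `tol`,
/// or `None` if every compared layer is within tolerance.
fn first_divergent_layer(
    fabric: &[Vec<f32>],
    reference: &[Vec<f32>],
    tol: f32,
) -> Option<usize> {
    fabric
        .iter()
        .zip(reference)
        .position(|(f, r)| max_abs_diff(f, r) > tol)
}
```

The first divergent layer is where to start reading kernels: everything before it matched the reference, so the bug (or the rounding drift) entered at that layer's inputs or its own computation.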

Why This Is Recipe 61

This is the foundational verification recipe for fabric-leased inference. Recipes 62 (memory reduction), 63 (mid-decode revoke), 64 (FENCED state), 65 (speculative decode), and 66 (multi-tenant batched decode) all assume the engine produces correct output. Without this pin green, every other claim about the engine is empty.