Correctness & memory

Engine output equivalence, memory-efficient loading, multi-engine composition.

When to read this sub-group

You are running an inference engine and need to show two things: that its output is the right answer, and that it gets there without using more device memory than it has to. These recipes are about measurement: equivalence pins against canonical references, and memory accounting against the on-disk model size.

Suggested order

  1. Verifying Engine Output Against a Canonical Reference — the foundational equivalence pin against llama.cpp. Run this before trusting any other claim about your engine.
  2. Loading an LLM Without f32-at-Load Memory Blowup — quant-in-matmul. Cuts device-memory footprint to less than the on-disk GGUF size.
  3. Composing Two Fabric-Leased Engines for Speculative Decode — multi-engine composition with a correctness contract (target-only greedy parity) and a throughput target (acceptance rate).
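To make the contract in recipe 3 concrete, here is a minimal sketch of what "target-only greedy parity" and "acceptance rate" measure. It is not the recipe's implementation. The names `greedy_decode`, `speculative_greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, the two "engines" are plain Python functions over integer tokens, and fabric leasing is not modelled at all. It also assumes greedy sampling and verifies draft tokens one at a time, where a real engine would verify them in one batched target pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an engine: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference path: the target engine decodes alone, one token at a time."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], float]:
    """Draft proposes up to k tokens; target keeps the longest matching prefix.

    Returns the new tokens and the acceptance rate (accepted / proposed).
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    proposed = accepted = 0
    while len(tokens) < goal:
        # Draft phase: extend a scratch copy of the context with draft guesses.
        ctx = list(tokens)
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            guess = draft(ctx)
            proposal.append(guess)
            ctx.append(guess)
        proposed += len(proposal)
        # Verify phase: every token that is kept is the target's own greedy choice.
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if guess != expected:
                break  # discard the rest of the proposal after the first mismatch
            accepted += 1
    rate = accepted / proposed if proposed else 0.0
    return tokens[len(prompt):], rate


def toy_target(tokens: List[int]) -> int:
    # Arbitrary deterministic rule standing in for the target model.
    return (tokens[-1] * 31 + len(tokens)) % 97


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except at every fifth context length.
    guess = toy_target(tokens)
    return (guess + 1) % 97 if len(tokens) % 5 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy_decode(toy_target, prompt, 32)
    output, rate = speculative_greedy_decode(toy_target, toy_draft, prompt, 32, k=4)
    assert output == reference  # parity: the draft must never change the output
    print(f"acceptance rate: {rate:.2f}")
```

The point of the sketch is that the two measurements are independent. Parity holds for any draft, because only target-chosen tokens are ever appended; a poor draft lowers the acceptance rate and therefore throughput, but it should never change the output.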

What’s not here

For sharing inference between tenants, see shared inference. For recovery from mid-decode lease revocation, see revocation and recovery.