Recipe 62: Loading an LLM Without f32-at-Load Memory Blowup

Situation

You have a quantized GGUF file on disk — say a 491 MiB Qwen 2.5 0.5B Q4_K_M. The default GGUF loader path materializes a full f32 mirror of every quantized tensor at load time. Now your 491 MiB on-disk model occupies ~1.8 GiB of device memory. The on-disk quantization work is wasted; you’ve burned the encoding back into f32 before the GPU sees its first matmul.

The fabric approach: keep quantized tensors packed on device and dequantize per block inside the matmul/lookup kernels during the hot loop. The on-disk format becomes the device-resident format. The output tokens match the f32-dequant baseline, with ~4.85× less device memory (1830 MiB vs 377 MiB in the memory accounting under Verification).

What You Build

An engine load path that, under the quant-weights feature, routes every Q4_K / Q5_0 / Q6_K / Q8_0 tensor through a packed-load helper instead of the f32 dequant-at-load default. The engine forward dispatches to matmul kernels that take packed bytes as input and dequant per-block during the inner loop. The result is a model that holds its entire weight set in less device memory than the GGUF takes on disk.

Building Blocks

  • grafos_inference_engine::cuda_backend::CudaLlamaWeights — weight storage with packed-bytes pointers alongside f32 pointers.
  • grafos_tensor_kernels_cuda::matmul_dispatch::dispatch_matmul_f32_q*_b_kn_cuda — one dispatcher per quant format, all sharing the same (a, b_packed, c, m, k, n, ctx) signature.
  • grafos_tensor_kernels_cuda::embedding_lookup_q5_0_f32_cuda — the lookup-style variant for token_embd.
  • grafos_inference_engine::cuda_backend::set_disable_q*_packed_load — runtime toggles for A/B-testing packed vs f32 paths in one process (the load-bearing parity pin uses these).
  • LeaseRegion::CudaDevice — broker-owned wrapper, same for the packed-bytes buffer as for the f32 buffer.

Design

Resource Model

Each per-layer tensor (and the LM head + token_embd) has up to three candidate pointer slots on the engine’s weight struct:

pub attn_output: *mut f32, // f32-dequant-at-load
pub attn_output_packed: *mut u8, // Q4_K packed bytes
pub attn_output_packed_q5_0: *mut u8, // Q5_0 packed bytes

Exactly one of these slots is non-null for each tensor after load. The loader chooses based on the GGUF tensor’s actual dtype plus an alignment predicate on the matmul’s K dimension (e.g., Q4_K needs K % 256 == 0; Q5_0 needs K % 32 == 0). The forward path checks the packed pointers first and dispatches to the matching kernel; otherwise it falls through to the f32 dispatcher.
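The exactly-one-non-null invariant can be written down as a small check. The following is a minimal sketch, not the engine's code: `AttnOutputSlots` is a hypothetical, trimmed-down stand-in that reuses the three field names shown above, and `populated` / `is_well_formed` are illustrative helpers.

```rust
use std::ptr;

/// Hypothetical stand-in for one tensor's candidate slots on the weight struct.
pub struct AttnOutputSlots {
    pub attn_output: *mut f32,            // f32 dequant-at-load
    pub attn_output_packed: *mut u8,      // Q4_K packed bytes
    pub attn_output_packed_q5_0: *mut u8, // Q5_0 packed bytes
}

impl AttnOutputSlots {
    /// All slots start null; the loader is expected to fill exactly one.
    pub fn empty() -> Self {
        Self {
            attn_output: ptr::null_mut(),
            attn_output_packed: ptr::null_mut(),
            attn_output_packed_q5_0: ptr::null_mut(),
        }
    }

    /// Number of slots that currently hold a non-null pointer.
    pub fn populated(&self) -> usize {
        [
            !self.attn_output.is_null(),
            !self.attn_output_packed.is_null(),
            !self.attn_output_packed_q5_0.is_null(),
        ]
        .iter()
        .filter(|&&set| set)
        .count()
    }

    /// The invariant the forward path's "check packed first" cascade relies on.
    pub fn is_well_formed(&self) -> bool {
        self.populated() == 1
    }
}
```

A check like this could run as a debug assertion after each tensor is loaded, so a double-populated slot shows up at load time rather than as extra device memory later.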

Decision Lattice

For each candidate weight tensor, the loader runs:

Q4_K + K%256==0 → load_q4_k_packed_to_device
Q6_K + K%256==0 → load_q6_k_packed_to_device
Q5_0 + K%32==0 → load_q5_0_packed_to_device
Q8_0 + K%32==0 → load_q8_0_packed_to_device
otherwise → load_one_tensor_to_device (f32 dequant-at-load)

The fallback path is preserved verbatim; new quant formats add branches but never remove the f32 safety net.
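As a sketch, the lattice above can be expressed as a pure function from (dtype, K) to a load path. The `GgufType` and `LoadPath` enums here are local stand-ins defined for illustration, and `choose_load_path` is a hypothetical name; the real loader calls the `load_*_to_device` helpers directly.

```rust
/// Local stand-in for the GGUF dtype enum (only the variants used here).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GgufType { F32, F16, Q4_K, Q5_0, Q6_K, Q8_0 }

/// Which loader the lattice selects (illustrative names).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LoadPath { PackedQ4K, PackedQ6K, PackedQ5_0, PackedQ8_0, F32Dequant }

/// Elements per block for the K-quant formats (Q4_K, Q6_K).
pub const K_QUANT_BLOCK: usize = 256;
/// Elements per block for Q5_0 and Q8_0.
pub const LEGACY_BLOCK: usize = 32;

/// Pick a load path from the tensor's dtype and the matmul K dimension.
/// Anything that is not both a known packed dtype and block-aligned
/// falls through to the f32 dequant-at-load path.
pub fn choose_load_path(dtype: GgufType, k: usize) -> LoadPath {
    match dtype {
        GgufType::Q4_K if k % K_QUANT_BLOCK == 0 => LoadPath::PackedQ4K,
        GgufType::Q6_K if k % K_QUANT_BLOCK == 0 => LoadPath::PackedQ6K,
        GgufType::Q5_0 if k % LEGACY_BLOCK == 0 => LoadPath::PackedQ5_0,
        GgufType::Q8_0 if k % LEGACY_BLOCK == 0 => LoadPath::PackedQ8_0,
        _ => LoadPath::F32Dequant,
    }
}
```

With K = 896, a Q5_0 tensor is eligible (896 % 32 == 0) while a Q4_K tensor would fall back (896 % 256 == 128); with K = 4864, Q6_K is eligible (4864 = 19 × 256).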

Isolation and Safety

  • Each packed-bytes buffer is wrapped in a LeaseRegion::CudaDevice identical to the f32 case: broker-owned, revocation-token-bearing, Drop calls cuda_free_bytes. The packed path inherits all of Recipe 63 / 64’s revocation semantics for free.
  • The dequant decode inside each matmul kernel is byte-equivalent to the standalone grafos_dequantize_q*_cuda kernel — same formulas, same accumulation order at the per-element level. The parity pin (see Verification) is the load-bearing check.
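The equivalence claim in the second bullet can be illustrated on the CPU. This is a simplified sketch under stated assumptions: `BlockQ8` is a Q8_0-like block that stores its scale as f32 (the real Q8_0 layout stores an f16 scale), and the function names are hypothetical. It shows that dequantizing per block inside the dot product gives bit-identical results to dequantizing first, provided the per-element formula and accumulation order are the same.

```rust
/// Elements per block in this simplified Q8_0-like format.
pub const BLOCK: usize = 32;

/// Simplified block: one scale plus 32 signed 8-bit quants.
/// Assumption: scale kept as f32 here to avoid an f16 dependency.
#[derive(Clone, Copy)]
pub struct BlockQ8 {
    pub scale: f32,
    pub quants: [i8; BLOCK],
}

/// Standalone dequant: expand every block into f32 values.
pub fn dequantize(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| b.scale * q as f32))
        .collect()
}

/// Baseline: dot product against an already-dequantized f32 row.
pub fn dot_dense(a: &[f32], w: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (x, y) in a.iter().zip(w) {
        acc += x * y;
    }
    acc
}

/// Fused: decode each weight inside the inner loop, never storing f32 weights.
pub fn dot_fused(a: &[f32], blocks: &[BlockQ8]) -> f32 {
    let mut acc = 0.0f32;
    for (bi, b) in blocks.iter().enumerate() {
        for (j, &q) in b.quants.iter().enumerate() {
            let w = b.scale * q as f32; // same formula as `dequantize`
            acc += a[bi * BLOCK + j] * w; // same element order as `dot_dense`
        }
    }
    acc
}
```

Comparing `dot_fused(a, blocks).to_bits()` with `dot_dense(a, &dequantize(blocks)).to_bits()` is the CPU analogue of the parity pin: any change to the formula or the accumulation order would show up as a bit difference.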

Walkthrough (Implementation Sketch)

1. Loader Branch at the Tensor Site

#[cfg(feature = "quant-weights")]
{
    // `name` is this tensor's GGUF name, e.g. "blk.0.attn_output.weight".
    // Look up its on-disk dtype in the GGUF tensor descriptors.
    let dtype = descriptors
        .iter()
        .find(|d| d.name == name)
        .map(|d| d.dtype);
    // Packed Q5_0 needs: toggle not disabled, matching dtype, K block-aligned.
    let q5_0_eligible = !disable_q5_0_packed_load()
        && dtype == Some(GgufType::Q5_0)
        && q_dim % Q5_0_BLOCK_ELEMENTS == 0;
    if q5_0_eligible {
        load_q5_0_packed_to_device(
            &mut loader,
            &name,
            q_dim * h,
            &mut weights.layers[layer_idx].attn_output_packed_q5_0,
            &mut weight_regions,
        )?;
    } else {
        load_one_tensor_to_device(/* f32 fallback */)?;
    }
}

2. Forward Dispatches to the Matching Kernel

if !layer.attn_output_packed_q5_0.is_null() {
    dispatch_matmul_f32_q5_0_b_kn_cuda(
        attn_out.as_ptr(),
        layer.attn_output_packed_q5_0,
        projected.as_mut_ptr(),
        seq_len, q_dim, h, ctx,
    )?;
} else if !layer.attn_output_packed.is_null() {
    dispatch_matmul_f32_q4k_b_kn_cuda(/* Q4_K branch */)?;
} else {
    dispatch_matmul_f32_b_kn_cuda(/* f32 fallback */)?;
}

3. Parity Pin: Load Twice, Compare

let tokens_f32 = run_with_loader_mode(&bytes, &prompt, /*disable_packed=*/ true);
let tokens_packed = run_with_loader_mode(&bytes, &prompt, /*disable_packed=*/ false);
// zip() stops at the shorter sequence, so pin the lengths before comparing.
assert_eq!(tokens_f32.len(), tokens_packed.len(), "token count differs");
let hamming = tokens_f32.iter().zip(&tokens_packed).filter(|(a, b)| a != b).count();
assert_eq!(hamming, 0, "packed != f32: {} divergent tokens", hamming);

Verification (Silicon)

On AWS L4 (i-0ec3ef3ac28d68235, 2026-05-22, commit 7af52b85):

f32 tokens: [358, 1079, 4460, 311, 1855]
packed tokens: [358, 1079, 4460, 311, 1855]
hamming distance: 0 / 5
STEP 5 EXIT: 0

Memory accounting (Qwen 2.5 0.5B Q4_K_M, hidden=896, intermediate=4864, 24 layers, vocab=151936):

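| Tensor | GGUF dtype | f32 (MiB) | Packed (MiB) | Saved (MiB) |
| --- | --- | --- | --- | --- |
| token_embd | Q5_0 | 519 | 89 | 430 |
| output (LM head) | Q8_0 | 519 | 138 | 381 |
| attn_v × 24 | Q8_0 | 10.5 | 2.8 | 7.7 |
| attn_output × 24 | Q5_0 | 73.5 | 12.6 | 60.9 |
| attn_k × 24 | Q5_0 | 10.5 | 1.8 | 8.7 |
| ffn_gate × 24 | Q5_0 | 315 | 54 | 261 |
| ffn_up × 24 | Q5_0 | 315 | 54 | 261 |
| ffn_down × 24 | Q6_K mix | 67 | 25 | 42 |
| Total | | 1830 | 377 | ~1453 |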

GGUF on disk: 491 MiB. Packed engine footprint: 377 MiB. The engine holds less device memory than the model takes on disk — the entire on-disk quantization work pays back at inference time.
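The first two table rows can be reproduced from the tensor shape and the block layouts. The sketch below assumes the standard ggml block sizes (Q4_K: 144 bytes per 256 elements, Q5_0: 22 per 32, Q6_K: 210 per 256, Q8_0: 34 per 32); the function names are hypothetical and not part of the engine.

```rust
/// (bytes per block, elements per block) for each packed format,
/// following the standard ggml block layouts.
pub fn block_layout(fmt: &str) -> Option<(u64, u64)> {
    match fmt {
        "Q4_K" => Some((144, 256)),
        "Q5_0" => Some((22, 32)),
        "Q6_K" => Some((210, 256)),
        "Q8_0" => Some((34, 32)),
        _ => None,
    }
}

/// Device bytes for a tensor kept packed; None if the format is unknown
/// or the element count is not a whole number of blocks.
pub fn packed_bytes(fmt: &str, elements: u64) -> Option<u64> {
    let (bytes_per_block, elems_per_block) = block_layout(fmt)?;
    if elements % elems_per_block != 0 {
        return None;
    }
    Some(elements / elems_per_block * bytes_per_block)
}

/// Device bytes for the same tensor dequantized to f32 at load.
pub fn f32_bytes(elements: u64) -> u64 {
    elements * 4
}

/// Convert bytes to MiB for reporting.
pub fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For token_embd and the LM head, `elements` is vocab × hidden = 151936 × 896. That gives about 519 MiB as f32, about 89 MiB as Q5_0, and about 138 MiB as Q8_0, matching the table.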

Run the parity pin yourself:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
cargo test --release \
-p grafos-inference-engine \
--features cuda,quant-weights,test-helpers \
--test cuda_engine_e2e \
-- --ignored --nocapture --test-threads=1 \
qwen_cuda_quant_weights_matches_f32

Failure Modes

  • Block-alignment mismatch. Q4_K requires K % 256 == 0; Q5_0 requires K % 32 == 0; mixing them up corrupts the dequant walk. The loader logs an “ineligible” warning with the modulus and falls back to f32 — that warning is your debugging starting point.
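  • OnceLock’d warnings hide the second-run case. The loader’s one-time stderr eprintln! fires once per static symbol for the life of the process. When the parity pin disables packed loading on run 1 and enables it on run 2, any warnings fire during run 1 and run 2 engages the packed path silently. A Hamming distance of 0 shows that the two runs agree, but it would also be 0 if run 2 had fallen back to f32, and the absence of warnings on run 2 is consistent with the packed path without being a positive signal on its own. To confirm that run 2 took the packed path, check the per-tensor LoadedPackedTensor events (see Observability).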
  • Mutual-exclusion pointer aliasing. Forgetting to set the f32 pointer to NULL after populating the packed pointer leaves both live. The forward path’s “check packed first” cascade returns the packed result, but Drop frees both — harmless but wastes device memory. The loader guarantees exactly-one-non-null via the else branches.
  • New quant format added to GGUF, no loader branch. GGUF tensor reads succeed (dtype is parsed) but the loader hits the fallback with a “not packed-eligible” message. Diagnose by inspecting the GGUF dtype per tensor; add a new packed loader if the format matters for your model.
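The warn-once behavior behind the second failure mode can be reproduced with `std::sync::OnceLock`. This is a minimal sketch: the static name, the message text, and the `WARNINGS_EMITTED` counter are hypothetical and exist only to make the effect observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// One guard per warning site; it is initialized at most once per process.
static Q5_0_INELIGIBLE_WARNED: OnceLock<()> = OnceLock::new();

/// Counter used only so the sketch can observe how often the warning fired.
pub static WARNINGS_EMITTED: AtomicUsize = AtomicUsize::new(0);

/// Emit the "ineligible" warning the first time this is called, then stay silent.
pub fn warn_q5_0_ineligible_once(k: usize) {
    Q5_0_INELIGIBLE_WARNED.get_or_init(|| {
        eprintln!("q5_0 packed load ineligible: K % 32 = {}", k % 32);
        WARNINGS_EMITTED.fetch_add(1, Ordering::Relaxed);
    });
}
```

Calling the function for two different tensors, or in two loads within one process, prints a single line. A quiet second load therefore says nothing about which path it took.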

Observability

Each packed-load emits a per-tensor LoadedPackedTensor event through grafos_observe: {layer, slot, dtype, packed_bytes, saved_vs_f32_bytes}. The aggregate “device weight footprint” is the sum of these events plus the surviving f32 tensors. Operators running production fleets care about this aggregate; it determines how many models fit per GPU.
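The aggregate described above can be sketched as a fold over the events. The struct below mirrors the five field names from the event, but the field types, the `Option` on `layer` (for non-layer tensors such as token_embd), and both helper functions are assumptions made for illustration; the `grafos_observe` API itself is not shown.

```rust
/// Illustrative mirror of the per-tensor event payload.
/// Field names follow the text; types are assumed.
#[derive(Debug, Clone)]
pub struct LoadedPackedTensor {
    pub layer: Option<usize>, // None for non-layer tensors (assumption)
    pub slot: String,
    pub dtype: String,
    pub packed_bytes: u64,
    pub saved_vs_f32_bytes: u64,
}

/// Device weight footprint: packed bytes from all events plus whatever
/// tensors stayed on the f32 dequant-at-load path.
pub fn device_weight_footprint(events: &[LoadedPackedTensor], surviving_f32_bytes: u64) -> u64 {
    events.iter().map(|e| e.packed_bytes).sum::<u64>() + surviving_f32_bytes
}

/// Total bytes saved relative to loading every packed tensor as f32.
pub fn total_saved_bytes(events: &[LoadedPackedTensor]) -> u64 {
    events.iter().map(|e| e.saved_vs_f32_bytes).sum()
}
```

An empty event list with a non-zero footprint is also a useful signal: it means every tensor took the f32 fallback.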

Variations

  • Different model. The decision lattice is per-tensor based on the GGUF’s actual dtype, so Llama 3.2, Mistral, and Phi families work the same way. Each model has its own quant mix; the lattice picks the best packed format available.
  • Heterogeneous quant policy per tenant. Different tenants on the same engine could request different quant-relaxation levels. The disable_q*_packed_load toggles flip per-thread; per-tenant overrides slot in here.
  • Tensor-core dequant matmul (TF32 WMMA). The Q4_K and Q6_K matmul dispatchers have TF32 WMMA arms for prefill shapes (M >= 16); enabling them for Q5_0 and Q8_0 is a follow-up. The parity pin is invariant to the kernel arm chosen — it tests output equivalence, not throughput.
  • No quant-weights feature. Build without --features quant-weights to verify the f32 fallback path is the entire loader behavior (i.e., the feature-gate isolation holds). Every quant variant preserves this build configuration.
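For the per-tenant variation, a per-thread toggle could look like the following. This is a sketch of one possible shape using `thread_local!`; it borrows the `set_disable_q5_0_packed_load` / `disable_q5_0_packed_load` names from the recipe, but the engine's actual implementation is not shown here and may differ.

```rust
use std::cell::Cell;

thread_local! {
    // Each thread gets its own flag; the default keeps packed loading enabled.
    static DISABLE_Q5_0_PACKED_LOAD: Cell<bool> = Cell::new(false);
}

/// Flip the toggle for the calling thread only.
pub fn set_disable_q5_0_packed_load(disabled: bool) {
    DISABLE_Q5_0_PACKED_LOAD.with(|flag| flag.set(disabled));
}

/// Read the calling thread's toggle; the loader would consult this
/// when evaluating Q5_0 packed-load eligibility.
pub fn disable_q5_0_packed_load() -> bool {
    DISABLE_Q5_0_PACKED_LOAD.with(|flag| flag.get())
}
```

Under this shape, a tenant-scoped loader thread can force the f32 path without affecting loads on other threads. The trade-off is that a load handed off to a different thread sees that thread's flag instead.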

Why This Is Recipe 62

Recipe 61 verifies the engine produces correct output. Recipe 62 demonstrates it produces correct output WHILE using less device memory than the model takes on disk. Together they unlock the density story for fabric-leased inference: more tenants per GPU, larger models per GPU, smaller blast radius per failure.