Recipe 62: Loading an LLM Without f32-at-Load Memory Blowup
Situation
You have a quantized GGUF file on disk — say a 491 MiB Qwen 2.5 0.5B Q4_K_M. The default GGUF loader path materializes a full f32 mirror of every quantized tensor at load time. Now your 491 MiB on-disk model occupies ~1.8 GiB of device memory. The on-disk quantization work is wasted; you’ve burned the encoding back into f32 before the GPU sees its first matmul.
The fabric approach: keep quantized tensors PACKED on device, dequant inside the matmul/lookup kernels per-block during the hot loop. The on-disk format becomes the device-resident format. Same output tokens as the f32-dequant baseline; ~4.85× less device memory.
What You Build
An engine load path that, under the quant-weights feature, routes
every Q4_K / Q5_0 / Q6_K / Q8_0 tensor through a packed-load helper
instead of the f32 dequant-at-load default. The engine forward
dispatches to matmul kernels that take packed bytes as input and
dequant per-block during the inner loop. The result is a model that
holds its entire weight set in less device memory than the GGUF
takes on disk.
Building Blocks
- grafos_inference_engine::cuda_backend::CudaLlamaWeights: weight storage with packed-bytes pointers alongside f32 pointers.
- grafos_tensor_kernels_cuda::matmul_dispatch::dispatch_matmul_f32_q*_b_kn_cuda: one dispatcher per quant format, all sharing the same (a, b_packed, c, m, k, n, ctx) signature.
- grafos_tensor_kernels_cuda::embedding_lookup_q5_0_f32_cuda: the lookup-style variant for token_embd.
- grafos_inference_engine::cuda_backend::set_disable_q*_packed_load: runtime toggles for A/B-testing packed vs f32 paths in one process (the load-bearing parity pin uses these).
- LeaseRegion::CudaDevice: broker-owned wrapper, same for the packed-bytes buffer as for the f32 buffer.
Design
Resource Model
Each per-layer tensor (and the LM head + token_embd) has up to three candidate pointer slots on the engine’s weight struct:

    pub attn_output: *mut f32,            // f32-dequant-at-load
    pub attn_output_packed: *mut u8,      // Q4_K packed bytes
    pub attn_output_packed_q5_0: *mut u8, // Q5_0 packed bytes

Exactly one is non-null per layer after load. The loader chooses
based on the GGUF tensor’s actual dtype + an alignment predicate
(e.g., Q4_K needs K % 256 == 0; Q5_0 needs K % 32 == 0). The
forward path checks the packed pointers first and dispatches to the
matching kernel; otherwise it falls through to the f32 dispatcher.
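The "exactly one non-null slot" rule can be checked mechanically. The following is a minimal sketch, assuming a trimmed-down stand-in struct (`AttnOutputSlots`) and a hypothetical `active_slot` helper; neither is part of the engine's API, and the real weight struct carries many more fields.

```rust
use std::ptr;

/// Hypothetical stand-in for one tensor's three candidate slots.
pub struct AttnOutputSlots {
    pub attn_output: *mut f32,            // f32 dequant-at-load
    pub attn_output_packed: *mut u8,      // Q4_K packed bytes
    pub attn_output_packed_q5_0: *mut u8, // Q5_0 packed bytes
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveSlot {
    F32,
    Q4K,
    Q5_0,
}

impl AttnOutputSlots {
    /// All slots null, as before the loader has run.
    pub fn empty() -> Self {
        Self {
            attn_output: ptr::null_mut(),
            attn_output_packed: ptr::null_mut(),
            attn_output_packed_q5_0: ptr::null_mut(),
        }
    }

    /// Returns the single populated slot, or None when zero or
    /// more than one slot is non-null (an invariant violation).
    pub fn active_slot(&self) -> Option<ActiveSlot> {
        let candidates = [
            (!self.attn_output_packed_q5_0.is_null(), ActiveSlot::Q5_0),
            (!self.attn_output_packed.is_null(), ActiveSlot::Q4K),
            (!self.attn_output.is_null(), ActiveSlot::F32),
        ];
        let mut live = candidates.iter().filter(|(set, _)| *set).map(|(_, s)| *s);
        match (live.next(), live.next()) {
            (Some(slot), None) => Some(slot),
            _ => None,
        }
    }
}
```

A loader test could call `active_slot()` on every tensor after load and fail on `None`, which would catch both the "nothing loaded" and the "two slots live" cases.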
Decision Lattice
For each candidate weight tensor, the loader runs:

    Q4_K + K%256==0 → load_q4_k_packed_to_device
    Q6_K + K%256==0 → load_q6_k_packed_to_device
    Q5_0 + K%32==0  → load_q5_0_packed_to_device
    Q8_0 + K%32==0  → load_q8_0_packed_to_device
    otherwise       → load_one_tensor_to_device (f32 dequant-at-load)

The fallback path is preserved verbatim; new quant formats add branches but never remove the f32 safety net.
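The lattice above is a pure function of dtype and K, so it can be sketched and unit-tested without a GPU. This is an illustrative sketch only: the `GgufType` and `LoadPath` enums and the `choose_load_path` function below are stand-ins defined for the example, not the engine's own types.

```rust
/// Stand-in for the GGUF dtype tag (the engine has its own GgufType).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GgufType {
    F32,
    Q4K,
    Q5_0,
    Q6K,
    Q8_0,
}

/// Which loader the lattice selects for a tensor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadPath {
    Q4KPacked,
    Q6KPacked,
    Q5_0Packed,
    Q8_0Packed,
    F32Dequant,
}

/// K-quant formats (Q4_K, Q6_K) use 256-element super-blocks.
pub const K_QUANT_BLOCK: usize = 256;
/// Q5_0 and Q8_0 use 32-element blocks.
pub const LEGACY_BLOCK: usize = 32;

/// Mirrors the decision lattice: a packed path is chosen only when
/// the dtype has a packed loader AND K is block-aligned; everything
/// else falls through to f32 dequant-at-load.
pub fn choose_load_path(dtype: GgufType, k: usize) -> LoadPath {
    match dtype {
        GgufType::Q4K if k % K_QUANT_BLOCK == 0 => LoadPath::Q4KPacked,
        GgufType::Q6K if k % K_QUANT_BLOCK == 0 => LoadPath::Q6KPacked,
        GgufType::Q5_0 if k % LEGACY_BLOCK == 0 => LoadPath::Q5_0Packed,
        GgufType::Q8_0 if k % LEGACY_BLOCK == 0 => LoadPath::Q8_0Packed,
        _ => LoadPath::F32Dequant,
    }
}
```

Note that with K = 896, a Q4_K tensor would take the f32 fallback (896 % 256 = 128), while a Q5_0 tensor of the same width is packed-eligible (896 % 32 = 0).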
Isolation and Safety
- Each packed-bytes buffer is wrapped in a LeaseRegion::CudaDevice identical to the f32 case: broker-owned, revocation-token-bearing, Drop calls cuda_free_bytes. The packed path inherits all of Recipe 63 / 64’s revocation semantics for free.
- The dequant decode inside each matmul kernel is byte-equivalent to the standalone grafos_dequantize_q*_cuda kernel: same formulas, same accumulation order at the per-element level. The parity pin (see Verification) is the load-bearing check.
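To make the byte-equivalence claim concrete, here is a CPU-only sketch for Q8_0, assuming the GGML block layout (a little-endian f16 scale followed by 32 signed 8-bit quants, 34 bytes per block). The function names are hypothetical and this is not the CUDA kernel; it only illustrates why decoding inside the inner loop can match dequantize-then-multiply bit for bit when the per-element formula and accumulation order are the same.

```rust
pub const Q8_0_BLOCK_ELEMENTS: usize = 32;
pub const Q8_0_BLOCK_BYTES: usize = 34; // 2-byte f16 scale + 32 x i8

/// Decode IEEE 754 half-precision bits to f32.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal half: value = frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// Standalone dequant: materializes the full f32 mirror.
pub fn dequantize_q8_0(packed: &[u8]) -> Vec<f32> {
    packed
        .chunks_exact(Q8_0_BLOCK_BYTES)
        .flat_map(|block| {
            let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
            block[2..].iter().map(move |&q| d * (q as i8) as f32)
        })
        .collect()
}

/// Baseline: dot product against the f32 mirror.
pub fn dot_dequant_first(a: &[f32], packed: &[u8]) -> f32 {
    let w = dequantize_q8_0(packed);
    a.iter().zip(&w).fold(0.0f32, |acc, (x, w)| acc + x * w)
}

/// Fused: decode each block inside the loop, never storing f32 weights.
pub fn dot_f32_q8_0(a: &[f32], packed: &[u8]) -> f32 {
    assert_eq!(a.len() % Q8_0_BLOCK_ELEMENTS, 0);
    assert_eq!(packed.len(), a.len() / Q8_0_BLOCK_ELEMENTS * Q8_0_BLOCK_BYTES);
    let mut acc = 0.0f32;
    for (a_blk, block) in a
        .chunks_exact(Q8_0_BLOCK_ELEMENTS)
        .zip(packed.chunks_exact(Q8_0_BLOCK_BYTES))
    {
        let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
        for (&x, &q) in a_blk.iter().zip(&block[2..]) {
            acc += x * (d * (q as i8) as f32); // same per-element formula
        }
    }
    acc
}
```

Comparing `dot_f32_q8_0(..).to_bits()` against `dot_dequant_first(..).to_bits()` is the CPU analogue of the parity pin: any change to the formula or the summation order would show up as a bit difference.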
Walkthrough (Implementation Sketch)
1. Loader Branch at the Tensor Site

    #[cfg(feature = "quant-weights")]
    {
        let dtype = descriptors
            .iter()
            .find(|d| d.name == "blk.0.attn_output.weight")
            .map(|d| d.dtype);
        let q5_0_eligible = !disable_q5_0_packed_load()
            && dtype == Some(GgufType::Q5_0)
            && q_dim % Q5_0_BLOCK_ELEMENTS == 0;
        if q5_0_eligible {
            load_q5_0_packed_to_device(
                &mut loader,
                &name,
                q_dim * h,
                &mut weights.layers[layer_idx].attn_output_packed_q5_0,
                &mut weight_regions,
            )?;
        } else {
            load_one_tensor_to_device(/* f32 fallback */)?;
        }
    }

2. Forward Dispatches to the Matching Kernel

    if !layer.attn_output_packed_q5_0.is_null() {
        dispatch_matmul_f32_q5_0_b_kn_cuda(
            attn_out.as_ptr(),
            layer.attn_output_packed_q5_0,
            projected.as_mut_ptr(),
            seq_len,
            q_dim,
            h,
            ctx,
        )?;
    } else if !layer.attn_output_packed.is_null() {
        dispatch_matmul_f32_q4k_b_kn_cuda(/* Q4_K branch */)?;
    } else {
        dispatch_matmul_f32_b_kn_cuda(/* f32 fallback */)?;
    }

3. Parity Pin: Load Twice, Compare

    let tokens_f32 = run_with_loader_mode(&bytes, &prompt, /*disable_packed=*/true);
    let tokens_packed = run_with_loader_mode(&bytes, &prompt, /*disable_packed=*/false);
    let hamming = tokens_f32.iter().zip(&tokens_packed).filter(|(a, b)| a != b).count();
    assert_eq!(hamming, 0, "packed != f32: {} divergent tokens", hamming);

Verification (Silicon)
On AWS L4 (i-0ec3ef3ac28d68235, 2026-05-22, commit 7af52b85):

    f32 tokens: [358, 1079, 4460, 311, 1855]
    packed tokens: [358, 1079, 4460, 311, 1855]
    hamming distance: 0 / 5
    STEP 5 EXIT: 0

Memory accounting (Qwen 2.5 0.5B Q4_K_M, hidden=896, intermediate=4864, 24 layers, vocab=151936):
| Tensor | GGUF dtype | f32 (MiB) | Packed (MiB) | Saved (MiB) |
|---|---|---|---|---|
| token_embd | Q5_0 | 519 | 89 | 430 |
| output (LM head) | Q8_0 | 519 | 138 | 381 |
| attn_v × 24 | Q8_0 | 10.5 | 2.8 | 7.7 |
| attn_output × 24 | Q5_0 | 73.5 | 12.6 | 60.9 |
| attn_k × 24 | Q5_0 | 10.5 | 1.8 | 8.7 |
| ffn_gate × 24 | Q5_0 | 315 | 54 | 261 |
| ffn_up × 24 | Q5_0 | 315 | 54 | 261 |
| ffn_down × 24 | Q6_K mix | 67 | 25 | 42 |
| Total | | 1830 | 377 | ~1453 |
GGUF on disk: 491 MiB. Packed engine footprint: 377 MiB. The engine holds less device memory than the model takes on disk — the entire on-disk quantization work pays back at inference time.
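The per-tensor figures follow from bytes-per-block arithmetic. As a rough sketch, assuming the standard GGML block sizes (Q5_0: 22 bytes per 32 elements, Q8_0: 34 per 32, Q4_K: 144 per 256, Q6_K: 210 per 256), the token_embd and LM head rows can be reproduced as follows. The helper names are hypothetical.

```rust
/// (elements per block, bytes per block) for the formats in this recipe.
/// Block sizes are the standard GGML layouts; treat them as assumptions
/// if your GGUF producer differs.
pub fn block_layout(fmt: &str) -> Option<(u64, u64)> {
    match fmt {
        "F32" => Some((1, 4)),
        "Q4_K" => Some((256, 144)),
        "Q5_0" => Some((32, 22)),
        "Q6_K" => Some((256, 210)),
        "Q8_0" => Some((32, 34)),
        _ => None,
    }
}

/// Device bytes for a tensor of `elements` values stored as `fmt`.
/// Returns None for unknown formats or non-block-aligned sizes.
pub fn tensor_bytes(fmt: &str, elements: u64) -> Option<u64> {
    let (per_block, bytes_per_block) = block_layout(fmt)?;
    if elements % per_block != 0 {
        return None;
    }
    Some(elements / per_block * bytes_per_block)
}

/// Convert bytes to MiB for comparison with the table.
pub fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

For token_embd (151936 × 896 elements), `tensor_bytes("F32", n)` gives 544,538,624 bytes (about 519 MiB) and `tensor_bytes("Q5_0", n)` gives 93,592,576 bytes (about 89 MiB), matching the first table row; the same element count as Q8_0 gives about 138 MiB, matching the LM head row.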
Run the parity pin yourself:

    FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    cargo test --release \
      -p grafos-inference-engine \
      --features cuda,quant-weights,test-helpers \
      --test cuda_engine_e2e \
      -- --ignored --nocapture --test-threads=1 \
      qwen_cuda_quant_weights_matches_f32

Failure Modes
- Block-alignment mismatch. Q4_K requires K % 256 == 0; Q5_0 requires K % 32 == 0; mixing them up corrupts the dequant walk. The loader logs an “ineligible” warning with the modulus and falls back to f32; that warning is your debugging starting point.
- OnceLock’d warnings hide the second-run case. The loader’s one-time stderr eprintln! fires per-static-symbol. When the parity pin disables packed loading on run 1 and enables it on run 2, the run-1 warnings fire and run 2 silently engages the packed path. A Hamming distance of 0 shows that the two runs agree, but it would also be 0 if run 2 had fallen back to f32, so neither it nor the absence of warnings on run 2 is a positive signal on its own. Confirm that run 2 took the packed path with the LoadedPackedTensor events (see Observability).
- Mutual-exclusion pointer aliasing. Forgetting to set the f32 pointer to NULL after populating the packed pointer leaves both live. The forward path’s “check packed first” cascade returns the packed result, but Drop frees both: harmless for correctness, but it wastes device memory. The loader guarantees exactly-one-non-null via the else branches.
- New quant format added to GGUF, no loader branch. GGUF tensor reads succeed (dtype is parsed) but the loader hits the fallback with a “not packed-eligible” message. Diagnose by inspecting the GGUF dtype per tensor; add a new packed loader if the format matters for your model.
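Because two f32 runs would also agree token for token, a parity check is more convincing when it additionally requires a positive sign that the second run loaded packed tensors, and when it rejects length mismatches that a plain zip would silently truncate. A minimal sketch, where `check_parity`, `ParityVerdict`, and the `packed_tensors_loaded` count (for example, the number of packed-load events observed in run 2) are all hypothetical:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ParityVerdict {
    /// Same length, zero divergent tokens, packed path confirmed.
    Match,
    /// Run 2 reported no packed tensors, so the comparison proves nothing.
    PackedPathNotTaken,
    /// zip() would hide this by truncating to the shorter sequence.
    LengthMismatch { f32_len: usize, packed_len: usize },
    /// Token sequences differ at `hamming` positions.
    Divergent { hamming: usize },
}

pub fn check_parity(
    tokens_f32: &[u32],
    tokens_packed: &[u32],
    packed_tensors_loaded: usize,
) -> ParityVerdict {
    if packed_tensors_loaded == 0 {
        return ParityVerdict::PackedPathNotTaken;
    }
    if tokens_f32.len() != tokens_packed.len() {
        return ParityVerdict::LengthMismatch {
            f32_len: tokens_f32.len(),
            packed_len: tokens_packed.len(),
        };
    }
    let hamming = tokens_f32
        .iter()
        .zip(tokens_packed)
        .filter(|(a, b)| a != b)
        .count();
    if hamming == 0 {
        ParityVerdict::Match
    } else {
        ParityVerdict::Divergent { hamming }
    }
}
```

A test would then assert on `ParityVerdict::Match` rather than on the Hamming distance alone.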
Observability
Each packed-load emits a per-tensor LoadedPackedTensor event
through grafos_observe: {layer, slot, dtype, packed_bytes, saved_vs_f32_bytes}. The aggregate “device weight footprint” is
the sum of these events plus the surviving f32 tensors. Operators
running production fleets care about this aggregate; it determines
how many models fit per GPU.
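The aggregate can be computed by folding over the events. The sketch below assumes a plain struct with the five fields named above; the field types, the `Footprint` struct, and `device_weight_footprint` are assumptions for illustration, and the byte count of surviving f32 tensors is taken as an input because the source does not say how it is reported.

```rust
/// Assumed shape of the per-tensor event; field names follow the text,
/// field types are guesses.
#[derive(Debug, Clone)]
pub struct LoadedPackedTensor {
    pub layer: Option<u32>, // None for non-layer tensors such as token_embd (assumption)
    pub slot: String,
    pub dtype: String,
    pub packed_bytes: u64,
    pub saved_vs_f32_bytes: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Footprint {
    pub packed_bytes: u64,
    pub surviving_f32_bytes: u64,
    pub total_bytes: u64,
    pub saved_bytes: u64,
}

/// Device weight footprint = packed bytes from all events
/// plus whatever stayed on the f32 fallback path.
pub fn device_weight_footprint(
    events: &[LoadedPackedTensor],
    surviving_f32_bytes: u64,
) -> Footprint {
    let packed_bytes: u64 = events.iter().map(|e| e.packed_bytes).sum();
    let saved_bytes: u64 = events.iter().map(|e| e.saved_vs_f32_bytes).sum();
    Footprint {
        packed_bytes,
        surviving_f32_bytes,
        total_bytes: packed_bytes + surviving_f32_bytes,
        saved_bytes,
    }
}
```

Dividing a GPU's usable memory by `total_bytes` (plus KV cache and activation headroom, which this sketch ignores) gives a first estimate of models per GPU.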
Variations
- Different model. The decision lattice is per-tensor based on the GGUF’s actual dtype, so Llama 3.2, Mistral, and Phi families work the same way. Each model has its own quant mix; the lattice picks the best packed format available.
- Heterogeneous quant policy per tenant. Different tenants on the same engine could request different quant-relaxation levels. The disable_q*_packed_load toggles flip per-thread; per-tenant overrides slot in here.
- Tensor-core dequant matmul (TF32 WMMA). The Q4_K and Q6_K matmul dispatchers have TF32 WMMA arms for prefill shapes (M >= 16); enabling them for Q5_0 and Q8_0 is a follow-up. The parity pin is invariant to the kernel arm chosen: it tests output equivalence, not throughput.
- No quant-weights feature. Build without --features quant-weights to verify the f32 fallback path is the entire loader behavior (i.e., the feature-gate isolation holds). Every quant variant preserves this build configuration.
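For the per-tenant variation, one way a per-thread toggle with scoped overrides could look is sketched below. This is not the engine's implementation: the `_sketch` function names, the `PackedLoadOverride` guard, and the use of `thread_local!` are assumptions chosen to illustrate the "flip per-thread" behavior described above.

```rust
use std::cell::Cell;

thread_local! {
    // One flag per thread; defaults to "packed load enabled".
    static DISABLE_Q5_0_PACKED_LOAD: Cell<bool> = Cell::new(false);
}

/// Hypothetical setter, modeled on the set_disable_q*_packed_load family.
pub fn set_disable_q5_0_packed_load_sketch(disabled: bool) {
    DISABLE_Q5_0_PACKED_LOAD.with(|flag| flag.set(disabled));
}

/// Hypothetical getter the loader branch would consult.
pub fn disable_q5_0_packed_load_sketch() -> bool {
    DISABLE_Q5_0_PACKED_LOAD.with(|flag| flag.get())
}

/// RAII guard: applies a tenant's override for the current thread
/// and restores the previous value when dropped.
pub struct PackedLoadOverride {
    previous: bool,
}

impl PackedLoadOverride {
    pub fn new(disabled: bool) -> Self {
        let previous = disable_q5_0_packed_load_sketch();
        set_disable_q5_0_packed_load_sketch(disabled);
        Self { previous }
    }
}

impl Drop for PackedLoadOverride {
    fn drop(&mut self) {
        set_disable_q5_0_packed_load_sketch(self.previous);
    }
}
```

Holding a `PackedLoadOverride` for the duration of one tenant's model load keeps the override from leaking into the next load on the same thread, and other threads never observe it.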
Why This Is Recipe 62
Recipe 61 verifies the engine produces correct output. Recipe 62 demonstrates it produces correct output WHILE using less device memory than the model takes on disk. Together they unlock the density story for fabric-leased inference: more tenants per GPU, larger models per GPU, smaller blast radius per failure.