Recipe 63: Handling Mid-Kernel Lease Revocation in a Decode Loop
Situation
A fabric resource isn’t really a fabric resource if the broker can only reclaim it between top-level RPCs. LLM forward passes run hundreds of kernel launches per token (matmul / RoPE / norm / FA / softmax / lookup, × N layers × decode-loop). If revocation only took effect between full forward passes, a hung tenant would block reclaim for the entire forward window — meaning worst-case latency to free is bounded by the slowest tenant’s slowest forward, not by the broker’s policy.
The pattern: every kernel-launch site polls the revocation token
before dispatching. The poll is a single Relaxed-ordered
AtomicBool load — cheap enough to inline at every launch site —
and returns a typed error if the broker has flipped the token. The
engine never starts a kernel against memory the broker considers
reclaimed; the worst-case bound is one kernel-launch latency (tens
of µs), not one forward pass.
What You Build
A decode loop that observes lease revocation at every kernel-launch
boundary, returning a typed CudaForwardError::Revoked to the
caller within tens of microseconds of the broker’s token flip — even
if the flip happens mid-token, between two adjacent matmul launches.
Recipe 63 covers the speed of the bail; Recipe 64 covers the
state after.
Building Blocks
- grafos_tensor_kernels_cuda::lease_region::LeaseRegion::CudaDevice: broker-owned device memory with a revocation token.
- grafos_inference_engine::cuda_forward::poll_revoke_or_return!: the inline macro at every kernel-launch site.
- LeaseRevocationView: a cheap-to-clone read-side handle the forward path holds for the duration of one decode call.
- CudaForwardError::Revoked: the typed error variant the macro returns when the token has been flipped.
Design
Resource Model
Each weight tensor sits behind a LeaseRegion::CudaDevice carrying
a RevocationToken. The broker holds one side of the token; the
engine holds a LeaseRevocationView (a read-side handle) for the
duration of any forward pass. Flipping the broker side makes the
view’s atomic load return “revoked” — no IPC, no syscall, just a
cache-line read.
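As a rough illustration of the two-sided token described above, here is a minimal sketch. The type names RevocationToken and LeaseRevocationView and the methods revoke, is_revoked, and lease_id appear in this recipe; the field layout, the u64 lease id, the view constructor, and the Release ordering on the store are assumptions made for the example, not the crate's actual definitions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Broker-side handle (assumed layout): owns the write side of the flag.
pub struct RevocationToken {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
}

/// Engine-side read handle; cloning only bumps the Arc refcount.
#[derive(Clone)]
pub struct LeaseRevocationView {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
}

impl RevocationToken {
    pub fn new(lease_id: u64) -> Self {
        Self { lease_id, revoked: Arc::new(AtomicBool::new(false)) }
    }

    /// Hands out a read-only view of the same flag (hypothetical constructor).
    pub fn view(&self) -> LeaseRevocationView {
        LeaseRevocationView { lease_id: self.lease_id, revoked: Arc::clone(&self.revoked) }
    }

    /// Flips the flag. The store ordering is an assumption; the recipe only
    /// specifies that the engine-side load is Relaxed.
    pub fn revoke(&self) {
        self.revoked.store(true, Ordering::Release);
    }
}

impl LeaseRevocationView {
    /// The poll itself: one Relaxed atomic load, no IPC and no syscall.
    #[inline]
    pub fn is_revoked(&self) -> bool {
        self.revoked.load(Ordering::Relaxed)
    }

    pub fn lease_id(&self) -> u64 {
        self.lease_id
    }
}
```

Because both sides share one Arc<AtomicBool>, every clone of the view observes the same flip, which is what makes a per-call clone cheap.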
Poll Placement
poll_revoke_or_return! appears immediately before every kernel
launch in forward_impl_with_shape, forward_impl_paged, and
batched_decode_paged_*. There are ~30 sites per forward. The
contract: at every site, EITHER the launch proceeds (token live)
OR the macro returns CudaForwardError::Revoked without launching.
No site reads/writes device memory between the poll and the launch
that could race with the broker.
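The either/or contract can be checked with a small simulation. The sketch below does not use the real engine: ForwardError, poll, and simulated_forward are hypothetical stand-ins, the lease id 42 is arbitrary, and a counter takes the place of each kernel launch. It shows that a flip just before site k leaves exactly k launches completed and none after.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the typed error; not the real CudaForwardError.
#[derive(Debug, PartialEq)]
pub enum ForwardError {
    Revoked { lease_id: u64 },
}

/// Stand-in for the poll macro: bail before the launch if the flag is set.
fn poll(revoked: &AtomicBool, lease_id: u64) -> Result<(), ForwardError> {
    if revoked.load(Ordering::Relaxed) {
        return Err(ForwardError::Revoked { lease_id });
    }
    Ok(())
}

/// Runs `sites` simulated launches. `revoke_before` flips the flag just before
/// that site's poll, imitating a broker flip between two adjacent launches.
/// Returns how many kernels launched, plus the result of the pass.
pub fn simulated_forward(
    revoked: &AtomicBool,
    sites: usize,
    revoke_before: Option<usize>,
) -> (usize, Result<(), ForwardError>) {
    let mut launched = 0;
    for site in 0..sites {
        if revoke_before == Some(site) {
            revoked.store(true, Ordering::Relaxed);
        }
        if let Err(e) = poll(revoked, 42) {
            return (launched, Err(e));
        }
        launched += 1; // the kernel launch would go here
    }
    (launched, Ok(()))
}
```

With 30 sites and a flip before site 12, the pass reports 12 launches and a Revoked error; with no flip it reports all 30.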
Isolation and Safety
- The view is cloned out per forward call from the engine’s broker handle. It cannot outlive the lease: the broker-side Drop invalidates it.
- The poll uses Relaxed ordering. This is sufficient because the contract is “if the broker flipped before this poll, observe it”; full synchronization isn’t required, since the engine’s next poll catches a later flip.
- The macro returns BEFORE launching, so no new kernel is ever launched against memory the broker already considers reclaimed.
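One way the Drop-invalidates-the-view property could be arranged is sketched below. LeaseHandle, new_lease, and revoked_after_broker_drop are hypothetical names for this example, and the raw Arc<AtomicBool> stands in for the engine's view. The broker thread is joined before the final poll, so the Relaxed load is guaranteed to see the flip in this particular test setup.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical broker-side lease handle; dropping it revokes the lease.
pub struct LeaseHandle {
    revoked: Arc<AtomicBool>,
}

/// Creates a handle plus the shared flag that an engine view would read.
pub fn new_lease() -> (LeaseHandle, Arc<AtomicBool>) {
    let flag = Arc::new(AtomicBool::new(false));
    (LeaseHandle { revoked: Arc::clone(&flag) }, flag)
}

impl Drop for LeaseHandle {
    fn drop(&mut self) {
        // Any view still held by the engine reports "revoked" from here on.
        self.revoked.store(true, Ordering::Release);
    }
}

/// Drops the handle on another thread, then polls from this one.
/// `join` establishes happens-before, so the Relaxed load sees the flip.
pub fn revoked_after_broker_drop() -> bool {
    let (handle, view) = new_lease();
    let live_before = !view.load(Ordering::Relaxed);
    let broker = thread::spawn(move || drop(handle));
    broker.join().expect("broker thread panicked");
    live_before && view.load(Ordering::Relaxed)
}
```

The view keeps the flag allocation alive through the Arc, so a poll after the lease is gone is still memory-safe; it simply reads "revoked".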
Walkthrough (Implementation Sketch)
1. Engine-Side Macro at Every Launch
    // From cuda_forward.rs; repeated ~30 times per forward pass.
    poll_revoke_or_return!(revoke_view);
    flash_attention_multi_head_fused_q_tc_f32_cuda(/* args */)?;

    poll_revoke_or_return!(revoke_view);
    rmsnorm_f32_cuda(/* args */)?;

    poll_revoke_or_return!(revoke_view);
    dispatch_matmul_f32_b_kn_cuda(/* args */)?;
2. Macro Expansion
    macro_rules! poll_revoke_or_return {
        ($view:expr) => {
            if $view.is_revoked() {
                return Err(CudaForwardError::Revoked {
                    lease_id: $view.lease_id(),
                });
            }
        };
    }
3. Broker-Side Revoke
    // In application code (the broker), e.g. on TTL expiry or tenant kill.
    let lease_handle = /* obtained at lease creation */;
    lease_handle.revoke(); // flips the AtomicBool the engine view reads.

    // On the engine's next forward call, the next poll site returns
    // CudaForwardError::Revoked. No further kernels launch.
Verification (Silicon)
On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):
    --- Step 6: intra-kernel-revoke wall-clock (THE categorical claim) ---
    revoke_sync wall-clock: 0.00 ms
    test cuda_engine_e2e_intra_kernel_revoke_bounds_revoke_wall_clock ... ok
    STEP 6 EXIT: 0
0.00 ms is what std::time::Instant::now().elapsed() reported at
millisecond granularity. The actual revoke-to-bail interval is some
sub-millisecond value — bounded above by one kernel-launch latency
(tens of µs for the small kernels in a 0.5B decode path).
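Since a millisecond-scale printout hides the actual interval, a test harness could report the same measurement in microseconds. The following is a sketch only: format_revoke_latency and time_revoke are hypothetical helpers, not functions from the test suite, and the closure passed to time_revoke stands in for whatever flips the token and waits for the engine to return.

```rust
use std::time::{Duration, Instant};

/// Formats a duration in microseconds so sub-millisecond values stay visible.
pub fn format_revoke_latency(elapsed: Duration) -> String {
    let micros = elapsed.as_nanos() as f64 / 1_000.0;
    format!("revoke_sync wall-clock: {:.1} µs", micros)
}

/// Times an arbitrary revoke-and-wait closure (hypothetical harness hook).
pub fn time_revoke<F: FnOnce()>(revoke_and_wait: F) -> Duration {
    let start = Instant::now();
    revoke_and_wait();
    start.elapsed()
}
```

A 37 µs interval would then print as "revoke_sync wall-clock: 37.0 µs" rather than rounding toward zero.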
Run the pin yourself:
    FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    cargo test --release \
        -p grafos-inference-engine \
        --features cuda,test-helpers \
        --test cuda_engine_e2e \
        -- --ignored --nocapture \
        cuda_engine_e2e_intra_kernel_revoke_bounds
Failure Modes
- Missed poll site. A new kernel-launch site without the macro is invisible to the broker until the next poll-protected site runs. Symptom: rare, near-impossible-to-reproduce races where the engine returns one extra token after revocation. Mitigation: every kernel-launch site must be reviewed for the macro; the smoke test cuda_engine_e2e_intra_kernel_revoke_bounds exercises the contract under timing pressure but won’t catch every missing site.
- CUDA Graph replay paths. When CUDA Graphs are enabled (feature = "cuda-graph"), an entire stream of kernels launches as one atomic unit. The poll-before-launch contract doesn’t trivially extend through graph replay: the graph runs to completion. For revocation safety under graph replay, the surrounding loop (between replays) is the cancel granularity. A cleaner intra-graph cancel is a follow-up.
- In-flight kernel completion after revoke. The macro stops future launches but doesn’t cancel the kernel already executing when the flip happens. That kernel writes its output to the engine’s scratch buffers, not to broker-owned memory outside the lease; the next poll site catches the revoke before the engine reads from any new broker buffer. The contract is “no kernel launches against revoked memory,” not “no kernels running at all.”
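The coarser cancel granularity under graph replay can be pictured as a loop that polls only between replays. This is a sketch under assumptions: DecodeError and decode_with_graph_replay are hypothetical, and the replay_graph closure stands in for a real CUDA Graph launch that runs to completion once started.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Revoked,
}

/// Replays a captured graph once per token, polling only between replays.
/// A flip that lands during a replay is observed before the next one starts.
pub fn decode_with_graph_replay<F: FnMut(usize)>(
    revoked: &AtomicBool,
    tokens: usize,
    mut replay_graph: F,
) -> Result<usize, DecodeError> {
    for step in 0..tokens {
        if revoked.load(Ordering::Relaxed) {
            return Err(DecodeError::Revoked);
        }
        replay_graph(step); // runs to completion; no poll inside the graph
    }
    Ok(tokens)
}
```

If the flag flips during the third replay, that replay still finishes and the loop bails before the fourth, so the worst-case bound on this path is one graph replay rather than one kernel launch.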
Observability
Every macro return emits a grafos_observe::Event::ForwardRevoked
record: {lease_id, kernel_site, kv_position}. Operators can
correlate this with the broker’s flip event to measure end-to-end
revoke latency including the broker → engine signal path.
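A sketch of that correlation follows. The field names lease_id, kernel_site, and kv_position come from the record described above; the field types, both timestamp fields, the BrokerFlip record, and the revoke_latency helper are assumptions for illustration, and the sketch further assumes both timestamps are taken from a clock the broker and engine share.

```rust
use std::time::Duration;

/// Engine-side record; field names follow the text, types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ForwardRevoked {
    pub lease_id: u64,
    pub kernel_site: &'static str,
    pub kv_position: usize,
    /// Assumed field: emission time on a clock shared with the broker.
    pub emitted_at: Duration,
}

/// Hypothetical broker-side record of the token flip.
pub struct BrokerFlip {
    pub lease_id: u64,
    pub flipped_at: Duration,
}

/// End-to-end revoke latency for matching lease ids. Returns None if the
/// records belong to different leases or the event precedes the flip.
pub fn revoke_latency(flip: &BrokerFlip, event: &ForwardRevoked) -> Option<Duration> {
    if flip.lease_id != event.lease_id {
        return None;
    }
    event.emitted_at.checked_sub(flip.flipped_at)
}
```

Using checked_sub keeps a skewed clock from producing a panic or a wrapped value; the caller sees None and can flag the pair for inspection.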
Variations
- Per-tensor revocation granularity. Today every kernel-launch poll checks the engine-level handle. Per-tensor (per-LeaseRegion) polls would let the broker revoke individual weights without killing the engine — useful for hot-swapping a single layer in research workloads. Filed as a follow-up.
- Cooperative cancellation token from outside the fabric. Some applications need cancellation from the request-handling layer (e.g., HTTP client disconnect). Passing a second cancellation view alongside the lease view lets poll_revoke_or_return! cover both signals with one atomic load.
- Latency SLO enforcement. The test prints wall-clock but doesn’t enforce an upper bound. A revoke_wall_clock_p99_us pin would catch regressions in the path between broker flip and engine return; filed as a perf-guard follow-up.
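For the latency-SLO variation, a perf guard might look roughly like the sketch below. The helper names p99_us and check_revoke_slo are hypothetical, the nearest-rank percentile definition is one choice among several, and the bound is whatever the caller passes in; this recipe does not specify a threshold.

```rust
/// Nearest-rank p99 over latency samples in microseconds; None if empty.
pub fn p99_us(samples: &[u64]) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: ceil(0.99 * n), then converted to a zero-based index.
    let rank = (sorted.len() * 99 + 99) / 100;
    Some(sorted[rank - 1])
}

/// Returns the observed p99 if it is within `bound_us`, else an error message.
pub fn check_revoke_slo(samples: &[u64], bound_us: u64) -> Result<u64, String> {
    match p99_us(samples) {
        None => Err("no samples".to_string()),
        Some(p99) if p99 <= bound_us => Ok(p99),
        Some(p99) => Err(format!("p99 {} µs exceeds bound {} µs", p99, bound_us)),
    }
}
```

A test would collect revoke-to-return samples over many iterations and fail when check_revoke_slo returns an error, turning the printed wall-clock into an enforced bound.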
Why This Is Recipe 63
Recipes 61 and 62 establish that the engine produces correct output with efficient memory. Recipe 63 establishes that the broker can take that memory back at any time — intra-decode, not between-decodes. Without this property, fabric leases on the inference path would be too coarse-grained to support real multi-tenant serving (Recipe 66).