Recipe 67: Hot-Rebind Inference Continuity After a Lease Revocation
Situation
Recipes 63 and 64 establish how the engine stops cleanly when a lease is revoked. But “stop cleanly” isn’t the production contract. The production contract is “the SURVIVING tenants keep serving, and the affected tenant resumes on a fresh lease without restarting the engine.” A single tenant’s crash, TTL expiry, or quota-exceed event must not take down inference for the rest of the shared GPU.
This recipe builds the recovery half of the revocation pattern: when the engine surfaces a typed revocation or fence error, the serving harness drops the affected tenant’s sequence handle, re-admits the tenant through the scheduler, and continues serving the other tenants without interruption. No engine restart, no full reload of weights.
What You Build
A serving harness around ContinuousBatchScheduler that:
- Catches the per-step EngineError::Fatal("lease revoked: intra-kernel poll") originating from CudaForwardError::IntraKernelRevoke.
- Catches per-request MissingWeight errors that indicate a tensor backing a sequence handle has been freed.
- Tracks which tenant’s lease was the trigger (the harness, not the engine, owns this mapping).
- Drops the affected sequence handle, re-submits the tenant’s prompt through the scheduler, and lets surviving tenants continue to decode through the same ContinuousBatchScheduler instance.
Building Blocks
- grafos_inference_engine::FabricNativeCudaEngine: the GPU engine; shared across all tenants on this GPU.
- grafos_inference_engine::cuda_forward::CudaForwardError: the typed-error surface that carries the IntraKernelRevoke and MissingWeight variants.
- grafos_inference_engine::continuous_batch::ContinuousBatchScheduler: the per-engine scheduler that admits, batches, and reaps requests. Recipe 68 covers its full surface.
- grafos_inference_engine::continuous_batch::PromptRequest: the unit of submission; carries the application’s tenant_id so revocation can be routed back to the right tenant.
- grafos_tensor_kernels_cuda::lease_region::LeaseRegion::CudaDevice: the broker-owned device memory backing each weight tensor. Recipe 63 covers its revocation token.
See:
- Engine session handle (source)
- CudaForwardError variants (source)
- ContinuousBatchScheduler (source)
- Recipe 63 — intra-kernel revoke
- Recipe 64 — FENCED state detection
- Recipe 68 — continuous batching
Design
Resource Model
The engine holds two tiers of fabric-leased resources:
| Tier | Examples | Recovery |
|---|---|---|
| Engine-wide | model weights (Q4_K/Q5_0/Q8_0 packed tensors, embedding table, LM head, output norm) | Drop the engine; rebuild on a fresh model lease. Affects every tenant. |
| Tenant-scoped | per-sequence KV cache slots in the shared paged pool | Drop the tenant’s FabricNativeCudaSequenceHandle; re-admit through the scheduler. Surviving tenants unaffected. |
The recovery flow keys off which tier the failure originated in. The application layer (not the engine) owns that classification — the engine surfaces a typed error; the application’s broker layer tells the harness “this revocation was for tenant X’s KV slot, not the model weights.”
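The tier-to-recovery mapping in the table can be sketched as a small classification step at the harness layer. This is a minimal sketch, not the engine's API: `RevokedLease`, `FailureScope`, `RecoveryAction`, and the `u64` tenant id are hypothetical names and types chosen for illustration, standing in for whatever the application's broker layer actually reports.

```rust
/// Broker-side description of the lease that was revoked.
/// Hypothetical shape: the real broker record is application-defined.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RevokedLease {
    /// The lease backing the model weights.
    ModelWeights,
    /// A lease backing one tenant's KV cache slots.
    KvSlot { tenant_id: u64 },
}

/// Which resource tier the failure originated in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureScope {
    EngineWide,
    TenantScoped { tenant_id: u64 },
}

/// What the harness does next for each tier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryAction {
    /// Drop the engine and rebuild on a fresh model lease.
    RebuildEngine,
    /// Drop one tenant's sequence handle and re-admit that tenant.
    RebindTenant { tenant_id: u64 },
}

/// Map the broker's view of the revoked lease onto a resource tier.
fn classify(lease: RevokedLease) -> FailureScope {
    match lease {
        RevokedLease::ModelWeights => FailureScope::EngineWide,
        RevokedLease::KvSlot { tenant_id } => FailureScope::TenantScoped { tenant_id },
    }
}

/// Pick the recovery path from the tier, mirroring the table's last column.
fn recovery_for(scope: FailureScope) -> RecoveryAction {
    match scope {
        FailureScope::EngineWide => RecoveryAction::RebuildEngine,
        FailureScope::TenantScoped { tenant_id } => RecoveryAction::RebindTenant { tenant_id },
    }
}
```

Keeping classification separate from the recovery decision lets the harness log the scope before acting on it, which helps when diagnosing a misclassified failure.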
Continuity Contract
Surviving tenants:
- Keep their sequence handles.
- Keep their KV cache slots in the shared paged pool.
- Continue receiving SchedulerEvent::Token events on the next scheduler tick; only the revoked tenant’s request is removed from the active pool.
Affected tenant:
- Receives a SchedulerEvent::Completed event with a CompletionReason that surfaces the revocation cause.
- The harness drops their FabricNativeCudaSequenceHandle (cleanup of the KV pool slot is automatic via Drop).
- The harness re-submits a fresh PromptRequest through the scheduler; the tenant’s quota slot reopens for the new request.
- Prior emitted tokens for that tenant are lost (KV state isn’t preserved across rebind; a “checkpoint and resume” variant is filed separately).
Isolation and Safety
- The engine itself doesn’t track per-tenant lease provenance; the application’s broker layer does. The harness reads PromptRequest.tenant_id from the in-flight active set and surfaces the affected tenant on revocation.
- Cross-tenant memory access is structurally prevented by Recipe 66’s block-table indirection: even during rebind, surviving tenants’ KV slots are read through their own block tables.
Walkthrough (Implementation Sketch)
1. Drive the Scheduler Step, Catch Failure Events
    use grafos_inference_engine::continuous_batch::{
        CompletionReason, ContinuousBatchScheduler, PromptRequest, SchedulerError, SchedulerEvent,
    };

    // ApplicationBroker, ServingError, and the helper functions called below
    // are harness-defined; they are not part of the engine crate.
    async fn serve_step<E>(
        scheduler: &mut ContinuousBatchScheduler<E>,
        broker: &dyn ApplicationBroker,
    ) -> Result<(), ServingError>
    where
        E: /* InferenceEngine impl bounds */,
    {
        let events = match scheduler.step().await {
            Ok(events) => events,
            Err(SchedulerError::EngineDecodeFailed(engine_err)) => {
                // The whole step failed (CUDA driver / intra-kernel
                // revoke). No sequence's `current_len` was mutated.
                return classify_and_recover_step_failure(broker, engine_err).await;
            }
            Err(SchedulerError::NoModelLoaded) => {
                return Err(ServingError::NoModelLoaded);
            }
        };

        for event in events {
            match event {
                SchedulerEvent::Token { request, token, position } => {
                    emit(request, token, position);
                }
                SchedulerEvent::Completed { request, reason } => {
                    // Only completions that the policy marks as recovery
                    // triggers lead to a rebind.
                    if let Some(intent) = should_rebind_after(reason) {
                        rebind_request(scheduler, broker, request, intent).await?;
                    }
                }
                SchedulerEvent::AdmissionRejected { request, reason, detail } => {
                    surface_rejection(request, reason, detail);
                }
            }
        }
        Ok(())
    }

2. Decide Whether a Completion Is a Rebind Trigger
    fn should_rebind_after(reason: CompletionReason) -> Option<RebindIntent> {
        // Application policy: not every CompletionReason warrants a
        // retry. EOS / MaxTokens are normal terminations; revocation
        // and CUDA fatal are recovery triggers.
        match reason {
            CompletionReason::EndOfSequence => None,
            CompletionReason::MaxTokensReached => None,
            CompletionReason::EngineError(_) => Some(RebindIntent::ReSubmit),
            // Other variants are application-defined.
            _ => None,
        }
    }

3. Re-Submit the Tenant Through the Scheduler
    // RequestId and RebindIntent are assumed to be in scope here.
    async fn rebind_request<E>(
        scheduler: &mut ContinuousBatchScheduler<E>,
        broker: &dyn ApplicationBroker,
        request_id: RequestId,
        _intent: RebindIntent, // only ReSubmit exists in this sketch
    ) -> Result<(), ServingError>
    where
        E: /* InferenceEngine impl bounds */,
    {
        // Look up what the tenant originally asked for.
        let context = broker.context_for(request_id);

        // Same tenant and prompt, new request id.
        let fresh_request = PromptRequest {
            tenant_id: context.tenant_id,
            request_id: RequestId::new(),
            token_ids: context.original_prompt.clone(),
            sampling: context.sampling.clone(),
            max_emitted_tokens: context.max_emitted_tokens,
        };

        scheduler.submit(fresh_request)?;
        Ok(())
    }

The scheduler’s admission gate enforces the tenant’s max_concurrent quota, so a misbehaving tenant that keeps crashing can’t monopolize the rebind path.
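Step 3 re-submits the original prompt unchanged. Where the harness has retained the tokens it already emitted to the tenant (see the lost-KV-state entry under Failure Modes), the rebind prompt can be built by appending them to the original prompt and shrinking the emit budget accordingly. The following is a minimal sketch under stated assumptions: `RebindPlan` and `plan_rebind` are hypothetical names, token ids are assumed to be `u32`, and the harness is assumed to have recorded the emitted tokens itself.

```rust
/// Inputs for the fresh request after a rebind (hypothetical harness type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct RebindPlan {
    /// Original prompt followed by the tokens the tenant already received.
    token_ids: Vec<u32>,
    /// How many tokens the tenant may still be sent.
    remaining_budget: usize,
}

/// Build the rebind prompt from the original prompt plus retained output.
///
/// The emitted tokens are re-fed as prompt tokens so the fresh KV cache is
/// rebuilt by prefill, and the emit budget is reduced by what was already sent.
fn plan_rebind(
    original_prompt: &[u32],
    emitted: &[u32],
    max_emitted_tokens: usize,
) -> RebindPlan {
    let mut token_ids = Vec::with_capacity(original_prompt.len() + emitted.len());
    token_ids.extend_from_slice(original_prompt);
    token_ids.extend_from_slice(emitted);
    RebindPlan {
        token_ids,
        // saturating_sub guards against a budget smaller than the history.
        remaining_budget: max_emitted_tokens.saturating_sub(emitted.len()),
    }
}
```

A `remaining_budget` of zero means the original request had already reached its limit, so the harness can complete it instead of re-submitting.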
Verification
This recipe composes three primitives whose individual contracts are verified by their own pins:
- Intra-kernel revoke (Recipe 63): the engine surfaces IntraKernelRevoke at the next kernel-launch boundary after the broker flips a revocation token.
- FENCED state (Recipe 64): after revocation, subsequent ops against the dropped lease return MissingWeight rather than silently succeeding.
- Multi-tenant batched decode (Recipe 66): surviving tenants continue to produce correct tokens in a batched forward pass after one tenant is removed from the active pool.
The recovery sequence in this recipe is a pure application-layer
composition of those three. The ContinuousBatchScheduler’s
step() returns typed SchedulerError variants for engine-level
failures and per-request SchedulerEvent::Completed events for
in-flight failures — the same surface a non-revocation
completion path uses.
A dedicated end-to-end test that exercises the full rebind
sequence (continuous_batch_scheduler_recovers_one_tenant_under_revocation)
is filed as a follow-up against the continuous_batch test suite.
Failure Modes
- Misclassification of failure scope. If the harness treats an engine-scoped failure as tenant-scoped, the rebind succeeds (cheap) but the next scheduler.step() returns the same error immediately. Mitigation: classify by inspecting the application-side broker state for whether the revoked lease was the model’s or the tenant’s, before invoking the rebind path.
- Cascading rebind storms. If a tenant’s lease keeps getting revoked (e.g., it’s hitting quota every cycle), the harness rebinds every few seconds. The contract is correct but the observable behavior is bad. Mitigation: a per-tenant rebind rate limiter at the harness layer.
- Lost KV state. Rebind starts the tenant’s KV cache fresh — prior decoded tokens are gone. The tenant must re-submit the full original prompt (or a checkpointed prefix). For interactive workloads where the prior emit history matters semantically (e.g., a chat session mid-turn), the harness should preserve the prior emitted tokens and prepend them to the rebind prompt.
- Pool exhaustion on rebind. The new sequence handle needs a KV pool slot. If the pool is full, the scheduler emits SchedulerEvent::AdmissionRejected with reason PoolExhausted. The recovery path must handle this, typically by evicting another tenant (Recipe 26’s preemption pattern) or waiting for the next pool turnover.
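The per-tenant rebind rate limiter suggested for cascading rebind storms could look like the sliding-window sketch below. It is an illustration only: `RebindLimiter`, its fields, and the `u64` tenant id are hypothetical, and the caller passes `now` explicitly so the policy can be tested without sleeping.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Sliding-window limiter: at most `max_rebinds` per tenant per `window`.
/// Hypothetical harness-side type; not part of the engine crate.
struct RebindLimiter {
    max_rebinds: usize,
    window: Duration,
    /// Timestamps of recent rebinds per tenant, oldest first.
    history: HashMap<u64, VecDeque<Instant>>,
}

impl RebindLimiter {
    fn new(max_rebinds: usize, window: Duration) -> Self {
        Self { max_rebinds, window, history: HashMap::new() }
    }

    /// Returns true (and records the attempt) if the tenant may rebind at `now`.
    fn try_acquire(&mut self, tenant_id: u64, now: Instant) -> bool {
        let stamps = self.history.entry(tenant_id).or_default();

        // Forget attempts that have aged out of the window.
        while let Some(&oldest) = stamps.front() {
            if now.duration_since(oldest) >= self.window {
                stamps.pop_front();
            } else {
                break;
            }
        }

        if stamps.len() < self.max_rebinds {
            stamps.push_back(now);
            true
        } else {
            // Over budget: the harness should delay or surface the failure.
            false
        }
    }
}
```

In the rebind path, the harness would call `try_acquire(tenant_id, Instant::now())` before re-submitting and, on `false`, either defer the rebind or report the failure to the tenant instead of retrying.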
Observability
Pair each rebind with a structured log + metric: tenant id, old request id, new request id, prompt length, rebind latency. The operator dashboard surfaces:
- Rebind rate per tenant (high rate → cascading-rebind storm).
- Rebind latency p50 / p99: should be sub-second for tenant-scoped recoveries, multi-second only for engine-scoped (which involves a model re-acquire and falls outside this recipe’s scope).
- Concurrent live-tenant count — should NOT drop when a single tenant rebinds.
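The rebind record and the latency percentiles above can be sketched as follows. The field set follows the list in this section; the struct name, the field types, and the nearest-rank percentile method are assumptions made for illustration, and a production harness would more likely feed a histogram metric than sort raw samples.

```rust
use std::time::Duration;

/// One rebind as recorded by the harness (hypothetical shape).
#[derive(Debug, Clone)]
struct RebindRecord {
    tenant_id: u64,
    old_request_id: u64,
    new_request_id: u64,
    prompt_len: usize,
    latency: Duration,
}

/// Nearest-rank percentile of rebind latency; `p` must be in (0, 100].
/// Returns None for an empty slice or an out-of-range `p`.
fn latency_percentile(records: &[RebindRecord], p: f64) -> Option<Duration> {
    if records.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut latencies: Vec<Duration> = records.iter().map(|r| r.latency).collect();
    latencies.sort();
    // Nearest-rank: 1-based rank = ceil(p / 100 * n).
    let rank = ((p / 100.0) * latencies.len() as f64).ceil() as usize;
    Some(latencies[rank.max(1) - 1])
}

/// Number of rebinds recorded for one tenant, for the per-tenant rate panel.
fn rebinds_for_tenant(records: &[RebindRecord], tenant_id: u64) -> usize {
    records.iter().filter(|r| r.tenant_id == tenant_id).count()
}
```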
Variations
- Stateful rebind (checkpoint + resume). Preserve a tenant’s KV cache by writing it to fabric-leased memory before the rebind, then replaying it into the fresh sequence handle. Costs bandwidth (full KV checkpoint) but preserves session semantics across rebind.
- Pre-emptive rebind. Watch the lease’s TTL countdown; rebind proactively at TTL_REMAINING < threshold instead of reactively on revocation. Trades unnecessary work for zero user-visible rebind latency.
- Engine-pool rebinding. When an engine-scoped failure fires, the harness can route the entire workload to a backup engine on a different cell (Recipe 39’s cross-cloud pipeline pattern adapted to inference). The model lease is acquired on the backup before the primary is fully torn down.
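The pre-emptive variation reduces to a threshold comparison on the lease's remaining TTL. The sketch below assumes the harness can read the remaining TTL as a `Duration`; `LeaseAction`, `preempt_threshold`, and `lease_action` are hypothetical names, and deriving the threshold from an expected rebind latency plus a margin is one possible policy, not something this recipe prescribes.

```rust
use std::time::Duration;

/// What the harness should do with a live lease (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeaseAction {
    /// Enough TTL left; keep decoding on the current lease.
    Keep,
    /// TTL is below the threshold; start the rebind now.
    RebindNow,
}

/// One way to pick the threshold: expected rebind latency plus a safety margin.
fn preempt_threshold(expected_rebind_latency: Duration, margin: Duration) -> Duration {
    expected_rebind_latency.saturating_add(margin)
}

/// Rebind proactively once the remaining TTL drops below the threshold.
fn lease_action(ttl_remaining: Duration, threshold: Duration) -> LeaseAction {
    if ttl_remaining < threshold {
        LeaseAction::RebindNow
    } else {
        LeaseAction::Keep
    }
}
```

A larger margin triggers rebinds earlier, which increases the unnecessary work this variation trades for lower user-visible latency.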
Why This Is Recipe 67
Recipes 63 + 64 cover stopping cleanly. Recipe 67 covers continuing cleanly. Together they make fabric-leased inference operationally robust: a misbehaving tenant doesn’t take down the shared engine; it rebinds and the others don’t notice.