Recipe 67: Hot-Rebind Inference Continuity After a Lease Revocation

Situation

Recipes 63 and 64 establish how the engine stops cleanly when a lease is revoked. But “stop cleanly” isn’t the production contract. The production contract is that the surviving tenants keep serving and the affected tenant resumes on a fresh lease without restarting the engine. A single tenant’s crash, TTL expiry, or quota-exceeded event must not take down inference for the rest of the shared GPU.

This recipe builds the recovery half of the revocation pattern: when the engine surfaces a typed revocation or fence error, the serving harness drops the affected tenant’s sequence handle, re-admits the tenant through the scheduler, and continues serving the other tenants without interruption. No engine restart, no full reload of weights.

What You Build

A serving harness around ContinuousBatchScheduler that:

  1. Catches per-step EngineError::Fatal("lease revoked: intra-kernel poll") originating from CudaForwardError::IntraKernelRevoke.
  2. Catches per-request MissingWeight errors that indicate a tensor backing a sequence handle has been freed.
  3. Tracks which tenant’s lease was the trigger (the harness, not the engine, owns this mapping).
  4. Drops the affected sequence handle, re-submits the tenant’s prompt through the scheduler, and lets surviving tenants continue to decode through the same ContinuousBatchScheduler instance.

Building Blocks

  • grafos_inference_engine::FabricNativeCudaEngine — the GPU engine; shared across all tenants on this GPU.
  • grafos_inference_engine::cuda_forward::CudaForwardError — the typed-error surface that carries IntraKernelRevoke and MissingWeight variants.
  • grafos_inference_engine::continuous_batch::ContinuousBatchScheduler — the per-engine scheduler that admits, batches, and reaps requests. Recipe 68 covers its full surface.
  • grafos_inference_engine::continuous_batch::PromptRequest — the unit of submission; carries the application’s tenant_id so revocation can be routed back to the right tenant.
  • grafos_tensor_kernels_cuda::lease_region::LeaseRegion::CudaDevice — the broker-owned device memory backing each weight tensor. Recipe 63 covers its revocation token.

Design

Resource Model

The engine holds two tiers of fabric-leased resources:

| Tier | Examples | Recovery |
|---|---|---|
| Engine-wide | model weights (Q4_K/Q5_0/Q8_0 packed tensors, embedding table, LM head, output norm) | Drop the engine; rebuild on a fresh model lease. Affects every tenant. |
| Tenant-scoped | per-sequence KV cache slots in the shared paged pool | Drop the tenant’s FabricNativeCudaSequenceHandle; re-admit through the scheduler. Surviving tenants unaffected. |

The recovery flow keys off which tier the failure originated in. The application layer (not the engine) owns that classification — the engine surfaces a typed error; the application’s broker layer tells the harness “this revocation was for tenant X’s KV slot, not the model weights.”
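
A minimal sketch of that classification step, assuming a hypothetical broker-side registry that records which owner each lease was issued for. The names `LeaseId`, `LeaseOwner`, `LeaseRegistry`, and `FailureScope` are illustrative only and are not part of grafos_inference_engine. Treating an unknown lease as engine-wide is also an assumption of this sketch: it is the conservative choice, because a wrongly tenant-scoped recovery would leave the engine in a failing state.

```rust
use std::collections::HashMap;

/// Hypothetical identifiers; the real broker types are not shown in this recipe.
pub type LeaseId = u64;
pub type TenantId = u32;

/// Who a lease was issued for, as recorded by the application's broker layer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseOwner {
    ModelWeights,
    TenantKv(TenantId),
}

/// Which recovery path the harness should take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureScope {
    EngineWide,
    TenantScoped(TenantId),
}

#[derive(Default)]
pub struct LeaseRegistry {
    owners: HashMap<LeaseId, LeaseOwner>,
}

impl LeaseRegistry {
    /// Record the owner when the broker grants a lease.
    pub fn record(&mut self, lease: LeaseId, owner: LeaseOwner) {
        self.owners.insert(lease, owner);
    }

    /// Map a revoked lease to a recovery tier.
    /// Unknown leases fall back to the engine-wide path (assumption of this sketch).
    pub fn classify(&self, revoked: LeaseId) -> FailureScope {
        match self.owners.get(&revoked) {
            Some(LeaseOwner::TenantKv(tenant)) => FailureScope::TenantScoped(*tenant),
            Some(LeaseOwner::ModelWeights) | None => FailureScope::EngineWide,
        }
    }
}
```

The harness would consult `classify` before choosing between the tenant rebind path and an engine rebuild.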

Continuity Contract

Surviving tenants:

  • Keep their sequence handles.
  • Keep their KV cache slots in the shared paged pool.
  • Continue receiving SchedulerEvent::Token events on the next scheduler tick — only the revoked tenant’s request is removed from the active pool.

Affected tenant:

  • Receives a SchedulerEvent::Completed event with a CompletionReason that surfaces the revocation cause.
  • The harness drops their FabricNativeCudaSequenceHandle (cleanup of the KV pool slot is automatic via Drop).
  • The harness re-submits a fresh PromptRequest through the scheduler; the tenant’s quota slot reopens for the new request.
  • Prior emitted tokens for that tenant are lost (KV state isn’t preserved across rebind — a “checkpoint and resume” variant is filed separately).
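
The "cleanup is automatic via Drop" point can be illustrated with a toy model. This is a sketch under stated assumptions, not the engine's implementation: `ToyKvPool` and `ToySequenceHandle` are hypothetical stand-ins for the shared paged pool and FabricNativeCudaSequenceHandle, and the single-threaded `Rc<RefCell<..>>` sharing is chosen only to keep the example short.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy stand-in for the shared paged KV pool: a list of free slot ids.
#[derive(Debug)]
pub struct ToyKvPool {
    free_slots: Vec<usize>,
}

impl ToyKvPool {
    pub fn with_slots(n: usize) -> Rc<RefCell<ToyKvPool>> {
        Rc::new(RefCell::new(ToyKvPool {
            free_slots: (0..n).collect(),
        }))
    }

    pub fn free_count(&self) -> usize {
        self.free_slots.len()
    }
}

/// Toy stand-in for a sequence handle: owns one pool slot until dropped.
pub struct ToySequenceHandle {
    slot: usize,
    pool: Rc<RefCell<ToyKvPool>>,
}

impl ToySequenceHandle {
    /// Take a free slot, or return None when the pool is exhausted.
    pub fn acquire(pool: &Rc<RefCell<ToyKvPool>>) -> Option<ToySequenceHandle> {
        let slot = pool.borrow_mut().free_slots.pop()?;
        Some(ToySequenceHandle {
            slot,
            pool: Rc::clone(pool),
        })
    }

    pub fn slot(&self) -> usize {
        self.slot
    }
}

impl Drop for ToySequenceHandle {
    /// Dropping the handle returns its slot to the pool; other handles are untouched.
    fn drop(&mut self) {
        self.pool.borrow_mut().free_slots.push(self.slot);
    }
}
```

In this model, dropping the affected tenant's handle frees exactly one slot, which the re-submitted request can then take, while the surviving tenants' handles keep their slots.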

Isolation and Safety

  • The engine itself doesn’t track per-tenant lease provenance; the application’s broker layer does. The harness reads the PromptRequest.tenant_id from the in-flight active set and surfaces the affected tenant on revocation.
  • Cross-tenant memory access is structurally prevented by Recipe 66’s block-table indirection: even during rebind, surviving tenants’ KV slots are read through their own block tables.
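
One way the harness might keep the request-to-tenant mapping is a small in-flight table, sketched here. `InFlightSet` and `InFlightContext` are hypothetical names, and the field set is an assumption modeled on the PromptRequest fields used later in this recipe.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identifiers for illustration.
pub type RequestId = u64;
pub type TenantId = u32;

/// What the harness needs in order to route a revocation and rebuild a request.
#[derive(Debug, Clone, PartialEq)]
pub struct InFlightContext {
    pub tenant_id: TenantId,
    pub original_prompt: Vec<u32>,
    pub max_emitted_tokens: usize,
}

/// Harness-owned view of the active set, keyed by request id.
#[derive(Default)]
pub struct InFlightSet {
    by_request: HashMap<RequestId, InFlightContext>,
}

impl InFlightSet {
    /// Call when a request is submitted to the scheduler.
    pub fn track(&mut self, id: RequestId, ctx: InFlightContext) {
        self.by_request.insert(id, ctx);
    }

    /// Which tenant owns this request, if it is still in flight.
    pub fn tenant_of(&self, id: RequestId) -> Option<TenantId> {
        self.by_request.get(&id).map(|ctx| ctx.tenant_id)
    }

    /// Call on completion: removes the entry and hands back its context
    /// so a rebind can reuse the original prompt.
    pub fn take(&mut self, id: RequestId) -> Option<InFlightContext> {
        self.by_request.remove(&id)
    }

    /// Number of distinct tenants with at least one in-flight request.
    pub fn live_tenants(&self) -> usize {
        self.by_request
            .values()
            .map(|ctx| ctx.tenant_id)
            .collect::<HashSet<_>>()
            .len()
    }
}
```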

Walkthrough (Implementation Sketch)

1. Drive the Scheduler Step, Catch Failure Events

use grafos_inference_engine::continuous_batch::{
    CompletionReason, ContinuousBatchScheduler, PromptRequest,
    SchedulerError, SchedulerEvent,
};

// `ApplicationBroker`, `ServingError`, `emit`, `surface_rejection`, and
// `classify_and_recover_step_failure` are application-defined.
async fn serve_step<E>(
    scheduler: &mut ContinuousBatchScheduler<E>,
    broker: &dyn ApplicationBroker,
) -> Result<(), ServingError>
where
    E: /* InferenceEngine impl bounds */,
{
    let events = match scheduler.step().await {
        Ok(events) => events,
        Err(SchedulerError::EngineDecodeFailed(engine_err)) => {
            // The whole step failed (CUDA driver / intra-kernel
            // revoke). No sequence's `current_len` was mutated.
            return classify_and_recover_step_failure(broker, engine_err).await;
        }
        Err(SchedulerError::NoModelLoaded) => {
            return Err(ServingError::NoModelLoaded);
        }
    };
    for event in events {
        match event {
            SchedulerEvent::Token { request, token, position } => {
                emit(request, token, position);
            }
            SchedulerEvent::Completed { request, reason } => {
                // Only failure completions yield a rebind intent.
                if let Some(intent) = should_rebind_after(reason) {
                    rebind_request(scheduler, broker, request, intent).await?;
                }
            }
            SchedulerEvent::AdmissionRejected { request, reason, detail } => {
                surface_rejection(request, reason, detail);
            }
        }
    }
    Ok(())
}

2. Decide Whether a Completion Is a Rebind Trigger

// Application-defined: what the harness should do after a failed completion.
enum RebindIntent {
    ReSubmit,
}

fn should_rebind_after(reason: CompletionReason) -> Option<RebindIntent> {
    // Application policy: not every CompletionReason warrants a
    // retry. EOS / MaxTokens are normal terminations; revocation
    // and CUDA fatal are recovery triggers.
    match reason {
        CompletionReason::EndOfSequence => None,
        CompletionReason::MaxTokensReached => None,
        CompletionReason::EngineError(_) => Some(RebindIntent::ReSubmit),
        // Other variants are application-defined.
        _ => None,
    }
}

3. Re-Submit the Tenant Through the Scheduler

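async fn rebind_request<E>(
    scheduler: &mut ContinuousBatchScheduler<E>,
    broker: &dyn ApplicationBroker,
    request_id: RequestId,
    // Unused here: this sketch handles only the re-submit case.
    _intent: RebindIntent,
) -> Result<(), ServingError>
where
    E: /* InferenceEngine impl bounds */,
{
    // The broker holds the context captured when the request was first submitted.
    let context = broker.context_for(request_id);
    let fresh_request = PromptRequest {
        tenant_id: context.tenant_id,
        // A new id, so the failed request and its retry stay distinguishable.
        request_id: RequestId::new(),
        token_ids: context.original_prompt.clone(),
        sampling: context.sampling.clone(),
        max_emitted_tokens: context.max_emitted_tokens,
    };
    // `?` assumes the submit error converts into ServingError.
    scheduler.submit(fresh_request)?;
    Ok(())
}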

The scheduler’s admission gate enforces the tenant’s max_concurrent quota — so a misbehaving tenant that keeps crashing can’t monopolize the rebind path.
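
To illustrate why the quota bounds the rebind path, here is a minimal sketch of a per-tenant concurrency gate. It is not the scheduler's actual admission logic, whose internals this recipe does not show; `ConcurrencyGate` and `Admission` are hypothetical names. The point it demonstrates is ordering: the failed request's slot has to be released before the re-submitted request can be admitted.

```rust
use std::collections::HashMap;

pub type TenantId = u32;

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Admitted,
    QuotaExceeded,
}

/// Hypothetical per-tenant gate with one shared `max_concurrent` limit.
pub struct ConcurrencyGate {
    max_concurrent: usize,
    active: HashMap<TenantId, usize>,
}

impl ConcurrencyGate {
    pub fn new(max_concurrent: usize) -> Self {
        ConcurrencyGate {
            max_concurrent,
            active: HashMap::new(),
        }
    }

    /// Admit the request only if the tenant is below its limit.
    pub fn try_admit(&mut self, tenant: TenantId) -> Admission {
        let count = self.active.entry(tenant).or_insert(0);
        if *count >= self.max_concurrent {
            return Admission::QuotaExceeded;
        }
        *count += 1;
        Admission::Admitted
    }

    /// Call when a request completes, including completion by revocation.
    pub fn release(&mut self, tenant: TenantId) {
        if let Some(count) = self.active.get_mut(&tenant) {
            *count = count.saturating_sub(1);
        }
    }
}
```

With a limit of one, a tenant whose request was revoked gets exactly one replacement request admitted, and other tenants' counts are never touched.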

Verification

This recipe composes three primitives whose individual contracts are verified by their own pins:

  • Intra-kernel revoke (Recipe 63): the engine surfaces IntraKernelRevoke at the next kernel-launch boundary after the broker flips a revocation token.
  • FENCED state (Recipe 64): after revocation, subsequent ops against the dropped lease return MissingWeight rather than silently succeeding.
  • Multi-tenant batched decode (Recipe 66): surviving tenants continue to produce correct tokens in a batched forward pass after one tenant is removed from the active pool.

The recovery sequence in this recipe is a pure application-layer composition of those three. The ContinuousBatchScheduler’s step() returns typed SchedulerError variants for engine-level failures and per-request SchedulerEvent::Completed events for in-flight failures — the same surface a non-revocation completion path uses.

A dedicated end-to-end test that exercises the full rebind sequence (continuous_batch_scheduler_recovers_one_tenant_under_revocation) is filed as a follow-up against the continuous_batch test suite.

Failure Modes

  • Misclassification of failure scope. If the harness treats an engine-scoped failure as tenant-scoped, the rebind succeeds (cheap) but the next scheduler.step() returns the same error immediately. Mitigation: classify by inspecting the application-side broker state for whether the revoked lease was the model’s or the tenant’s, before invoking the rebind path.
  • Cascading rebind storms. If a tenant’s lease keeps getting revoked (e.g., it’s hitting quota every cycle), the harness rebinds every few seconds. The contract is correct but the observable behavior is bad. Mitigation: a per-tenant rebind rate limiter at the harness layer.
  • Lost KV state. Rebind starts the tenant’s KV cache fresh, so prior decoded tokens are gone. The tenant must re-submit the full original prompt (or a checkpointed prefix). For interactive workloads where the prior emit history matters semantically (e.g., a chat session mid-turn), the harness should preserve the previously emitted tokens and append them to the original prompt when it builds the rebind request, so that decoding resumes where it left off.
  • Pool exhaustion on rebind. The new sequence handle needs a KV pool slot. If the pool is full, the scheduler emits SchedulerEvent::AdmissionRejected with reason PoolExhausted. The recovery path must handle this, typically by evicting another tenant (Recipe 26’s preemption pattern) or waiting for the next pool turnover.
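
The per-tenant rebind rate limiter suggested for cascading rebind storms could look like the sliding-window sketch below. `RebindLimiter` is a hypothetical harness-side type, and the window and limit values are application policy, not something this recipe prescribes. The current time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

pub type TenantId = u32;

/// Allows at most `max_rebinds` rebinds per tenant within any `window`.
pub struct RebindLimiter {
    max_rebinds: usize,
    window: Duration,
    history: HashMap<TenantId, VecDeque<Instant>>,
}

impl RebindLimiter {
    pub fn new(max_rebinds: usize, window: Duration) -> Self {
        RebindLimiter {
            max_rebinds,
            window,
            history: HashMap::new(),
        }
    }

    /// Returns true and records the attempt if the tenant is under its limit.
    /// Denied attempts are not recorded.
    pub fn allow(&mut self, tenant: TenantId, now: Instant) -> bool {
        let recent = self.history.entry(tenant).or_default();
        // Forget rebinds that have aged out of the window.
        while let Some(&oldest) = recent.front() {
            if now.duration_since(oldest) >= self.window {
                recent.pop_front();
            } else {
                break;
            }
        }
        if recent.len() >= self.max_rebinds {
            return false;
        }
        recent.push_back(now);
        true
    }
}
```

The harness would call `allow` before re-submitting; a denial is a natural point to surface the failure to the tenant instead of retrying.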

Observability

Pair each rebind with a structured log + metric: tenant id, old request id, new request id, prompt length, rebind latency. The operator dashboard surfaces:

  • Rebind rate per tenant (high rate → cascading-rebind storm).
  • Rebind latency p50 / p99: should be sub-second for tenant-scoped recoveries and multi-second only for engine-scoped ones (which involve a model re-acquire and fall outside this recipe’s scope).
  • Concurrent live-tenant count — should NOT drop when a single tenant rebinds.
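
A minimal sketch of the per-rebind record and the latency percentiles, assuming the harness keeps recent records in memory. `RebindRecord` and `latency_percentile` are hypothetical names; a production harness would more likely feed these values into its existing metrics library. The percentile uses the nearest-rank method.

```rust
use std::time::Duration;

/// One structured record per rebind, with the fields listed above.
#[derive(Debug, Clone)]
pub struct RebindRecord {
    pub tenant_id: u32,
    pub old_request_id: u64,
    pub new_request_id: u64,
    pub prompt_len: usize,
    pub latency: Duration,
}

/// Nearest-rank percentile over rebind latencies.
/// Returns None for empty input or for `p` outside (0, 100].
pub fn latency_percentile(records: &[RebindRecord], p: f64) -> Option<Duration> {
    if records.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut latencies: Vec<Duration> = records.iter().map(|r| r.latency).collect();
    latencies.sort();
    // Rank is 1-based: the smallest rank covering at least p percent of samples.
    let rank = ((p / 100.0) * latencies.len() as f64).ceil() as usize;
    let index = rank.clamp(1, latencies.len()) - 1;
    Some(latencies[index])
}
```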

Variations

  • Stateful rebind (checkpoint + resume). Preserve a tenant’s KV cache by writing it to fabric-leased memory before the rebind, then replaying it into the fresh sequence handle. Costs bandwidth (full KV checkpoint) but preserves session semantics across rebind.
  • Pre-emptive rebind. Watch the lease’s TTL countdown; rebind proactively at TTL_REMAINING < threshold instead of reactively on revocation. Trades unnecessary work for zero user-visible rebind latency.
  • Engine-pool rebinding. When an engine-scoped failure fires, the harness can route the entire workload to a backup engine on a different cell (Recipe 39’s cross-cloud pipeline pattern adapted to inference). The model lease is acquired on the backup before the primary is fully torn down.
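
The pre-emptive variation reduces to a TTL check on each scheduler tick. The sketch below assumes the harness knows when a lease was granted and its TTL; `LeaseWatch` is a hypothetical type, and how the real broker exposes the TTL countdown is not covered in this recipe.

```rust
use std::time::{Duration, Instant};

/// Tracks one lease's expiry and the threshold for proactive rebind.
pub struct LeaseWatch {
    expires_at: Instant,
    rebind_threshold: Duration,
}

impl LeaseWatch {
    pub fn new(granted_at: Instant, ttl: Duration, rebind_threshold: Duration) -> Self {
        LeaseWatch {
            expires_at: granted_at + ttl,
            rebind_threshold,
        }
    }

    /// Time left on the lease; zero once it has expired.
    pub fn ttl_remaining(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// True when the remaining TTL has fallen below the threshold.
    pub fn due_for_rebind(&self, now: Instant) -> bool {
        self.ttl_remaining(now) < self.rebind_threshold
    }
}
```

A reasonable starting threshold is something above the observed tenant-scoped rebind latency, so the fresh lease is in place before the old one expires.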

Why This Is Recipe 67

Recipes 63 + 64 cover stopping cleanly. Recipe 67 covers continuing cleanly. Together they make fabric-leased inference operationally robust: a misbehaving tenant doesn’t take down the shared engine; it rebinds and the others don’t notice.