Recipe 67: Hot-Rebind Inference Continuity After a Lease Revocation

Situation

Recipes 63 and 64 establish how the engine stops cleanly when a lease is revoked. But “stop cleanly” isn’t the production contract. The production contract is “the SURVIVING tenants keep serving, and the affected tenant resumes on a fresh lease without restarting the engine.” A single tenant’s crash, TTL expiry, or quota-exceed event must not take down inference for the other tenants sharing the GPU.

This recipe builds the recovery half of the revocation pattern: when the engine surfaces a typed revocation or fence error, the serving harness drops the affected tenant’s sequence handle, re-admits the tenant through the scheduler, and continues serving the other tenants without interruption. No engine restart, no full reload of weights.

What You Build

A serving harness around ContinuousBatchScheduler that:

Catches per-step EngineError::Fatal("lease revoked: intra-kernel poll") originating from CudaForwardError::IntraKernelRevoke.
Catches per-request MissingWeight errors that indicate a tensor backing a sequence handle has been freed.
Tracks which tenant’s lease was the trigger (the harness, not the engine, owns this mapping).
Drops the affected sequence handle, re-submits the tenant’s prompt through the scheduler, and lets surviving tenants continue to decode through the same ContinuousBatchScheduler instance.

Building Blocks

grafos_inference_engine::FabricNativeCudaEngine — the GPU engine; shared across all tenants on this GPU.
grafos_inference_engine::cuda_forward::CudaForwardError — the typed-error surface that carries IntraKernelRevoke and MissingWeight variants.
grafos_inference_engine::continuous_batch::ContinuousBatchScheduler — the per-engine scheduler that admits, batches, and reaps requests. Recipe 68 covers its full surface.
grafos_inference_engine::continuous_batch::PromptRequest — the unit of submission; carries the application’s tenant_id so revocation can be routed back to the right tenant.
grafos_tensor_kernels_cuda::lease_region::LeaseRegion::CudaDevice — the broker-owned device memory backing each weight tensor. Recipe 63 covers its revocation token.

Design

Resource Model

The engine holds two tiers of fabric-leased resources:

Tier	Examples	Recovery
Engine-wide	model weights (Q4_K/Q5_0/Q8_0 packed tensors, embedding table, LM head, output norm)	Drop the engine; rebuild on a fresh model lease. Affects every tenant.
Tenant-scoped	per-sequence KV cache slots in the shared paged pool	Drop the tenant’s `FabricNativeCudaSequenceHandle`; re-admit through the scheduler. Surviving tenants unaffected.

The recovery flow keys off which tier the failure originated in. The application layer (not the engine) owns that classification — the engine surfaces a typed error; the application’s broker layer tells the harness “this revocation was for tenant X’s KV slot, not the model weights.”
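The tier classification above can be sketched as a small lookup on the application side. This is a minimal illustration, not the engine's or the broker's API: `LeaseRegistry`, `LeaseOwner`, `FailureScope`, and the use of `u64` for lease and tenant ids are all hypothetical names introduced here.

```rust
use std::collections::HashMap;

/// Which resource tier a revoked lease belonged to (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FailureScope {
    /// Model weights were revoked: every tenant is affected.
    EngineWide,
    /// One tenant's KV slot was revoked: only that tenant rebinds.
    TenantScoped { tenant_id: u64 },
    /// The broker layer has no record of this lease.
    Unknown,
}

/// What the application's broker layer remembers about each lease
/// it handed out (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseOwner {
    Model,
    TenantKv { tenant_id: u64 },
}

/// Hypothetical application-side registry: lease id -> owner.
#[derive(Default)]
pub struct LeaseRegistry {
    owners: HashMap<u64, LeaseOwner>,
}

impl LeaseRegistry {
    /// Record who owns a lease at the time it is granted.
    pub fn record(&mut self, lease_id: u64, owner: LeaseOwner) {
        self.owners.insert(lease_id, owner);
    }

    /// Map a revoked lease id to the tier that must recover.
    pub fn classify(&self, revoked_lease_id: u64) -> FailureScope {
        match self.owners.get(&revoked_lease_id) {
            Some(LeaseOwner::Model) => FailureScope::EngineWide,
            Some(LeaseOwner::TenantKv { tenant_id }) => FailureScope::TenantScoped {
                tenant_id: *tenant_id,
            },
            None => FailureScope::Unknown,
        }
    }
}
```

A harness would likely treat `Unknown` conservatively (for example, as engine-wide), since rebinding a tenant after an engine-scoped failure only reproduces the same error on the next step.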

Continuity Contract

Surviving tenants:

Keep their sequence handles.
Keep their KV cache slots in the shared paged pool.
Continue receiving SchedulerEvent::Token events on the next scheduler tick — only the revoked tenant’s request is removed from the active pool.

Affected tenant:

Receives a SchedulerEvent::Completed event with a CompletionReason that surfaces the revocation cause.
The harness drops their FabricNativeCudaSequenceHandle (cleanup of the KV pool slot is automatic via Drop).
The harness re-submits a fresh PromptRequest through the scheduler; the tenant’s quota slot reopens for the new request.
KV state for the tenant’s previously emitted tokens is lost: it isn’t preserved across rebind (a “checkpoint and resume” variant is filed separately).
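Because KV state does not survive the rebind, any continuity of output has to be rebuilt from tokens the harness itself retained. The following is a minimal sketch of that idea, assuming the harness records every token it emits and that token ids are `u32`; `RebindPrompt` and `build_rebind_prompt` are hypothetical helpers, and the assumption that `max_emitted_tokens` counts tokens across both attempts is an application policy choice, not something the scheduler is known to enforce.

```rust
/// Prompt and remaining token budget for a re-submitted request
/// (hypothetical type, not part of the scheduler API).
#[derive(Debug, PartialEq, Eq)]
pub struct RebindPrompt {
    pub token_ids: Vec<u32>,
    pub remaining_budget: usize,
}

/// Build the rebind prompt as the original prompt followed by the
/// tokens already emitted, so decoding resumes where it stopped.
/// The emitted tokens are charged against the original budget.
pub fn build_rebind_prompt(
    original_prompt: &[u32],
    emitted: &[u32],
    max_emitted_tokens: usize,
) -> RebindPrompt {
    let mut token_ids = Vec::with_capacity(original_prompt.len() + emitted.len());
    token_ids.extend_from_slice(original_prompt);
    token_ids.extend_from_slice(emitted);
    RebindPrompt {
        token_ids,
        // Never underflow if the tenant had already used its budget.
        remaining_budget: max_emitted_tokens.saturating_sub(emitted.len()),
    }
}
```

The trade-off is prefill cost: the replayed tokens are processed again as prompt, so a long emit history makes the rebind slower than re-submitting the original prompt alone.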

Isolation and Safety

The engine itself doesn’t track per-tenant lease provenance; the application’s broker layer does. The harness reads the PromptRequest.tenant_id from the in-flight active set and surfaces the affected tenant on revocation.
Cross-tenant memory access is structurally prevented by Recipe 66’s block-table indirection: even during rebind, surviving tenants’ KV slots are read through their own block tables.
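Since the harness, not the engine, owns the request-to-tenant mapping, it needs some index over the in-flight set. A minimal sketch follows, assuming `u64` ids for illustration; `InFlightIndex` is a hypothetical harness-side type, and the real `RequestId` and `tenant_id` types may differ.

```rust
use std::collections::HashMap;

/// Harness-side index of in-flight requests (hypothetical type).
/// Maps request id -> tenant id so a failure on one request can be
/// attributed to the tenant that submitted it.
#[derive(Default)]
pub struct InFlightIndex {
    by_request: HashMap<u64, u64>,
}

impl InFlightIndex {
    /// Call when a request is submitted to the scheduler.
    pub fn track(&mut self, request_id: u64, tenant_id: u64) {
        self.by_request.insert(request_id, tenant_id);
    }

    /// Call when a request completes; returns the owning tenant, if known.
    pub fn untrack(&mut self, request_id: u64) -> Option<u64> {
        self.by_request.remove(&request_id)
    }

    /// Which tenant submitted this request?
    pub fn tenant_of(&self, request_id: u64) -> Option<u64> {
        self.by_request.get(&request_id).copied()
    }

    /// All in-flight requests for one tenant, in ascending id order.
    pub fn requests_for(&self, tenant_id: u64) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .by_request
            .iter()
            .filter(|(_, t)| **t == tenant_id)
            .map(|(r, _)| *r)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```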

Walkthrough (Implementation Sketch)

1. Drive the Scheduler Step, Catch Failure Events

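use grafos_inference_engine::continuous_batch::{
    CompletionReason, ContinuousBatchScheduler, PromptRequest,
    SchedulerError, SchedulerEvent,
};
// `ApplicationBroker`, `ServingError`, `RebindIntent`, and the helpers
// `emit`, `surface_rejection`, and `classify_and_recover_step_failure`
// are application-defined and not shown in this sketch.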

async fn serve_step<E>(
    scheduler: &mut ContinuousBatchScheduler<E>,
    broker: &dyn ApplicationBroker,
) -> Result<(), ServingError>
where
    E: /* InferenceEngine impl bounds */,
{
    let events = match scheduler.step().await {
        Ok(events) => events,
        Err(SchedulerError::EngineDecodeFailed(engine_err)) => {
            // The whole step failed (CUDA driver / intra-kernel
            // revoke). No sequence's `current_len` was mutated.
            return classify_and_recover_step_failure(broker, engine_err).await;
        }
        Err(SchedulerError::NoModelLoaded) => {
            return Err(ServingError::NoModelLoaded);
        }
    };
    for event in events {
        match event {
            SchedulerEvent::Token { request, token, position } => {
                emit(request, token, position);
            }
            SchedulerEvent::Completed { request, reason } => {
                if let Some(intent) = should_rebind_after(reason) {
                    rebind_request(scheduler, broker, request, intent).await?;
                }
            }
            SchedulerEvent::AdmissionRejected { request, reason, detail } => {
                surface_rejection(request, reason, detail);
            }
        }
    }
    Ok(())
}

2. Decide Whether a Completion Is a Rebind Trigger

fn should_rebind_after(reason: CompletionReason) -> Option<RebindIntent> {
    // Application policy — not every CompletionReason warrants a
    // retry. EOS / MaxTokens are normal terminations; revocation
    // and CUDA fatal are recovery triggers.
    match reason {
        CompletionReason::EndOfSequence => None,
        CompletionReason::MaxTokensReached => None,
        CompletionReason::EngineError(_) => Some(RebindIntent::ReSubmit),
        // Other variants are application-defined.
        _ => None,
    }
}

3. Re-Submit the Tenant Through the Scheduler

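async fn rebind_request<E>(
    scheduler: &mut ContinuousBatchScheduler<E>,
    broker: &dyn ApplicationBroker,
    request_id: RequestId,
    // `ReSubmit` is the only intent this sketch produces, so it is not inspected.
    _intent: RebindIntent,
) -> Result<(), ServingError>
where
    E: /* InferenceEngine impl bounds */,
{
    // Look up what the tenant originally asked for. The broker layer,
    // not the engine, retains this context.
    let context = broker.context_for(request_id);

    // Build a fresh request with a new id. The prompt is re-submitted
    // in full because KV state does not survive the rebind.
    let fresh_request = PromptRequest {
        tenant_id: context.tenant_id,
        request_id: RequestId::new(),
        token_ids: context.original_prompt.clone(),
        sampling: context.sampling.clone(),
        max_emitted_tokens: context.max_emitted_tokens,
    };

    // Re-admission goes through the scheduler's normal admission gate.
    scheduler.submit(fresh_request)?;
    Ok(())
}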

The scheduler’s admission gate enforces the tenant’s max_concurrent quota — so a misbehaving tenant that keeps crashing can’t monopolize the rebind path.
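To make the quota behavior concrete, here is a toy model of a per-tenant concurrency gate. It is a sketch of the idea only, not the scheduler's implementation: `QuotaGate`, `Admission`, and the `u64` tenant id are hypothetical, and the real admission gate may track more state.

```rust
use std::collections::HashMap;

/// Outcome of an admission attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Admitted,
    QuotaExceeded,
}

/// Toy per-tenant concurrency gate (hypothetical type).
pub struct QuotaGate {
    max_concurrent: usize,
    in_flight: HashMap<u64, usize>,
}

impl QuotaGate {
    pub fn new(max_concurrent: usize) -> Self {
        Self { max_concurrent, in_flight: HashMap::new() }
    }

    /// Admit the request only if the tenant is under its quota.
    pub fn try_admit(&mut self, tenant_id: u64) -> Admission {
        let limit = self.max_concurrent;
        let count = self.in_flight.entry(tenant_id).or_insert(0);
        if *count >= limit {
            return Admission::QuotaExceeded;
        }
        *count += 1;
        Admission::Admitted
    }

    /// Free one slot when a request completes or is revoked.
    pub fn release(&mut self, tenant_id: u64) {
        if let Some(count) = self.in_flight.get_mut(&tenant_id) {
            *count = count.saturating_sub(1);
        }
    }
}
```

In this model a rebind can only be admitted after the revoked request's slot has been released, which is why repeated crashes by one tenant cannot occupy more than that tenant's own quota.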

Verification

This recipe composes three primitives whose individual contracts are verified by their own pins:

Intra-kernel revoke (Recipe 63): the engine surfaces IntraKernelRevoke at the next kernel-launch boundary after the broker flips a revocation token.
FENCED state (Recipe 64): after revocation, subsequent ops against the dropped lease return MissingWeight rather than silently succeeding.
Multi-tenant batched decode (Recipe 66): surviving tenants continue to produce correct tokens in a batched forward pass after one tenant is removed from the active pool.

The recovery sequence in this recipe is a pure application-layer composition of those three. The ContinuousBatchScheduler’s step() returns typed SchedulerError variants for engine-level failures and per-request SchedulerEvent::Completed events for in-flight failures — the same surface a non-revocation completion path uses.

A dedicated end-to-end test that exercises the full rebind sequence (continuous_batch_scheduler_recovers_one_tenant_under_revocation) is filed as a follow-up against the continuous_batch test suite.

Failure Modes

Misclassification of failure scope. If the harness treats an engine-scoped failure as tenant-scoped, the rebind succeeds (cheap) but the next scheduler.step() returns the same error immediately. Mitigation: classify by inspecting the application-side broker state for whether the revoked lease was the model’s or the tenant’s, before invoking the rebind path.
Cascading rebind storms. If a tenant’s lease keeps getting revoked (e.g., it’s hitting quota every cycle), the harness rebinds every few seconds. The contract is correct but the observable behavior is bad. Mitigation: a per-tenant rebind rate limiter at the harness layer.
Lost KV state. Rebind starts the tenant’s KV cache fresh, so the cached state for prior decoded tokens is gone. The tenant must re-submit the full original prompt (or a checkpointed prefix). For interactive workloads where the prior emit history matters semantically (e.g., a chat session mid-turn), the harness should retain the tokens it has already emitted and append them to the original prompt when building the rebind prompt.
Pool exhaustion on rebind. The new sequence handle needs a KV pool slot. If the pool is full, the scheduler emits SchedulerEvent::AdmissionRejected with reason PoolExhausted. The recovery path must handle this, typically by evicting another tenant (Recipe 26’s preemption pattern) or by waiting for the next pool turnover.
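The rate-limiter mitigation for rebind storms can be sketched as a per-tenant sliding window. This is one possible design under stated assumptions, not a prescribed one: `RebindLimiter` is a hypothetical harness-side type, tenant ids are `u64` for illustration, and the caller passes in the current time so the limiter stays easy to test.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Per-tenant sliding-window limiter for rebind attempts
/// (hypothetical harness-side type).
pub struct RebindLimiter {
    max_rebinds: usize,
    window: Duration,
    history: HashMap<u64, VecDeque<Instant>>,
}

impl RebindLimiter {
    pub fn new(max_rebinds: usize, window: Duration) -> Self {
        Self { max_rebinds, window, history: HashMap::new() }
    }

    /// Returns true and records the attempt if the tenant has made
    /// fewer than `max_rebinds` attempts within the last `window`.
    pub fn allow(&mut self, tenant_id: u64, now: Instant) -> bool {
        let max = self.max_rebinds;
        let window = self.window;
        let stamps = self.history.entry(tenant_id).or_default();

        // Discard attempts that have aged out of the window.
        while let Some(&oldest) = stamps.front() {
            if now.duration_since(oldest) >= window {
                stamps.pop_front();
            } else {
                break;
            }
        }

        if stamps.len() >= max {
            return false;
        }
        stamps.push_back(now);
        true
    }
}
```

A harness would call `allow` before re-submitting; on `false` it can surface the failure to the tenant or delay the rebind, rather than re-submitting immediately.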

Observability

Pair each rebind with a structured log + metric: tenant id, old request id, new request id, prompt length, rebind latency. The operator dashboard surfaces:

Rebind rate per tenant (high rate → cascading-rebind storm).
Rebind latency p50 / p99: should be sub-second for tenant-scoped recoveries, and multi-second only for engine-scoped ones (which involve a model re-acquire and fall outside this recipe’s scope).
Concurrent live-tenant count — should NOT drop when a single tenant rebinds.
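The log record and latency percentiles listed above might look like the following. This is a sketch only: `RebindRecord`, its field types, the key=value log format, and the nearest-rank percentile method are all choices made for illustration, not an existing metrics API.

```rust
use std::time::Duration;

/// One structured record per rebind (hypothetical type).
#[derive(Debug, Clone)]
pub struct RebindRecord {
    pub tenant_id: u64,
    pub old_request_id: u64,
    pub new_request_id: u64,
    pub prompt_len: usize,
    pub latency: Duration,
}

impl RebindRecord {
    /// Render as a single key=value log line.
    pub fn to_log_line(&self) -> String {
        format!(
            "event=rebind tenant_id={} old_request_id={} new_request_id={} prompt_len={} latency_ms={}",
            self.tenant_id,
            self.old_request_id,
            self.new_request_id,
            self.prompt_len,
            self.latency.as_millis()
        )
    }
}

/// Nearest-rank percentile over recorded latencies.
/// `pct` is clamped to 1..=100; returns None when there are no records.
pub fn latency_percentile(records: &[RebindRecord], pct: u32) -> Option<Duration> {
    if records.is_empty() {
        return None;
    }
    let mut latencies: Vec<Duration> = records.iter().map(|r| r.latency).collect();
    latencies.sort();
    let pct = pct.clamp(1, 100) as usize;
    // 1-based rank = ceil(pct / 100 * n), computed in integers.
    let rank = (pct * latencies.len() + 99) / 100;
    Some(latencies[rank - 1])
}
```

In production these values would more likely feed a histogram in the existing metrics pipeline; the exact percentile computation here is only meant to show what the dashboard numbers represent.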

Variations

Stateful rebind (checkpoint + resume). Preserve a tenant’s KV cache by writing it to fabric-leased memory before the rebind, then replaying it into the fresh sequence handle. Costs bandwidth (full KV checkpoint) but preserves session semantics across rebind.
Pre-emptive rebind. Watch the lease’s TTL countdown; rebind proactively at TTL_REMAINING < threshold instead of reactively on revocation. Trades unnecessary work for zero user-visible rebind latency.
Engine-pool rebinding. When an engine-scoped failure fires, the harness can route the entire workload to a backup engine on a different cell (Recipe 39’s cross-cloud pipeline pattern adapted to inference). The model lease is acquired on the backup before the primary is fully torn down.
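As an illustration of the pre-emptive variation, the threshold check can be written as a pure function over the remaining TTLs. This is a minimal sketch: how the harness learns each lease's remaining TTL is not covered by this recipe, so the `(tenant_id, ttl_remaining)` input, the `u64` tenant id, and the function name `tenants_to_rebind` are assumptions.

```rust
use std::time::Duration;

/// Given each tenant's remaining lease TTL, return the tenants whose
/// TTL has dropped below `threshold`, most urgent first.
/// The (tenant_id, ttl_remaining) pairs are assumed to come from the
/// application's broker layer.
pub fn tenants_to_rebind(leases: &[(u64, Duration)], threshold: Duration) -> Vec<u64> {
    let mut due: Vec<(u64, Duration)> = leases
        .iter()
        .copied()
        .filter(|(_, ttl)| *ttl < threshold)
        .collect();
    // Smallest remaining TTL first, so the lease closest to expiry
    // is rebound before the others.
    due.sort_by_key(|(_, ttl)| *ttl);
    due.into_iter().map(|(tenant_id, _)| tenant_id).collect()
}
```

Choosing the threshold is the real design decision: it should exceed the expected rebind latency, otherwise the lease can still expire while the proactive rebind is in progress.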

Why This Is Recipe 67

Recipes 63 + 64 cover stopping cleanly. Recipe 67 covers continuing cleanly. Together they make fabric-leased inference operationally robust: a misbehaving tenant doesn’t take down the shared engine; it rebinds and the others don’t notice.