Recipe 68: Continuous Batching With Per-Tenant Quotas
Situation
Recipe 66 establishes the kernel-level primitive: N tenants share one decode forward. But production serving is messier than “load N prompts upfront and decode them together.” In reality:
- Requests arrive dynamically — at any moment, some tenants are in prefill, some are mid-decode, some are about to complete.
- Tenants have different priorities — a paid-tier tenant should not starve while a free-tier tenant takes the entire batch.
- The GPU’s batch size is bounded — when more requests arrive than the batch holds, the scheduler must choose who decodes now and who waits.
Continuous batching is the standard pattern for handling this churn. The scheduler:
- Holds a pending-prompt queue.
- On each decode tick, packs ready sequences into one decode_step call, up to max_batch_size.
- Enforces per-tenant quotas: max_concurrent (active sequence count) and max_blocks (KV cache blocks held).
- Admits new prompts from the queue as ongoing sequences complete.
The contract: every tenant gets fair access bounded by its quota, no tenant starves, and the batched forward stays correct per Recipe 66.
What You Build
A ContinuousBatchScheduler around the engine that implements
the dynamic admission + quota loop. Tenants submit prompts at
any time; the scheduler routes them into ongoing batched decodes;
per-tenant concurrent decode counts are bounded by
TenantConfig.max_concurrent and per-tenant KV usage is bounded
by TenantConfig.max_blocks; the no-starvation invariant holds
for every registered tenant.
Building Blocks
- grafos_inference_engine::continuous_batch::ContinuousBatchScheduler: owns the queue, the engine handle, and the decode tick.
- grafos_inference_engine::continuous_batch::TenantConfig: per-tenant policy (id, max_concurrent, max_blocks, weight).
- grafos_inference_engine::continuous_batch::PromptRequest: one admitted request (tenant id, request id, prompt token ids, sampling params, max emitted tokens).
- grafos_inference_engine::continuous_batch::BatchSchedulerConfig: scheduler-wide policy (max batch size, max pending queue depth, prefill strategy).
- grafos_inference_engine::continuous_batch::PrefillStrategy::BlockingBeforeDecode: the prefill scheduling mode shipping today (chunked-prefill is filed as a follow-up).
- grafos_inference_engine::continuous_batch::SchedulerEvent: the typed event stream returned from step(): Token { request, token, position }, Completed { request, reason }, AdmissionRejected { request, reason, detail }.
- grafos_inference_engine::continuous_batch::QuotaKind: the discriminator on quota-exceeded admission rejections (ConcurrentRequests or KvBlocks).
- grafos_inference_engine::continuous_batch::CompletionReason: why a request was reaped (end of sequence, max tokens reached, engine error, etc.).
- Recipe 66’s decode_step substrate: the kernel-level batched forward this scheduler invokes per tick.
Design
Resource Model
The scheduler owns:
| Resource | Provenance | Scope |
|---|---|---|
| engine: Arc<AsyncMutex<FabricNativeCudaEngine>> | one per scheduler instance | engine-wide model weights, shared K/V pool |
| model_handle: FabricNativeCudaModelHandle | one per loaded model | stable across the scheduler’s lifetime |
| Per-tenant TenantConfig registry | application provides | quota policy lookup at admit time |
| Per-tenant accounting (active_count, allocated_blocks) | scheduler maintains | enforced against max_concurrent / max_blocks |
| BatchSchedulerConfig.max_pending_prompts | global queue cap | rejects with QueueFull when reached |
| BatchSchedulerConfig.max_batch_size | global batch cap | per-tick admission target |
Each admitted request maps to one sequence handle on the engine.
On completion or eviction, the handle is dropped (KV slot returns
to the pool automatically via Drop).
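The release-on-Drop behaviour described above can be sketched with a small RAII guard. This is a minimal illustration, not the engine's implementation: SequenceLease, TenantUsage, Ledger, and usage_of are hypothetical names standing in for the real sequence handle and the scheduler's accounting table.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-tenant counters; the real scheduler's fields may differ.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct TenantUsage {
    active_count: u32,
    allocated_blocks: u32,
}

/// Shared accounting table keyed by tenant id (an assumption of this sketch).
type Ledger = Arc<Mutex<HashMap<String, TenantUsage>>>;

/// Stand-in for a sequence handle: while it is alive, the tenant is charged.
struct SequenceLease {
    ledger: Ledger,
    tenant: String,
    blocks: u32,
}

impl SequenceLease {
    /// Charge the tenant for one active sequence and `blocks` KV blocks.
    fn acquire(ledger: &Ledger, tenant: &str, blocks: u32) -> SequenceLease {
        let mut table = ledger.lock().unwrap();
        let usage = table.entry(tenant.to_string()).or_default();
        usage.active_count += 1;
        usage.allocated_blocks += blocks;
        SequenceLease {
            ledger: Arc::clone(ledger),
            tenant: tenant.to_string(),
            blocks,
        }
    }
}

impl Drop for SequenceLease {
    /// Return the charge exactly once, however the request ends.
    fn drop(&mut self) {
        let mut table = self.ledger.lock().unwrap();
        if let Some(usage) = table.get_mut(&self.tenant) {
            usage.active_count = usage.active_count.saturating_sub(1);
            usage.allocated_blocks = usage.allocated_blocks.saturating_sub(self.blocks);
        }
    }
}

/// Read a tenant's current usage (all zeros if the tenant was never charged).
fn usage_of(ledger: &Ledger, tenant: &str) -> TenantUsage {
    ledger.lock().unwrap().get(tenant).copied().unwrap_or_default()
}
```

Tying the release to Drop means completion, eviction, and error paths all return the quota through the same code, which is the property the retry failure mode later in this recipe depends on.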
Admission Algorithm (Weighted-Fair)
On each step() call:
- Sweep the active set: for each in-flight request, observe its most recent decoded token; emit a SchedulerEvent::Token; check completion conditions (EOS, max_emitted_tokens, engine-reported error).
- Reap completed requests: emit SchedulerEvent::Completed; release the tenant’s active_count and allocated_blocks.
- Admit from the pending queue while the total number of active sequences is below max_batch_size. Admission is weighted-fair across tenants: TenantConfig.weight controls the deficit-round-robin admission rate (weight=2.0 admits at ~2× the rate of weight=1.0).
- For each admission candidate, check per-tenant quotas. Reject with AdmissionRejection::TenantQuotaExceeded { which: QuotaKind } if active_count >= max_concurrent or allocated_blocks + estimate > max_blocks.
- Call engine.decode_step(active_handles, samplings) once for the entire admitted batch (Recipe 66’s substrate).
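The weighted-fair admission step can be sketched as a credit loop in the style of deficit round robin. This is an illustration under stated assumptions, not the scheduler's actual code: DrrAdmitter and TenantQueue are hypothetical types, every admission is assumed to cost one credit, and quota checks are left out so the fairness logic stays visible.

```rust
use std::collections::VecDeque;

/// Hypothetical per-tenant admission state.
struct TenantQueue {
    id: String,
    weight: f64,            // credit earned per replenish round; must be > 0
    deficit: f64,           // unspent credit carried between calls
    pending: VecDeque<u64>, // request ids, FIFO within the tenant
}

struct DrrAdmitter {
    tenants: Vec<TenantQueue>,
}

impl DrrAdmitter {
    fn new() -> Self {
        DrrAdmitter { tenants: Vec::new() }
    }

    fn register(&mut self, id: &str, weight: f64) {
        assert!(weight > 0.0, "a non-positive weight would never earn credit");
        self.tenants.push(TenantQueue {
            id: id.to_string(),
            weight,
            deficit: 0.0,
            pending: VecDeque::new(),
        });
    }

    fn submit(&mut self, id: &str, request: u64) {
        if let Some(t) = self.tenants.iter_mut().find(|t| t.id == id) {
            t.pending.push_back(request);
        }
    }

    /// Admit up to `free_slots` requests; each admission costs 1.0 credit.
    fn admit(&mut self, free_slots: usize) -> Vec<(String, u64)> {
        let mut admitted = Vec::new();
        loop {
            // Spend phase: use banked credit first, so a tenant skipped when
            // the batch filled up is served before anyone earns new credit.
            for t in self.tenants.iter_mut() {
                while t.deficit >= 1.0 && admitted.len() < free_slots {
                    match t.pending.pop_front() {
                        Some(request) => {
                            t.deficit -= 1.0;
                            admitted.push((t.id.clone(), request));
                        }
                        None => {
                            t.deficit = 0.0; // idle tenants do not bank credit
                            break;
                        }
                    }
                }
            }
            let any_pending = self.tenants.iter().any(|t| !t.pending.is_empty());
            if admitted.len() >= free_slots || !any_pending {
                break;
            }
            // Replenish phase: every tenant with work earns `weight` credits.
            for t in self.tenants.iter_mut() {
                if !t.pending.is_empty() {
                    t.deficit += t.weight;
                }
            }
        }
        admitted
    }
}
```

With tenant-a at weight 2.0 and tenant-b at weight 1.0, both backlogged, six free slots go four to tenant-a and two to tenant-b. When only one slot frees per tick, the order settles into a, a, b, so the lower-weight tenant keeps making progress.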
Isolation and Safety
- Per-tenant quotas isolate compute share between tenants. A tenant that submits 100 prompts but has max_concurrent=2 gets at most 2 admitted at a time; the remaining 98 wait in the queue.
- Per-sequence block-table indirection (Recipe 66) isolates KV cache reads at the kernel level: a quota breach at admit time is a refusal, not a leak.
- Per-tenant lease revocation (Recipes 63 + 64) isolates failure modes: one tenant’s broker-revoke surfaces as that tenant’s SchedulerEvent::Completed { reason }; other tenants continue.
- The scheduler’s Drop ensures all active handles are released cleanly on shutdown.
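The admit-time quota gate behind the first bullet can be sketched as a pure function. This is a hedged illustration: TenantQuota, check_quota, and estimate_blocks are hypothetical names, the QuotaKind enum here is a local stand-in mirroring the variant names listed under Building Blocks, and the block estimate (prompt length plus max emitted tokens, rounded up to whole blocks) is an assumption, since the recipe does not say how the estimate is computed.

```rust
/// Local stand-in mirroring the recipe's QuotaKind variant names.
#[derive(Debug, PartialEq)]
enum QuotaKind {
    ConcurrentRequests,
    KvBlocks,
}

/// Hypothetical view of one tenant's limits and current usage.
struct TenantQuota {
    max_concurrent: u32,
    max_blocks: u32,
    active_count: u32,
    allocated_blocks: u32,
}

/// Assumed estimate: enough KV blocks for the prompt plus every token the
/// request is allowed to emit, rounded up to a whole block.
fn estimate_blocks(prompt_tokens: u32, max_emitted_tokens: u32, block_size: u32) -> u32 {
    (prompt_tokens + max_emitted_tokens + block_size - 1) / block_size
}

/// Refuse the admission if either quota would be breached.
fn check_quota(quota: &TenantQuota, estimate: u32) -> Result<(), QuotaKind> {
    if quota.active_count >= quota.max_concurrent {
        return Err(QuotaKind::ConcurrentRequests);
    }
    if quota.allocated_blocks + estimate > quota.max_blocks {
        return Err(QuotaKind::KvBlocks);
    }
    Ok(())
}
```

Because the check runs before any KV block is allocated, a breach never touches another tenant's blocks; the caller only has to decide whether the refused request is rejected or left in the queue.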
Walkthrough (Implementation Sketch)
1. Configure the Scheduler
    use grafos_inference_engine::continuous_batch::{
        BatchSchedulerConfig, ContinuousBatchScheduler, PrefillStrategy,
        PromptRequest, RequestId, TenantConfig, TenantId,
    };

    let scheduler_cfg = BatchSchedulerConfig {
        max_batch_size: 16,       // GPU forward batch upper bound
        max_pending_prompts: 256, // queue capacity
        prefill_strategy: PrefillStrategy::BlockingBeforeDecode,
        // ... other config fields
    };

    let mut sched = ContinuousBatchScheduler::new(
        engine_arc,
        model_handle,
        scheduler_cfg,
    );

    // Register tenants with their quotas.
    sched.register_tenant(
        TenantConfig::new("tenant-a")
            .with_max_concurrent(2)
            .with_max_blocks(32)
            .with_weight(2.0), // paid tier, 2× admit rate
    );
    sched.register_tenant(
        TenantConfig::new("tenant-b")
            .with_max_concurrent(2)
            .with_max_blocks(32)
            .with_weight(1.0), // free tier
    );

2. Submit Requests Across Many Tenants
    sched.submit(PromptRequest {
        tenant_id: TenantId("tenant-a".into()),
        request_id: RequestId::new(),
        token_ids: prompt_ids_1,
        sampling: SamplingParams::greedy(64),
        max_emitted_tokens: 64,
    })?;
    // ... arbitrary submissions interleaved across tenants

3. Drive the Tick Loop
    use grafos_inference_engine::continuous_batch::{
        CompletionReason, SchedulerEvent,
    };

    loop {
        let events = sched.step().await?;
        for event in events {
            match event {
                SchedulerEvent::Token { request, token, position } => {
                    // Forward to per-request output sink.
                    forward(request, token, position);
                }
                SchedulerEvent::Completed { request, reason } => {
                    match reason {
                        CompletionReason::EndOfSequence => mark_done(request),
                        CompletionReason::MaxTokensReached => mark_done(request),
                        CompletionReason::EngineError(e) => surface_error(request, e),
                        // ... other reasons
                    }
                }
                SchedulerEvent::AdmissionRejected { request, reason, detail } => {
                    // Tenant hit a quota or the queue is full.
                    surface_rejection(request, reason, detail);
                }
            }
        }
    }

4. Inspect Per-Tenant State
    use grafos_inference_engine::continuous_batch::TenantStateSnapshot;

    let snap: TenantStateSnapshot = sched.tenant_state("tenant-a".into())
        .expect("registered");
    // snap.active_count, snap.allocated_blocks, snap.max_concurrent,
    // snap.max_blocks, snap.weight
    metrics.record_tenant_state(snap);

Verification
Recipe 66’s batched-decode equivalence is the kernel-level correctness pin (batched output is byte-identical to per-tenant serial output for N=4). Recipe 68 layers admission + quota enforcement on top.
The scheduler’s no-starvation invariant is exercised by
cuda_engine_e2e_multi_tenant_4_tenants_at_quota_2_no_starvation
in the engine’s e2e tests: 4 tenants at max_concurrent=2,
every tenant eventually admitted, no tenant permanently blocked.
Run the pin yourself (requires the continuous-batch feature):
    FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    cargo test --release \
      -p grafos-inference-engine \
      --features cuda,continuous-batch,test-helpers \
      --test cuda_engine_e2e \
      -- --ignored --nocapture --test-threads=1 \
      cuda_engine_e2e_multi_tenant_4_tenants_at_quota_2_no_starvation

Failure Modes
- Quota mis-accounting on retry. If a tenant’s request fails mid-decode and is retried (Recipe 67’s rebind), the scheduler must release the old quota slot before admitting the retry. Otherwise the tenant’s active_count drifts above its cap. The scheduler’s Drop on the sequence handle handles this; the rebind path must not double-acquire.
- Priority inversion. A low-priority tenant can hold a long-context request whose forward pass dominates wall-clock, starving high-priority tenants. Mitigation: scheduler-level preemption of long-running low-priority requests (Recipe 26).
- Pool fragmentation. Tenants with different context lengths fragment the KV pool such that the next admission’s block needs become unsatisfiable even when totals look fine. Pool compaction during quiescent ticks mitigates.
- Prefill blocking decode. With PrefillStrategy::BlockingBeforeDecode, a new tenant’s long prefill stalls every other tenant’s next decode tick because the engine mutex is held across both submit_prompt and decode_step. A chunked-prefill strategy that interleaves prefill chunks with decode steps is filed as a follow-up; the blocking strategy ships today.
Observability
Each step() returns a slice of SchedulerEvents; the operator
dashboard aggregates them into per-tenant counters:
queue depth, admitted count, decoded-token rate, rejection
count by AdmissionRejection reason, time-to-first-token p50/p99.
TenantStateSnapshot exposes the live active_count /
allocated_blocks / quota config for any registered tenant on
demand — suitable for a scrape endpoint.
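Folding events into per-tenant counters might look like the sketch below. The Event and RejectReason enums are simplified local stand-ins, not the crate's SchedulerEvent: they carry a tenant id directly, on the assumption that the caller has already resolved each event's request to its tenant. TenantCounters and aggregate are hypothetical names.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the rejection reasons surfaced by the scheduler.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum RejectReason {
    QueueFull,
    TenantQuotaExceeded,
}

/// Simplified stand-in for SchedulerEvent, keyed by tenant id.
enum Event {
    Token { tenant: String },
    Completed { tenant: String },
    AdmissionRejected { tenant: String, reason: RejectReason },
}

/// Counters a dashboard could scrape per tenant.
#[derive(Debug, Default, PartialEq)]
struct TenantCounters {
    tokens: u64,
    completed: u64,
    rejections: HashMap<RejectReason, u64>,
}

/// Fold one tick's events into the running per-tenant counters.
fn aggregate(events: &[Event], counters: &mut HashMap<String, TenantCounters>) {
    for event in events {
        match event {
            Event::Token { tenant } => {
                counters.entry(tenant.clone()).or_default().tokens += 1;
            }
            Event::Completed { tenant } => {
                counters.entry(tenant.clone()).or_default().completed += 1;
            }
            Event::AdmissionRejected { tenant, reason } => {
                let per_tenant = counters.entry(tenant.clone()).or_default();
                *per_tenant.rejections.entry(reason.clone()).or_insert(0) += 1;
            }
        }
    }
}
```

Rates such as decoded tokens per second, and latency percentiles such as time-to-first-token, need timestamps on top of these counts; they are omitted here to keep the sketch small.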
Variations
- Speculative decode per request. Recipe 65’s draft+target pattern layered under this scheduler — each admitted request internally runs draft+target. Throughput multiplies but quota accounting must charge target-side compute, not draft-side, to match user-perceived work.
- Heterogeneous models per tenant. Different tenants on different models on the same engine pool isn’t directly supported by this scheduler (one engine = one model). Routing per-tenant to per-model schedulers is the upstream layer (Recipe 51’s multi-cloud deploy pattern adapted to inference).
- Token-bucket quota refill. Static max_concurrent is the simplest case. A token-bucket refill (R requests/sec, B burst) fits the same shape with a per-tenant rate limiter at the admission gate.
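The token-bucket variation can be sketched as a small per-tenant limiter consulted at the admission gate. TokenBucket and try_admit are hypothetical names, not part of the scheduler's API; time is passed in explicitly as seconds so the behaviour is deterministic and easy to test.

```rust
/// Token bucket: refills at `rate` tokens per second, holding at most `burst`.
struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    /// Start with a full bucket so a quiet tenant can burst immediately.
    fn new(rate: f64, burst: f64, now_secs: f64) -> Self {
        TokenBucket {
            rate,
            burst,
            tokens: burst,
            last_refill_secs: now_secs,
        }
    }

    /// Try to admit one request at `now_secs`; false means rate limited.
    fn try_admit(&mut self, now_secs: f64) -> bool {
        // Refill for the elapsed time, ignoring clocks that move backwards.
        if now_secs > self.last_refill_secs {
            let elapsed = now_secs - self.last_refill_secs;
            self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
            self.last_refill_secs = now_secs;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With R = 2 requests/sec and B = 3, a tenant can submit three requests back to back, is refused on the fourth, and earns one more admission after half a second. A caller would keep one bucket per tenant and check it alongside max_concurrent and max_blocks.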
Why This Is Recipe 68
Recipe 66 demonstrates the kernel can batch N tenants. Recipe 68 demonstrates that the scheduler layer keeps the batch full under realistic request churn, with quota fairness preserved. Together they make production LLM serving on a fabric-leased GPU operationally feasible — not just a benchmarked primitive.