
Recipe 68: Continuous Batching With Per-Tenant Quotas

Situation

Recipe 66 establishes the kernel-level primitive: N tenants share one decode forward. But production serving is messier than “load N prompts upfront and decode them together.” In reality:

  • Requests arrive dynamically — at any moment, some tenants are in prefill, some are mid-decode, some are about to complete.
  • Tenants have different priorities — a paid-tier tenant should not starve while a free-tier tenant takes the entire batch.
  • The GPU’s batch size is bounded — when more requests arrive than the batch holds, the scheduler must choose who decodes now and who waits.

Continuous batching is the standard pattern for handling this churn. The scheduler:

  1. Holds a pending-prompt queue.
  2. On each decode tick, packs ready sequences into one decode_step call up to max_batch_size.
  3. Enforces per-tenant quotas — max_concurrent (active sequence count) and max_blocks (KV cache blocks held).
  4. Admits new prompts from the queue as ongoing sequences complete.

The contract: every tenant gets fair access bounded by its quota, no tenant starves, and the batched forward stays correct per Recipe 66.
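The four scheduler steps above can be modelled without any engine at all. The following is a minimal sketch, not the scheduler's implementation: `Seq` and `run_ticks` are hypothetical names, quotas are left out, and "decoding" just decrements a counter. It shows freed batch slots being refilled from the queue on the next tick.

```rust
use std::collections::VecDeque;

/// Hypothetical in-flight sequence: an id plus the number of tokens left to decode.
#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: u32,
    remaining: u32,
}

/// Toy continuous-batching loop. Each tick tops the batch up from the queue,
/// "decodes" one token per active sequence, then retires finished sequences.
/// Returns the ids that shared the batch on each tick.
fn run_ticks(mut pending: VecDeque<Seq>, max_batch_size: usize) -> Vec<Vec<u32>> {
    let mut batches = Vec::new();
    if max_batch_size == 0 {
        return batches; // nothing can ever be admitted
    }
    let mut active: Vec<Seq> = Vec::new();
    while !pending.is_empty() || !active.is_empty() {
        // Admit queued work into whatever batch slots are free.
        while active.len() < max_batch_size {
            match pending.pop_front() {
                Some(seq) => active.push(seq),
                None => break,
            }
        }
        batches.push(active.iter().map(|s| s.id).collect());
        // One batched decode step: every active sequence emits one token.
        for seq in active.iter_mut() {
            seq.remaining = seq.remaining.saturating_sub(1);
        }
        // Retire completed sequences so their slots free up for the next tick.
        active.retain(|seq| seq.remaining > 0);
    }
    batches
}
```

With three sequences of length 3, 1, and 2 and a batch size of 2, the third sequence takes over the slot vacated by the second, so the whole workload finishes in three full ticks rather than the five a static "load two, drain, load one" schedule would need.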

What You Build

A ContinuousBatchScheduler wrapped around the engine that implements the dynamic admission and quota loop. Tenants submit prompts at any time, and the scheduler routes them into ongoing batched decodes. Per-tenant concurrent decode counts are bounded by TenantConfig.max_concurrent, and per-tenant KV usage is bounded by TenantConfig.max_blocks. The no-starvation invariant holds for every registered tenant.

Building Blocks

  • grafos_inference_engine::continuous_batch::ContinuousBatchScheduler — owns the queue, the engine handle, and the decode tick.
  • grafos_inference_engine::continuous_batch::TenantConfig — per-tenant policy (id, max_concurrent, max_blocks, weight).
  • grafos_inference_engine::continuous_batch::PromptRequest — one admitted request (tenant id, request id, prompt token ids, sampling params, max emitted tokens).
  • grafos_inference_engine::continuous_batch::BatchSchedulerConfig — scheduler-wide policy (max batch size, max pending queue depth, prefill strategy).
  • grafos_inference_engine::continuous_batch::PrefillStrategy::BlockingBeforeDecode — the prefill scheduling mode shipping today (chunked-prefill is filed as a follow-up).
  • grafos_inference_engine::continuous_batch::SchedulerEvent — the typed event stream returned from step(): Token { request, token, position }, Completed { request, reason }, AdmissionRejected { request, reason, detail }.
  • grafos_inference_engine::continuous_batch::QuotaKind — the discriminator on quota-exceeded admission rejections (ConcurrentRequests or KvBlocks).
  • grafos_inference_engine::continuous_batch::CompletionReason — why a request was reaped (end of sequence, max tokens reached, engine error, etc.).
  • Recipe 66’s decode_step substrate — the kernel-level batched forward this scheduler invokes per tick.


Design

Resource Model

The scheduler owns:

| Resource | Provenance | Scope |
|---|---|---|
| engine: Arc<AsyncMutex<FabricNativeCudaEngine>> | one per scheduler instance | engine-wide model weights, shared K/V pool |
| model_handle: FabricNativeCudaModelHandle | one per loaded model | stable across the scheduler’s lifetime |
| Per-tenant TenantConfig registry | application provides | quota policy lookup at admit time |
| Per-tenant accounting (active_count, allocated_blocks) | scheduler maintains | enforced against max_concurrent / max_blocks |
| BatchSchedulerConfig.max_pending_prompts | global queue cap | rejects with QueueFull when reached |
| BatchSchedulerConfig.max_batch_size | global batch cap | per-tick admission target |

Each admitted request maps to one sequence handle on the engine. On completion or eviction, the handle is dropped (KV slot returns to the pool automatically via Drop).
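The release-on-Drop behaviour described above follows the usual Rust RAII shape. The sketch below illustrates that shape only: `KvPool`, `SequenceHandle`, `acquire`, and `free_blocks` are hypothetical stand-ins, not the engine's types, and the pool is reduced to a free-block counter.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical KV block pool that only tracks how many blocks are free.
struct KvPool {
    free_blocks: usize,
}

/// Hypothetical sequence handle that gives its blocks back when dropped.
struct SequenceHandle {
    pool: Arc<Mutex<KvPool>>,
    blocks: usize,
}

impl SequenceHandle {
    /// Reserve `blocks` from the pool, or return None if too few are free.
    fn acquire(pool: &Arc<Mutex<KvPool>>, blocks: usize) -> Option<SequenceHandle> {
        let mut guard = pool.lock().unwrap();
        if guard.free_blocks < blocks {
            return None;
        }
        guard.free_blocks -= blocks;
        Some(SequenceHandle { pool: Arc::clone(pool), blocks })
    }
}

impl Drop for SequenceHandle {
    fn drop(&mut self) {
        // Runs on completion, eviction, and early-return paths alike,
        // so no code path can forget to return the blocks.
        let mut guard = self.pool.lock().unwrap();
        guard.free_blocks += self.blocks;
    }
}

/// Read the current free-block count.
fn free_blocks(pool: &Arc<Mutex<KvPool>>) -> usize {
    pool.lock().unwrap().free_blocks
}
```

Tying the release to `Drop` rather than to an explicit call means an error path that abandons a request still returns its blocks, which is the property the text relies on.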

Admission Algorithm (Weighted-Fair)

On each step() call:

  1. Sweep the active set: for each in-flight request, observe its most recent decoded token; emit a SchedulerEvent::Token; check completion conditions (EOS, max_emitted_tokens, engine-reported error).
  2. Reap completed requests: emit SchedulerEvent::Completed; release the tenant’s active_count and allocated_blocks.
  3. Admit from the pending queue while the number of active sequences across all tenants is below max_batch_size. Admission is weighted-fair across tenants: TenantConfig.weight controls the deficit-round-robin admission rate (weight=2.0 admits at ~2× the rate of weight=1.0).
  4. For each admission candidate, check per-tenant quotas. Reject with AdmissionRejection::TenantQuotaExceeded { which: QuotaKind } if active_count >= max_concurrent or allocated_blocks + estimate > max_blocks.
  5. Call engine.decode_step(active_handles, samplings) once for the entire admitted batch (Recipe 66’s substrate).
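The weighted-fair part of step 3 can be illustrated with a textbook deficit-round-robin pass. This is a sketch under stated assumptions, not the scheduler's code: `TenantQueue` and `admit_drr` are hypothetical names, each admitted request costs exactly 1.0 of credit, idle tenants do not bank credit, and quota checks (step 4) are omitted.

```rust
/// Hypothetical per-tenant admission state for a deficit-round-robin pass.
struct TenantQueue {
    name: &'static str,
    weight: f64,
    pending: usize,
    deficit: f64,
}

/// Admit up to `slots` requests. Each round credits every backlogged tenant
/// with its weight; a tenant spends 1.0 of credit per admitted request.
fn admit_drr(tenants: &mut [TenantQueue], mut slots: usize) -> Vec<&'static str> {
    let mut admitted = Vec::new();
    // Tenants with a non-positive weight are never served, which also
    // guarantees the loop terminates.
    while slots > 0 && tenants.iter().any(|t| t.pending > 0 && t.weight > 0.0) {
        for t in tenants.iter_mut() {
            if t.pending == 0 || t.weight <= 0.0 {
                t.deficit = 0.0; // idle tenants do not accumulate credit
                continue;
            }
            t.deficit += t.weight;
            while t.deficit >= 1.0 && t.pending > 0 && slots > 0 {
                t.deficit -= 1.0;
                t.pending -= 1;
                slots -= 1;
                admitted.push(t.name);
            }
        }
    }
    admitted
}
```

With weights 2.0 and 1.0 and six free slots, this pass admits four requests for the first tenant and two for the second, matching the ~2× ratio described above.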

Isolation and Safety

  • Per-tenant quotas isolate compute share between tenants. A tenant that submits 100 prompts but has max_concurrent=2 gets at most 2 admitted at a time; the remaining 98 wait in the queue.
  • Per-sequence block-table indirection (Recipe 66) isolates KV cache reads at the kernel level — a quota breach at admit time is a refusal, not a leak.
  • Per-tenant lease revocation (Recipes 63 + 64) isolates failure modes — one tenant’s broker-revoke surfaces as that tenant’s SchedulerEvent::Completed { reason }; other tenants continue.
  • The scheduler’s Drop ensures all active handles are released cleanly on shutdown.
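The per-tenant quota gate from the first bullet can be sketched as a small accounting struct. The two conditions mirror the ones stated in the admission algorithm (active_count >= max_concurrent, allocated_blocks + estimate > max_blocks). `TenantAccount`, `try_admit`, and `release` are hypothetical names, and the local `QuotaKind` enum is a stand-in that borrows the variant names from this recipe rather than the crate's actual type. What the caller does with an `Err` (leave the request queued or surface a rejection) is outside this sketch.

```rust
/// Local stand-in for the quota discriminator; variant names follow the recipe.
#[derive(Debug, PartialEq)]
enum QuotaKind {
    ConcurrentRequests,
    KvBlocks,
}

/// Hypothetical per-tenant accounting record.
struct TenantAccount {
    max_concurrent: usize,
    max_blocks: usize,
    active_count: usize,
    allocated_blocks: usize,
}

impl TenantAccount {
    /// Charge one request that is estimated to need `estimate` KV blocks.
    /// Nothing is charged when either quota would be exceeded.
    fn try_admit(&mut self, estimate: usize) -> Result<(), QuotaKind> {
        if self.active_count >= self.max_concurrent {
            return Err(QuotaKind::ConcurrentRequests);
        }
        if self.allocated_blocks + estimate > self.max_blocks {
            return Err(QuotaKind::KvBlocks);
        }
        self.active_count += 1;
        self.allocated_blocks += estimate;
        Ok(())
    }

    /// Return one request's slot and blocks when it completes or is evicted.
    fn release(&mut self, blocks: usize) {
        self.active_count = self.active_count.saturating_sub(1);
        self.allocated_blocks = self.allocated_blocks.saturating_sub(blocks);
    }
}
```

Because the check happens before anything is charged, a refused request leaves the tenant's counters untouched.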

Walkthrough (Implementation Sketch)

1. Configure the Scheduler

use grafos_inference_engine::continuous_batch::{
    BatchSchedulerConfig, ContinuousBatchScheduler, PrefillStrategy,
    PromptRequest, RequestId, TenantConfig, TenantId,
};

let scheduler_cfg = BatchSchedulerConfig {
    max_batch_size: 16,       // GPU forward batch upper bound
    max_pending_prompts: 256, // queue capacity
    prefill_strategy: PrefillStrategy::BlockingBeforeDecode,
    // ... other config fields
};

let mut sched = ContinuousBatchScheduler::new(
    engine_arc,
    model_handle,
    scheduler_cfg,
);

// Register tenants with their quotas.
sched.register_tenant(
    TenantConfig::new("tenant-a")
        .with_max_concurrent(2)
        .with_max_blocks(32)
        .with_weight(2.0), // paid tier, 2× admit rate
);
sched.register_tenant(
    TenantConfig::new("tenant-b")
        .with_max_concurrent(2)
        .with_max_blocks(32)
        .with_weight(1.0), // free tier
);

2. Submit Requests Across Many Tenants

// SamplingParams must also be in scope; its module path is not shown in this recipe.
sched.submit(PromptRequest {
    tenant_id: TenantId("tenant-a".into()),
    request_id: RequestId::new(),
    token_ids: prompt_ids_1,
    sampling: SamplingParams::greedy(64),
    max_emitted_tokens: 64,
})?;
// ... arbitrary submissions interleaved across tenants

3. Drive the Tick Loop

use grafos_inference_engine::continuous_batch::{
    CompletionReason, SchedulerEvent,
};

loop {
    let events = sched.step().await?;
    for event in events {
        match event {
            SchedulerEvent::Token { request, token, position } => {
                // Forward to per-request output sink.
                forward(request, token, position);
            }
            SchedulerEvent::Completed { request, reason } => {
                match reason {
                    CompletionReason::EndOfSequence => mark_done(request),
                    CompletionReason::MaxTokensReached => mark_done(request),
                    CompletionReason::EngineError(e) => surface_error(request, e),
                    // ... other reasons
                }
            }
            SchedulerEvent::AdmissionRejected { request, reason, detail } => {
                // Tenant hit a quota or the queue is full.
                surface_rejection(request, reason, detail);
            }
        }
    }
}

4. Inspect Per-Tenant State

use grafos_inference_engine::continuous_batch::TenantStateSnapshot;

let snap: TenantStateSnapshot = sched.tenant_state("tenant-a".into())
    .expect("registered");
// snap.active_count, snap.allocated_blocks, snap.max_concurrent,
// snap.max_blocks, snap.weight
metrics.record_tenant_state(snap);

Verification

Recipe 66’s batched-decode equivalence is the kernel-level correctness pin (batched output is byte-identical to per-tenant serial output for N=4). Recipe 68 layers admission + quota enforcement on top.

The scheduler’s no-starvation invariant is exercised by cuda_engine_e2e_multi_tenant_4_tenants_at_quota_2_no_starvation in the engine’s e2e tests: 4 tenants at max_concurrent=2, every tenant eventually admitted, no tenant permanently blocked.
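The shape of the no-starvation property can be explored without a GPU. The following toy model is not the e2e test named above and does not use the engine: `first_admission_ticks` is a hypothetical helper, admission is plain (unweighted) round-robin with a rotating cursor, and every request is assumed to take the same number of ticks. It records the tick at which each tenant is first admitted.

```rust
/// Toy admission model. `pending[t]` is tenant t's queued request count.
/// Returns, per tenant, the first tick at which one of its requests was
/// admitted (None if it was never admitted within `max_ticks`).
fn first_admission_ticks(
    pending: &mut [usize],
    max_concurrent: usize,
    max_batch_size: usize,
    ticks_per_request: u32,
    max_ticks: u32,
) -> Vec<Option<u32>> {
    let n = pending.len();
    let mut first: Vec<Option<u32>> = vec![None; n];
    let mut active: Vec<(usize, u32)> = Vec::new(); // (tenant, ticks remaining)
    let mut cursor = 0usize; // tenant that gets first refusal on the next pass
    for tick in 0..max_ticks {
        let mut progressed = true;
        while progressed && active.len() < max_batch_size {
            progressed = false;
            let start = cursor;
            for offset in 0..n {
                let t = (start + offset) % n;
                let running = active.iter().filter(|entry| entry.0 == t).count();
                if active.len() < max_batch_size && pending[t] > 0 && running < max_concurrent {
                    pending[t] -= 1;
                    active.push((t, ticks_per_request));
                    if first[t].is_none() {
                        first[t] = Some(tick);
                    }
                    cursor = (t + 1) % n; // rotate so later tenants go first next time
                    progressed = true;
                }
            }
        }
        // Decode one token per active request, then retire finished ones.
        for entry in active.iter_mut() {
            entry.1 = entry.1.saturating_sub(1);
        }
        active.retain(|entry| entry.1 > 0);
    }
    first
}
```

With four tenants at max_concurrent=2 competing for a batch of two, the rotating cursor hands the freed slots to the tenants that have not yet run. Without the rotation, the first two tenants would win every pass in this model.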

Run the pin yourself (requires the continuous-batch feature):

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  cargo test --release \
    -p grafos-inference-engine \
    --features cuda,continuous-batch,test-helpers \
    --test cuda_engine_e2e \
    -- --ignored --nocapture --test-threads=1 \
    cuda_engine_e2e_multi_tenant_4_tenants_at_quota_2_no_starvation

Failure Modes

  • Quota mis-accounting on retry. If a tenant’s request fails mid-decode and is retried (Recipe 67’s rebind), the scheduler must release the old quota slot before admitting the retry; otherwise the tenant’s active_count drifts above its cap. Dropping the old sequence handle performs that release; the rebind path must not double-acquire.
  • Priority inversion. A low-priority tenant can hold a long-context request whose forward pass dominates wall-clock, starving high-priority tenants. Mitigation: scheduler-level preemption of long-running low-priority requests (Recipe 26).
  • Pool fragmentation. Tenants with different context lengths fragment the KV pool such that the next admission’s block needs become unsatisfiable even when the total free-block count looks sufficient. Pool compaction during quiescent ticks mitigates this.
  • Prefill blocking decode. With PrefillStrategy::BlockingBeforeDecode, a new tenant’s long prefill stalls every other tenant’s next decode tick because the engine mutex is held across both submit_prompt and decode_step. A chunked-prefill strategy that interleaves prefill chunks with decode steps is filed as a follow-up; the blocking strategy ships today.

Observability

Each step() returns a slice of SchedulerEvents; the operator dashboard aggregates them into per-tenant counters: queue depth, admitted count, decoded-token rate, rejection count by AdmissionRejection reason, time-to-first-token p50/p99.

TenantStateSnapshot exposes the live active_count / allocated_blocks / quota config for any registered tenant on demand — suitable for a scrape endpoint.
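Folding the event stream into per-tenant counters is a small reduction. The sketch below uses a simplified `Event` enum as a stand-in: the recipe's SchedulerEvent variants carry a `request` field, and mapping a request to its tenant is assumed to happen before this step. `TenantCounters` and `aggregate` are hypothetical names, and latency percentiles are left out.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the scheduler's event stream, already keyed by tenant.
enum Event {
    Token { tenant: &'static str },
    Completed { tenant: &'static str },
    Rejected { tenant: &'static str },
}

/// Hypothetical per-tenant counters a dashboard might scrape.
#[derive(Debug, Default, PartialEq)]
struct TenantCounters {
    tokens: u64,
    completed: u64,
    rejected: u64,
}

/// Fold one batch of events into per-tenant counters.
fn aggregate(events: &[Event]) -> HashMap<&'static str, TenantCounters> {
    let mut out: HashMap<&'static str, TenantCounters> = HashMap::new();
    for ev in events {
        match ev {
            Event::Token { tenant } => out.entry(*tenant).or_default().tokens += 1,
            Event::Completed { tenant } => out.entry(*tenant).or_default().completed += 1,
            Event::Rejected { tenant } => out.entry(*tenant).or_default().rejected += 1,
        }
    }
    out
}
```

A long-running collector would keep one such map across ticks and add each tick's result into it, rather than rebuilding it per call.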

Variations

  • Speculative decode per request. Recipe 65’s draft+target pattern can be layered under this scheduler, with each admitted request internally running draft+target. Throughput can improve, but quota accounting must charge target-side compute, not draft-side, to match user-perceived work.
  • Heterogeneous models per tenant. Running different tenants on different models in the same engine pool is not directly supported by this scheduler (one engine = one model). Routing each tenant to a per-model scheduler belongs to the upstream layer (Recipe 51’s multi-cloud deploy pattern adapted to inference).
  • Token-bucket quota refill. Static max_concurrent is the simplest case. A token-bucket refill (R requests/sec, B burst) fits the same shape with a per-tenant rate limiter at the admission gate.
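The token-bucket variation can be sketched as a per-tenant gate consulted before the quota checks. This is an illustration only: `TokenBucket` and `try_admit` are hypothetical names, not part of the scheduler, and timestamps are passed in explicitly (in seconds) so the example needs no clock.

```rust
/// Token bucket refilled at `rate_per_sec`, holding at most `burst` tokens.
struct TokenBucket {
    rate_per_sec: f64,
    burst: f64,
    tokens: f64,
    last_refill: f64,
}

impl TokenBucket {
    /// Start with a full bucket at time 0.0.
    fn new(rate_per_sec: f64, burst: f64) -> Self {
        TokenBucket { rate_per_sec, burst, tokens: burst, last_refill: 0.0 }
    }

    /// Refill for the time elapsed since the last call, then try to spend
    /// one token. Returns true if the request may proceed to admission.
    fn try_admit(&mut self, now: f64) -> bool {
        // Ignore timestamps that go backwards instead of refilling negatively.
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.burst);
        self.last_refill = self.last_refill.max(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With R = 2 requests/sec and B = 3, a tenant can submit three requests back to back, is refused the fourth, and regains one admission after half a second.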

Why This Is Recipe 68

Recipe 66 demonstrates the kernel can batch N tenants. Recipe 68 demonstrates that the scheduler layer keeps the batch full under realistic request churn, with quota fairness preserved. Together they make production LLM serving on a fabric-leased GPU operationally feasible — not just a benchmarked primitive.