Recipe 26: Multi-Tenant Compute With Preemption and Attestation
Situation
Multiple teams share a compute fabric. Without policy enforcement, one team can starve others. You need admission control, per-tenant quotas, preemption of lower-priority work, and attestation that nodes are running trusted firmware. Traditional schedulers are complex infrastructure projects with their own databases, APIs, and failure modes.
The grafOS scheduler provides this as a library. No separate scheduler service to operate. Capacity tracking, quota enforcement, placement scoring, preemption, and attestation verification are all in-process calls against data structures you own.
What You Build
A multi-tenant batch compute system where:
- CapacityLedger tracks per-node resources from GET_INVENTORY
- AdmissionController gates lease requests against available capacity
- QuotaManager enforces per-tenant resource limits
- PlacementScorer selects the best node for each job
- PreemptionManager evicts lower-priority work when high-priority jobs need resources
- AttestationVerifier ensures nodes are running trusted firmware before scheduling
- CapBroker mints scoped capability tokens for authorized access
- grafos-batch TaskGraph organizes work into DAG-structured jobs
Building Blocks
- grafos_scheduler::{CapacityLedger, AdmissionController, QuotaManager, PlacementScorer, PreemptionManager, CapBroker, AttestationVerifier} (source)
- grafos_scheduler::{Priority, TenantId, AccountingTag, Strategy, AdmissionDecision} (source)
- grafos_core::PreemptionReason: closed-set typed reason enum with snake_case as_str() wire vocabulary.
- grafos_audit::{assemble_record, AuditInput, AuditEventKind::Preempted}: hash-linked record per preemption decision carrying the typed reason via AuditInput::preemption_reason.
- grafos_batch::{TaskGraph, TaskDef, ResourceReq} (source)
- grafos_observe (source)
Design
Capacity Tracking
CapacityLedger maintains per-node totals (from GET_INVENTORY / ANNOUNCE), reserved capacity, and
leased capacity. When a node departs, its entry is removed and any leases on it are marked for preemption.

    let mut ledger = CapacityLedger::new();
    ledger.register_node("node-1", NodeCapacity {
        mem_bytes: 8 * GB,
        cpu_cores: 16,
        gpu_vram: 0,
        block_bytes: 0,
    });

Admission Control
AdmissionController checks reservation headroom and ledger free capacity before approving a lease request.
Returns AdmissionDecision::Approved or AdmissionDecision::Denied(reason).
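
The headroom check described above can be sketched roughly as follows. This is an illustration only: `Node`, `Request`, `Decision`, and `evaluate` are hypothetical stand-ins written for this example, not the grafos_scheduler types, and the first-fit node choice is an assumption.

```rust
// Hypothetical stand-ins for illustration; not the grafos_scheduler API.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub id: String,
    pub total_mem: u64,
    pub reserved_mem: u64,
    pub leased_mem: u64,
    pub total_cpu: u32,
    pub used_cpu: u32,
}

impl Node {
    /// Memory that is neither reserved nor leased.
    pub fn free_mem(&self) -> u64 {
        self.total_mem
            .saturating_sub(self.reserved_mem.saturating_add(self.leased_mem))
    }

    /// CPU cores not yet handed out.
    pub fn free_cpu(&self) -> u32 {
        self.total_cpu.saturating_sub(self.used_cpu)
    }
}

pub struct Request {
    pub mem_bytes: u64,
    pub cpu_cores: u32,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Approved(String),
    Denied(String),
}

/// Approve on the first node with enough free memory and CPU (first fit),
/// otherwise deny with a human-readable reason.
pub fn evaluate(nodes: &[Node], req: &Request) -> Decision {
    for node in nodes {
        if node.free_mem() >= req.mem_bytes && node.free_cpu() >= req.cpu_cores {
            return Decision::Approved(node.id.clone());
        }
    }
    Decision::Denied(format!(
        "no node has {} bytes and {} cores free",
        req.mem_bytes, req.cpu_cores
    ))
}
```

Returning the denial reason as data, rather than logging it inside the check, lets the caller decide whether to attempt preemption or reject the request outright.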
Quota Enforcement
QuotaManager tracks per-tenant resource usage against configured limits. A tenant that has consumed its
quota gets QuotaDenied on subsequent requests. Quotas can limit memory, CPU, GPU, or any combination.
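
To make the quota check concrete, here is a minimal sketch that tracks memory only. `MemQuotas`, `QuotaError`, `charge`, and `release` are hypothetical names for this example; the real QuotaManager may track several resource kinds and expose a different interface.

```rust
use std::collections::HashMap;

// Hypothetical memory-only quota tracker; not the grafos_scheduler QuotaManager.
#[derive(Debug, PartialEq)]
pub enum QuotaError {
    /// The tenant has no configured limit.
    UnknownTenant,
    /// Granting the request would push the tenant past its limit.
    QuotaDenied { limit: u64, in_use: u64, requested: u64 },
}

#[derive(Default)]
pub struct MemQuotas {
    limits: HashMap<String, u64>,
    usage: HashMap<String, u64>,
}

impl MemQuotas {
    pub fn set_limit(&mut self, tenant: &str, limit: u64) {
        self.limits.insert(tenant.to_string(), limit);
    }

    /// Charge `requested` bytes to `tenant`, or refuse and leave usage unchanged.
    pub fn charge(&mut self, tenant: &str, requested: u64) -> Result<(), QuotaError> {
        let limit = *self.limits.get(tenant).ok_or(QuotaError::UnknownTenant)?;
        let in_use = self.usage.get(tenant).copied().unwrap_or(0);
        if in_use.saturating_add(requested) > limit {
            return Err(QuotaError::QuotaDenied { limit, in_use, requested });
        }
        self.usage.insert(tenant.to_string(), in_use + requested);
        Ok(())
    }

    /// Return bytes to the tenant when a lease ends.
    pub fn release(&mut self, tenant: &str, bytes: u64) {
        if let Some(used) = self.usage.get_mut(tenant) {
            *used = used.saturating_sub(bytes);
        }
    }
}
```

A denied request leaves usage untouched, so a tenant at its limit can be admitted again as soon as one of its leases is released.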
Placement Scoring
PlacementScorer ranks candidate nodes for a PlacementRequest. Strategies include RoundRobin, BestFit
(least waste), and weighted scoring via ScoreWeights. The scorer returns a ranked list of PlacementScore
entries.
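
A best-fit ranking of the kind described above might look like the sketch below. `Candidate`, `Score`, and `best_fit` are hypothetical, the example considers memory only, and the tie-break on node id is an assumption made to keep the ordering deterministic.

```rust
// Hypothetical best-fit scorer over memory only; not the PlacementScorer API.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub node_id: String,
    pub free_mem: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Score {
    pub node_id: String,
    /// Bytes left unused on the node if the request lands there.
    pub waste: u64,
}

/// Rank nodes that can hold `mem_bytes`, least waste first.
pub fn best_fit(candidates: &[Candidate], mem_bytes: u64) -> Vec<Score> {
    let mut scores: Vec<Score> = candidates
        .iter()
        .filter(|c| c.free_mem >= mem_bytes) // drop nodes that cannot fit the request
        .map(|c| Score {
            node_id: c.node_id.clone(),
            waste: c.free_mem - mem_bytes,
        })
        .collect();
    // Least waste first; node id breaks ties so the order is deterministic.
    scores.sort_by(|a, b| a.waste.cmp(&b.waste).then_with(|| a.node_id.cmp(&b.node_id)));
    scores
}
```

Because nodes that cannot fit the request are filtered out, the result can be empty; callers should handle that case rather than index the first element unconditionally.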
Preemption
When a high-priority job cannot be admitted due to capacity, PreemptionManager identifies lower-priority
leases that can be evicted. It calls LeaseRevoker to revoke victim leases and returns a PreemptionResult
describing what was freed.
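
One plausible victim-selection policy is sketched below. The `Priority` ordering (including a `Low` level), the `Lease` struct, and the "lowest priority first, largest lease first" rule are assumptions for illustration; the actual PreemptionManager policy is not specified here.

```rust
// Hypothetical victim selection; not the PreemptionManager implementation.
// Variant order defines the ordering: Low < Normal < Critical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Low,
    Normal,
    Critical,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub lease_id: u64,
    pub priority: Priority,
    pub mem_bytes: u64,
}

/// Pick leases to evict so that at least `needed` bytes are freed.
/// Returns `None` when evicting every lower-priority lease is still not enough.
pub fn find_victims(leases: &[Lease], incoming: Priority, needed: u64) -> Option<Vec<Lease>> {
    // Only strictly lower-priority leases are eligible.
    let mut eligible: Vec<&Lease> = leases.iter().filter(|l| l.priority < incoming).collect();
    // Lowest priority first; among equals, largest lease first to keep the victim count small.
    eligible.sort_by(|a, b| {
        a.priority
            .cmp(&b.priority)
            .then(b.mem_bytes.cmp(&a.mem_bytes))
    });

    let mut freed = 0u64;
    let mut victims = Vec::new();
    for lease in eligible {
        if freed >= needed {
            break;
        }
        freed = freed.saturating_add(lease.mem_bytes);
        victims.push(lease.clone());
    }
    if freed >= needed {
        Some(victims)
    } else {
        None
    }
}
```

Computing the full victim set before revoking anything avoids evicting work when the urgent job could not be admitted anyway.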
Typed Preemption Reason
Every preemption decision names its reason from the closed-set
PreemptionReason enum. SIEM rules and audit dashboards key on the
snake_case as_str() form. The full taxonomy:
| Variant | as_str() | When it fires |
|---|---|---|
| PriorityPreemption | priority_preemption | Higher-priority admitted work reclaims resources from preemptible lower-priority work. |
| QuotaRebalance | quota_rebalance | Tenant/project usage moved back inside its fair-share or quota envelope (typically because another tenant was admitted). |
| BurstCreditExhausted | burst_credit_exhausted | The tenant’s token-bucket burst credit reached zero. No other tenant was admitted; the workload exceeded its own envelope. |
| BudgetExhausted | budget_exhausted | Tenant budget/spend policy is exhausted; the work was configured as preemptible on budget exhaustion. |
| CostCapEviction | cost_cap_eviction | Economics policy requires eviction because a hard cost cap can no longer be satisfied. |
| OperatorDrain | operator_drain | An operator initiated node/cell drain (pairs with Recipe 58). |
| OperatorMigProfileChange | operator_mig_profile_change | An operator initiated GPU MIG profile recompose. |
| MaintenanceWindow | maintenance_window | Scheduled maintenance window evicting non-essential work. |
| PolicyViolationRecovery | policy_violation_recovery | Policy enforcement action, e.g. attestation lapsed or a security finding triggered eviction. |
A SIEM alert keyed on
reason == "burst_credit_exhausted" indicates tenant misbehavior;
one keyed on reason == "policy_violation_recovery" is a security
signal. The variants are intentionally distinct so neither bucket
hides inside the other.
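
The taxonomy in the table can be mirrored as a closed enum. The sketch below is reconstructed from the table alone; the actual grafos_core::PreemptionReason definition (derives, method signatures) may differ, and `ALL` and `from_wire` are hypothetical helpers added for this example.

```rust
// Sketch reconstructed from the table above; not the grafos_core source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PreemptionReason {
    PriorityPreemption,
    QuotaRebalance,
    BurstCreditExhausted,
    BudgetExhausted,
    CostCapEviction,
    OperatorDrain,
    OperatorMigProfileChange,
    MaintenanceWindow,
    PolicyViolationRecovery,
}

impl PreemptionReason {
    /// Every variant, in table order (hypothetical helper).
    pub const ALL: [PreemptionReason; 9] = [
        PreemptionReason::PriorityPreemption,
        PreemptionReason::QuotaRebalance,
        PreemptionReason::BurstCreditExhausted,
        PreemptionReason::BudgetExhausted,
        PreemptionReason::CostCapEviction,
        PreemptionReason::OperatorDrain,
        PreemptionReason::OperatorMigProfileChange,
        PreemptionReason::MaintenanceWindow,
        PreemptionReason::PolicyViolationRecovery,
    ];

    /// Snake_case wire form that SIEM rules and dashboards key on.
    pub fn as_str(self) -> &'static str {
        match self {
            PreemptionReason::PriorityPreemption => "priority_preemption",
            PreemptionReason::QuotaRebalance => "quota_rebalance",
            PreemptionReason::BurstCreditExhausted => "burst_credit_exhausted",
            PreemptionReason::BudgetExhausted => "budget_exhausted",
            PreemptionReason::CostCapEviction => "cost_cap_eviction",
            PreemptionReason::OperatorDrain => "operator_drain",
            PreemptionReason::OperatorMigProfileChange => "operator_mig_profile_change",
            PreemptionReason::MaintenanceWindow => "maintenance_window",
            PreemptionReason::PolicyViolationRecovery => "policy_violation_recovery",
        }
    }

    /// Parse a wire string back into a variant (hypothetical helper).
    pub fn from_wire(s: &str) -> Option<PreemptionReason> {
        Self::ALL.iter().copied().find(|r| r.as_str() == s)
    }
}
```

Because the `match` in `as_str` is exhaustive, adding a variant without a wire string is a compile error, which keeps the vocabulary closed.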
Audit Chain Records
Every preemption decision seals an
AuditEventKind::Preempted record into the hash-linked audit chain,
carrying the typed reason in AuditInput::preemption_reason. Recipes
55 (collector) and 60 (tenant dashboard) consume these records to
reconstruct who-preempted-whom for compliance review and
billing-attribution analysis.
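
The hash-linking idea can be illustrated with a toy chain. Everything here is an assumption for demonstration: `Record`, `append`, and `verify` are hypothetical, the genesis hash of 0 is arbitrary, and std's `DefaultHasher` stands in for a cryptographic hash and signature, which a real audit chain would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash-linked chain; not the grafos_audit record format.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub sequence: u64,
    pub reason: &'static str,
    pub prev_hash: u64,
    pub current_hash: u64,
}

/// Digest over the record fields and the previous record's hash.
/// DefaultHasher is NOT cryptographic; it is only a stand-in here.
fn digest(sequence: u64, reason: &str, prev_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    sequence.hash(&mut h);
    reason.hash(&mut h);
    prev_hash.hash(&mut h);
    h.finish()
}

/// Append a record whose hash covers the previous record's hash.
pub fn append(chain: &mut Vec<Record>, reason: &'static str) {
    // Assumed genesis value: the first record links to 0.
    let prev_hash = chain.last().map(|r| r.current_hash).unwrap_or(0);
    let sequence = chain.len() as u64;
    let current_hash = digest(sequence, reason, prev_hash);
    chain.push(Record { sequence, reason, prev_hash, current_hash });
}

/// Recompute every hash and check each link to its predecessor.
pub fn verify(chain: &[Record]) -> bool {
    let mut expected_prev = 0u64;
    for r in chain {
        if r.prev_hash != expected_prev
            || r.current_hash != digest(r.sequence, r.reason, r.prev_hash)
        {
            return false;
        }
        expected_prev = r.current_hash;
    }
    true
}
```

Editing any earlier record changes its recomputed hash, so verification fails at that record; this is the property that lets reviewers trust the who-preempted-whom history.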
Attestation Gate
Before scheduling on a node, AttestationVerifier checks that the node’s firmware attestation is valid.
Supported verifiers: Ed25519Verifier, DiceVerifier, Tpm2Verifier, NitroVerifier. Nodes that fail
attestation are quarantined.
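
The gate-plus-quarantine flow might be structured as in the sketch below. The `Verifier` trait, `MeasurementAllowList`, and `Gate` are hypothetical; the allow-list comparison is a toy replacement for real signature or quote verification (Ed25519, DICE, TPM 2.0, Nitro), and lifting quarantine on a later successful check is an assumed policy.

```rust
use std::collections::HashSet;

// Hypothetical gate; not the grafos_scheduler AttestationVerifier API.
pub trait Verifier {
    fn verify(&self, evidence: &[u8]) -> Result<(), String>;
}

/// Toy verifier: trusts evidence that exactly matches a known-good measurement.
/// Real verifiers check signatures or quotes instead.
pub struct MeasurementAllowList {
    pub trusted: Vec<Vec<u8>>,
}

impl Verifier for MeasurementAllowList {
    fn verify(&self, evidence: &[u8]) -> Result<(), String> {
        if self.trusted.iter().any(|m| m.as_slice() == evidence) {
            Ok(())
        } else {
            Err("unknown firmware measurement".to_string())
        }
    }
}

pub struct Gate {
    verifier: Box<dyn Verifier>,
    quarantined: HashSet<String>,
}

impl Gate {
    pub fn new(verifier: Box<dyn Verifier>) -> Self {
        Gate { verifier, quarantined: HashSet::new() }
    }

    /// Returns true if the node may be scheduled on; quarantines it otherwise.
    pub fn admit_node(&mut self, node_id: &str, evidence: &[u8]) -> bool {
        match self.verifier.verify(evidence) {
            Ok(()) => {
                // Assumed policy: a fresh successful check lifts quarantine.
                self.quarantined.remove(node_id);
                true
            }
            Err(_) => {
                self.quarantined.insert(node_id.to_string());
                false
            }
        }
    }

    pub fn is_quarantined(&self, node_id: &str) -> bool {
        self.quarantined.contains(node_id)
    }
}
```

Putting the verifier behind a trait object lets the same gate logic run against any of the supported attestation schemes.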
Walkthrough (Implementation Sketch)
1. Register Nodes and Tenants

    use grafos_scheduler::*;

    let mut ledger = CapacityLedger::new();
    ledger.register_node("node-1", NodeCapacity {
        mem_bytes: 8_000_000_000,
        cpu_cores: 16,
        gpu_vram: 0,
        block_bytes: 0,
    });
    ledger.register_node("node-2", NodeCapacity {
        mem_bytes: 16_000_000_000,
        cpu_cores: 32,
        gpu_vram: 0,
        block_bytes: 0,
    });

    let mut quotas = QuotaManager::new();
    let team_a = TenantId("team-alpha".into());
    let team_b = TenantId("team-beta".into());
    quotas.set_quota(team_a.clone(), Quota {
        limits: vec![ResourceLimit::mem_bytes(12_000_000_000)],
    });
    quotas.set_quota(team_b.clone(), Quota {
        limits: vec![ResourceLimit::mem_bytes(8_000_000_000)],
    });

2. Gate Admission

    let mut admission = AdmissionController::new(&ledger);

    let request = PlacementRequest {
        mem_bytes: 4_000_000_000,
        cpu_cores: 8,
        gpu_vram: 0,
        block_bytes: 0,
        tenant: team_a.clone(),
        priority: Priority::Normal,
        tag: AccountingTag("batch-etl".into()),
    };

    // Check quota first
    quotas.check(&team_a, &request)?;

    // Then check admission
    match admission.evaluate(&request) {
        AdmissionDecision::Approved(node_id) => {
            ledger.reserve(&node_id, &request);
        }
        AdmissionDecision::Denied(reason) => {
            // Try preemption or reject
        }
    }

3. Score Placement

    let scorer = PlacementScorer::new(Strategy::BestFit);
    let candidates = scorer.score(&ledger, &request);
    // candidates is sorted by fit score, least waste first
    let best = &candidates[0];

4. Preempt Lower-Priority Work

    let mut preemption = PreemptionManager::new(PreemptionConfig::default());

    // High-priority job that cannot be admitted
    let urgent = PlacementRequest {
        mem_bytes: 12_000_000_000,
        priority: Priority::Critical,
        // ...
        ..request
    };

    match preemption.find_victims(&ledger, &urgent) {
        PreemptionResult::Preemptable(victims) => {
            for victim in victims {
                // Revoke the victim's lease via LeaseRevoker trait
                revoker.revoke(victim.lease_id)?;
                ledger.release(&victim.node_id, victim.resources);
            }
            // Now admit the urgent job
        }
        PreemptionResult::Impossible(reason) => {
            // Cannot free enough resources even with preemption
        }
    }

4b. Seal the Preemption Into the Audit Chain
Every preemption decision should produce a typed audit-chain record naming the reason. Downstream consumers (Recipe 55’s collector, Recipe 60’s dashboard) read these records.

    use cookbook_recipe_26_multi_tenant_preemption::{
        dev_audit_anchor, record_preemption, PreemptionDecision,
    };
    use grafos_core::{PreemptionReason, WorkloadIdentity};

    let (mut anchor, signer) = dev_audit_anchor();

    let decision = PreemptionDecision {
        session_id: victim.lease_id as u128,
        tenant_identity: WorkloadIdentity::tenant_only("research"),
        reason: PreemptionReason::PriorityPreemption,
    };
    let record = record_preemption(&mut anchor, &signer, sequence, now_secs, &decision);

    // `record.kind == AuditEventKind::Preempted`
    // `record.preemption_reason == Some(PreemptionReason::PriorityPreemption)`
    // `record.current_event_hash` chains to subsequent preemption records.

5. Verify Attestation Before Scheduling

    let verifier = Ed25519Verifier::new(trusted_keys);
    let gate = AttestationGate::new(Box::new(verifier));

    match gate.verify(&node_attestation) {
        AttestationResult::Trusted => { /* schedule here */ }
        AttestationResult::Untrusted(err) => {
            // Quarantine the node; do not schedule
        }
    }

6. Mint Capability Tokens

    let broker = CapBroker::new(Box::new(hmac_minter));

    let mint_req = MintRequest {
        tenant: team_a.clone(),
        resource_id: lease_id,
        operations: vec!["MEM_READ", "MEM_WRITE"],
        ttl_secs: 300,
    };

    match broker.mint(&mint_req)? {
        MintOutcome::Issued(token) => {
            // Pass token to the workload for data-plane access
        }
        MintOutcome::Denied(reason) => { /* policy violation */ }
    }

Failure Modes
- Node departs mid-job: CapacityLedger removes the node. Jobs on that node fail with Disconnected. The batch executor retries on a different node using TaskGraph retry policy.
- Quota exhausted: new jobs from that tenant are rejected immediately. Existing jobs continue until their leases expire.
- Preemption cascade: a critical job preempts a normal job, which was itself preempting a low-priority job. Guard against this with PreemptionConfig::max_cascade_depth.
- Attestation failure: node is quarantined. Existing leases on that node are revoked. This is intentionally disruptive: running on untrusted firmware is worse than a capacity reduction.
Observability
- Per-tenant resource usage and quota headroom.
- Admission approve/deny rates by tenant and priority.
- Preemption events with victim lease IDs, freed resources, and typed PreemptionReason. SIEM rules key on the snake_case as_str() form (priority_preemption, quota_rebalance, burst_credit_exhausted, …) so each cause has its own alert channel.
- Attestation verification results per node.
- Placement score distributions (helps tune ScoreWeights).
- Audit chain Preempted records: every preemption seals one record carrying the typed reason. The collector at crates/grafos-audit-collector ingests these for compliance and billing-attribution analysis (see Recipes 55 and 60).
Variations
- Spot-style preemptible leases: tenants opt in to preemptible leases at lower cost, accepting that critical jobs may evict them
- Cross-cluster federation: multiple CapacityLedger instances with a global admission controller
- Metered billing: EventLog records lease events with AccountingTag for chargeback
- Attestation refresh: periodically re-verify node attestation; quarantine nodes whose attestation becomes stale
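
For the attestation-refresh variation, a staleness sweep could be as small as the sketch below. `NodeAttestation`, `stale_nodes`, and the `max_age` threshold are hypothetical; how ages are measured and what happens to leases on stale nodes are left to the surrounding system.

```rust
use std::time::Duration;

// Hypothetical staleness sweep; not part of the grafos_scheduler API.
pub struct NodeAttestation {
    pub node_id: String,
    /// Time elapsed since the node's attestation last verified successfully.
    pub age: Duration,
}

/// Return the ids of nodes whose attestation is older than `max_age`.
/// Callers would then re-verify these nodes or quarantine them.
pub fn stale_nodes(nodes: &[NodeAttestation], max_age: Duration) -> Vec<String> {
    nodes
        .iter()
        .filter(|n| n.age > max_age)
        .map(|n| n.node_id.clone())
        .collect()
}
```

Running the sweep on a timer shorter than `max_age` keeps the window in which a stale node can receive new work bounded.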