Recipe 26: Multi-Tenant Compute With Preemption and Attestation

Situation

Multiple teams share a compute fabric. Without policy enforcement, one team can starve others. You need admission control, per-tenant quotas, preemption of lower-priority work, and attestation that nodes are running trusted firmware. Traditional schedulers are complex infrastructure projects with their own databases, APIs, and failure modes.

The grafOS scheduler provides this as a library. No separate scheduler service to operate. Capacity tracking, quota enforcement, placement scoring, preemption, and attestation verification are all in-process calls against data structures you own.

What You Build

A multi-tenant batch compute system where:

  • CapacityLedger tracks per-node resources from GET_INVENTORY
  • AdmissionController gates lease requests against available capacity
  • QuotaManager enforces per-tenant resource limits
  • PlacementScorer selects the best node for each job
  • PreemptionManager evicts lower-priority work when high-priority jobs need resources
  • AttestationVerifier ensures nodes are running trusted firmware before scheduling
  • CapBroker mints scoped capability tokens for authorized access
  • grafos-batch TaskGraph organizes work into DAG-structured jobs

Building Blocks

  • grafos_scheduler::{CapacityLedger, AdmissionController, QuotaManager, PlacementScorer, PreemptionManager, CapBroker, AttestationVerifier}
  • grafos_scheduler::{Priority, TenantId, AccountingTag, Strategy, AdmissionDecision}
  • grafos_core::PreemptionReason — closed-set typed reason enum with snake_case as_str() wire vocabulary.
  • grafos_audit::{assemble_record, AuditInput, AuditEventKind::Preempted} — hash-linked record per preemption decision carrying the typed reason via AuditInput::preemption_reason.
  • grafos_batch::{TaskGraph, TaskDef, ResourceReq}
  • grafos_observe

Design

Capacity Tracking

CapacityLedger maintains per-node totals (from GET_INVENTORY / ANNOUNCE), reserved capacity, and leased capacity. When a node departs, its entry is removed and any leases on it are marked for preemption.

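// GB is not defined by the snippet; a decimal gigabyte is assumed here,
// matching the 8_000_000_000 literal used in the walkthrough.
const GB: u64 = 1_000_000_000;
let mut ledger = CapacityLedger::new();
ledger.register_node("node-1", NodeCapacity {
    mem_bytes: 8 * GB,
    cpu_cores: 16,
    gpu_vram: 0,
    block_bytes: 0,
});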
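To make the bookkeeping concrete, here is a minimal sketch of the accounting described above, tracking memory only. `ToyLedger`, `NodeEntry`, and every method on them are hypothetical stand-ins written for this illustration; they are not the grafos_scheduler types, and the real ledger may compute free capacity differently.

```rust
use std::collections::HashMap;

/// Hypothetical per-node entry (memory only, for brevity).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct NodeEntry {
    total_mem: u64,
    reserved_mem: u64,
    leased_mem: u64,
}

impl NodeEntry {
    /// Free capacity is what remains after reservations and active leases.
    fn free_mem(&self) -> u64 {
        self.total_mem
            .saturating_sub(self.reserved_mem)
            .saturating_sub(self.leased_mem)
    }
}

/// Toy ledger: node id -> entry, plus lease id -> node id.
#[derive(Default)]
struct ToyLedger {
    nodes: HashMap<String, NodeEntry>,
    leases: HashMap<u64, String>,
}

impl ToyLedger {
    fn register_node(&mut self, id: &str, total_mem: u64) {
        let entry = NodeEntry { total_mem, ..Default::default() };
        self.nodes.insert(id.to_string(), entry);
    }

    /// Records a lease if the node has room; returns false otherwise.
    fn lease(&mut self, lease_id: u64, node: &str, mem: u64) -> bool {
        if let Some(entry) = self.nodes.get_mut(node) {
            if entry.free_mem() >= mem {
                entry.leased_mem += mem;
                self.leases.insert(lease_id, node.to_string());
                return true;
            }
        }
        false
    }

    /// Removes a departed node and returns the lease ids to mark for preemption.
    fn remove_node(&mut self, node: &str) -> Vec<u64> {
        self.nodes.remove(node);
        let mut orphaned: Vec<u64> = self
            .leases
            .iter()
            .filter(|(_, n)| n.as_str() == node)
            .map(|(id, _)| *id)
            .collect();
        orphaned.sort_unstable();
        for id in &orphaned {
            self.leases.remove(id);
        }
        orphaned
    }
}
```

The point of the sketch is that a node departure is just a map removal plus a scan for affected leases; nothing outside the process has to be consulted.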

Admission Control

AdmissionController checks reservation headroom and ledger free capacity before approving a lease request. It returns AdmissionDecision::Approved (carrying the selected node ID, as used in the walkthrough) or AdmissionDecision::Denied(reason).
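The two checks can be sketched as a pure function over one node's numbers. Everything here is an assumption made for illustration: `NodeView`, `max_reservable_mem`, and the `DenyReason` variants are invented names, and "reservation headroom" is interpreted as a cap on reservable memory minus what is already reserved, which may not match how the real AdmissionController defines it.

```rust
/// Hypothetical denial reasons; the real `reason` type is not shown in the recipe.
#[derive(Debug, PartialEq)]
enum DenyReason {
    NoReservationHeadroom,
    InsufficientFreeCapacity,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Approved,
    Denied(DenyReason),
}

/// Hypothetical snapshot of one node, memory only.
struct NodeView {
    free_mem: u64,
    reserved_mem: u64,
    /// Assumed cap on how much of the node may be reserved at once.
    max_reservable_mem: u64,
}

/// Applies both gates in order: reservation headroom, then free capacity.
fn evaluate(node: &NodeView, requested_mem: u64) -> Decision {
    let headroom = node.max_reservable_mem.saturating_sub(node.reserved_mem);
    if requested_mem > headroom {
        return Decision::Denied(DenyReason::NoReservationHeadroom);
    }
    if requested_mem > node.free_mem {
        return Decision::Denied(DenyReason::InsufficientFreeCapacity);
    }
    Decision::Approved
}
```

Returning a typed reason rather than a bare boolean lets the caller decide whether preemption is worth attempting.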

Quota Enforcement

QuotaManager tracks per-tenant resource usage against configured limits. A tenant that has consumed its quota gets QuotaDenied on subsequent requests. Quotas can limit memory, CPU, GPU, or any combination.
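A minimal sketch of per-tenant quota accounting, assuming a check-then-charge flow. `ToyQuotas`, `Limits`, `Usage`, and `check_and_charge` are hypothetical names for this illustration. Only the `QuotaDenied` name comes from the text above, and its real shape is not shown there.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Usage {
    mem_bytes: u64,
    cpu_cores: u32,
}

/// `None` means the dimension is unlimited, so any combination can be capped.
#[derive(Debug, Clone, Copy, Default)]
struct Limits {
    mem_bytes: Option<u64>,
    cpu_cores: Option<u32>,
}

#[derive(Debug, PartialEq)]
enum QuotaError {
    QuotaDenied { resource: &'static str },
}

#[derive(Default)]
struct ToyQuotas {
    limits: HashMap<String, Limits>,
    used: HashMap<String, Usage>,
}

impl ToyQuotas {
    /// Charges the request to the tenant only if every configured limit still holds.
    fn check_and_charge(&mut self, tenant: &str, req: Usage) -> Result<(), QuotaError> {
        let limits = self.limits.get(tenant).copied().unwrap_or_default();
        let used = self.used.entry(tenant.to_string()).or_default();
        if let Some(max) = limits.mem_bytes {
            if used.mem_bytes.saturating_add(req.mem_bytes) > max {
                return Err(QuotaError::QuotaDenied { resource: "mem_bytes" });
            }
        }
        if let Some(max) = limits.cpu_cores {
            if used.cpu_cores.saturating_add(req.cpu_cores) > max {
                return Err(QuotaError::QuotaDenied { resource: "cpu_cores" });
            }
        }
        // All checks passed: record the new usage.
        used.mem_bytes += req.mem_bytes;
        used.cpu_cores += req.cpu_cores;
        Ok(())
    }
}
```

A denied request leaves the recorded usage untouched, so a later, smaller request from the same tenant can still succeed.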

Placement Scoring

PlacementScorer ranks candidate nodes for a PlacementRequest. Strategies include RoundRobin, BestFit (least waste), and weighted scoring via ScoreWeights. The scorer returns a ranked list of PlacementScore entries.
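As an illustration of what "least waste" means for BestFit, the sketch below ranks nodes by leftover memory after placement. `Candidate`, `Score`, and `best_fit` are hypothetical and consider memory only; the real PlacementScore fields and any tie-breaking rules are not specified in this recipe.

```rust
/// Hypothetical candidate node: id plus currently free memory.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    node_id: String,
    free_mem: u64,
}

/// Hypothetical score entry: lower waste is a better fit.
#[derive(Debug, Clone, PartialEq)]
struct Score {
    node_id: String,
    waste: u64,
}

/// Ranks nodes that can hold the request by the capacity left over afterwards.
fn best_fit(candidates: &[Candidate], requested_mem: u64) -> Vec<Score> {
    let mut scores: Vec<Score> = candidates
        .iter()
        // Nodes that cannot hold the request are not candidates at all.
        .filter(|c| c.free_mem >= requested_mem)
        .map(|c| Score {
            node_id: c.node_id.clone(),
            waste: c.free_mem - requested_mem,
        })
        .collect();
    // Least waste first; node id breaks ties so the order is deterministic.
    scores.sort_by(|a, b| a.waste.cmp(&b.waste).then_with(|| a.node_id.cmp(&b.node_id)));
    scores
}
```

Best-fit packs small jobs onto nearly full nodes, which tends to keep large contiguous capacity available for big requests.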

Preemption

When a high-priority job cannot be admitted due to capacity, PreemptionManager identifies lower-priority leases that can be evicted. It calls LeaseRevoker to revoke victim leases and returns a PreemptionResult describing what was freed.
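One plausible victim-selection policy is sketched below: only strictly lower-priority leases are eligible, the lowest priority is evicted first, and larger leases go first within a priority so fewer workloads are disturbed. This ordering is an assumption for illustration, not the documented PreemptionManager algorithm, and `Prio`, `Lease`, and `find_victims` here are local stand-ins.

```rust
/// Declaration order defines the ordering: Low < Normal < Critical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Prio {
    Low,
    Normal,
    Critical,
}

#[derive(Debug, Clone, PartialEq)]
struct Lease {
    lease_id: u64,
    priority: Prio,
    mem_bytes: u64,
}

/// Returns the leases to evict, or None if eviction cannot free enough memory.
fn find_victims(
    leases: &[Lease],
    incoming: Prio,
    free_mem: u64,
    needed_mem: u64,
) -> Option<Vec<Lease>> {
    // Only strictly lower-priority work may be evicted.
    let mut eligible: Vec<&Lease> = leases.iter().filter(|l| l.priority < incoming).collect();
    // Lowest priority first; within a priority, largest lease first.
    eligible.sort_by(|a, b| a.priority.cmp(&b.priority).then(b.mem_bytes.cmp(&a.mem_bytes)));

    let mut freed = free_mem;
    let mut victims = Vec::new();
    for lease in eligible {
        if freed >= needed_mem {
            break;
        }
        freed += lease.mem_bytes;
        victims.push(lease.clone());
    }
    if freed >= needed_mem {
        Some(victims)
    } else {
        None
    }
}
```

Selecting victims is kept separate from revoking them, so the caller can inspect the plan (and record the reason) before any lease is actually revoked.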

Typed Preemption Reason

Every preemption decision names its reason from the closed-set PreemptionReason enum. SIEM rules and audit dashboards key on the snake_case as_str() form. The full taxonomy:

| Variant | as_str() | When it fires |
|---|---|---|
| PriorityPreemption | priority_preemption | Higher-priority admitted work reclaims resources from preemptible lower-priority work. |
| QuotaRebalance | quota_rebalance | Tenant/project usage moved back inside its fair-share or quota envelope (typically because another tenant was admitted). |
| BurstCreditExhausted | burst_credit_exhausted | The tenant’s token-bucket burst credit reached zero: no other tenant was admitted, and the workload exceeded its own envelope. |
| BudgetExhausted | budget_exhausted | Tenant budget/spend policy is exhausted; the work was configured as preemptible on budget exhaustion. |
| CostCapEviction | cost_cap_eviction | Economics policy requires eviction because a hard cost cap can no longer be satisfied. |
| OperatorDrain | operator_drain | An operator initiated node/cell drain (pairs with Recipe 58). |
| OperatorMigProfileChange | operator_mig_profile_change | An operator initiated GPU MIG profile recompose. |
| MaintenanceWindow | maintenance_window | Scheduled maintenance window evicting non-essential work. |
| PolicyViolationRecovery | policy_violation_recovery | Policy enforcement action, e.g. attestation lapsed or a security finding triggered eviction. |

A SIEM alert keyed on reason == "burst_credit_exhausted" indicates tenant misbehavior; one keyed on reason == "policy_violation_recovery" is a security signal. The variants are intentionally distinct so neither bucket hides inside the other.
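The closed-set pattern can be sketched as a local enum that mirrors the taxonomy above. This is not the grafos_core definition: the `ALL` constant, `parse`, `AlertChannel`, and `route` are hypothetical helpers added for illustration, and only the two channel assignments named in the text are treated as known.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreemptionReason {
    PriorityPreemption,
    QuotaRebalance,
    BurstCreditExhausted,
    BudgetExhausted,
    CostCapEviction,
    OperatorDrain,
    OperatorMigProfileChange,
    MaintenanceWindow,
    PolicyViolationRecovery,
}

impl PreemptionReason {
    /// Every variant, so wire strings can be parsed without a catch-all.
    const ALL: [PreemptionReason; 9] = [
        PreemptionReason::PriorityPreemption,
        PreemptionReason::QuotaRebalance,
        PreemptionReason::BurstCreditExhausted,
        PreemptionReason::BudgetExhausted,
        PreemptionReason::CostCapEviction,
        PreemptionReason::OperatorDrain,
        PreemptionReason::OperatorMigProfileChange,
        PreemptionReason::MaintenanceWindow,
        PreemptionReason::PolicyViolationRecovery,
    ];

    /// snake_case wire form, matching the table above.
    fn as_str(&self) -> &'static str {
        match self {
            PreemptionReason::PriorityPreemption => "priority_preemption",
            PreemptionReason::QuotaRebalance => "quota_rebalance",
            PreemptionReason::BurstCreditExhausted => "burst_credit_exhausted",
            PreemptionReason::BudgetExhausted => "budget_exhausted",
            PreemptionReason::CostCapEviction => "cost_cap_eviction",
            PreemptionReason::OperatorDrain => "operator_drain",
            PreemptionReason::OperatorMigProfileChange => "operator_mig_profile_change",
            PreemptionReason::MaintenanceWindow => "maintenance_window",
            PreemptionReason::PolicyViolationRecovery => "policy_violation_recovery",
        }
    }

    /// Hypothetical inverse of as_str(); unknown strings are rejected.
    fn parse(s: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|r| r.as_str() == s)
    }
}

/// Hypothetical alert channels for a SIEM rule.
#[derive(Debug, PartialEq, Eq)]
enum AlertChannel {
    TenantBehavior,
    Security,
    Other,
}

fn route(reason: PreemptionReason) -> AlertChannel {
    match reason {
        PreemptionReason::BurstCreditExhausted => AlertChannel::TenantBehavior,
        PreemptionReason::PolicyViolationRecovery => AlertChannel::Security,
        // The recipe does not classify the remaining reasons.
        _ => AlertChannel::Other,
    }
}
```

Because `as_str()` is an exhaustive match, adding a variant without a wire string fails to compile, which is the main benefit of keeping the set closed.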

Audit Chain Records

Every preemption decision seals an AuditEventKind::Preempted record into the hash-linked audit chain, carrying the typed reason in AuditInput::preemption_reason. Recipes 55 (collector) and 60 (tenant dashboard) consume these records to reconstruct who-preempted-whom for compliance review and billing-attribution analysis.
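To show what "hash-linked" buys you, here is a toy chain in which each record's hash covers the previous record's hash. It is only a sketch: `ToyRecord`, `previous_event_hash`, `link_hash`, `append`, and `verify` are hypothetical, and the standard library's `DefaultHasher` stands in for whatever hash and signature scheme grafos_audit actually uses. `DefaultHasher` is not cryptographic and must not be used for a real audit chain.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, PartialEq)]
struct ToyRecord {
    sequence: u64,
    reason: &'static str,
    previous_event_hash: u64,
    current_event_hash: u64,
}

/// Hash over the previous link plus this record's content (non-cryptographic stand-in).
fn link_hash(previous: u64, sequence: u64, reason: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    previous.hash(&mut hasher);
    sequence.hash(&mut hasher);
    reason.hash(&mut hasher);
    hasher.finish()
}

/// Appends a record whose hash chains to the last record (0 for the first one).
fn append(chain: &mut Vec<ToyRecord>, reason: &'static str) {
    let previous = chain.last().map_or(0, |r| r.current_event_hash);
    let sequence = chain.len() as u64;
    chain.push(ToyRecord {
        sequence,
        reason,
        previous_event_hash: previous,
        current_event_hash: link_hash(previous, sequence, reason),
    });
}

/// Recomputes every link; any edited or reordered record breaks the chain.
fn verify(chain: &[ToyRecord]) -> bool {
    let mut expected_previous = 0u64;
    for record in chain {
        if record.previous_event_hash != expected_previous {
            return false;
        }
        let recomputed = link_hash(record.previous_event_hash, record.sequence, record.reason);
        if record.current_event_hash != recomputed {
            return false;
        }
        expected_previous = record.current_event_hash;
    }
    true
}
```

A consumer that replays the chain can therefore detect a rewritten preemption reason without trusting the process that stored the records.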

Attestation Gate

Before scheduling on a node, AttestationVerifier checks that the node’s firmware attestation is valid. Supported verifiers: Ed25519Verifier, DiceVerifier, Tpm2Verifier, NitroVerifier. Nodes that fail attestation are quarantined.
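The gate-plus-quarantine behavior can be sketched with a small trait object. All names here (`Verifier`, `AllowListVerifier`, `Gate`, `admit_node`) are hypothetical, and the allow-list comparison is a placeholder for real signature or quote verification such as the Ed25519 or TPM 2.0 verifiers listed above.

```rust
use std::collections::HashSet;

/// Hypothetical verifier interface; real verifiers check signatures or quotes.
trait Verifier {
    fn verify(&self, evidence: &[u8]) -> Result<(), String>;
}

/// Toy verifier: accepts evidence equal to an allow-listed measurement.
struct AllowListVerifier {
    trusted: Vec<Vec<u8>>,
}

impl Verifier for AllowListVerifier {
    fn verify(&self, evidence: &[u8]) -> Result<(), String> {
        if self.trusted.iter().any(|m| m.as_slice() == evidence) {
            Ok(())
        } else {
            Err("unknown firmware measurement".to_string())
        }
    }
}

struct Gate {
    verifier: Box<dyn Verifier>,
    trusted_nodes: HashSet<String>,
    quarantined: HashSet<String>,
}

impl Gate {
    fn new(verifier: Box<dyn Verifier>) -> Self {
        Gate {
            verifier,
            trusted_nodes: HashSet::new(),
            quarantined: HashSet::new(),
        }
    }

    /// Verifies a node's evidence and moves it between the trusted and quarantined sets.
    fn admit_node(&mut self, node_id: &str, evidence: &[u8]) -> bool {
        match self.verifier.verify(evidence) {
            Ok(()) => {
                self.quarantined.remove(node_id);
                self.trusted_nodes.insert(node_id.to_string());
                true
            }
            Err(_) => {
                self.trusted_nodes.remove(node_id);
                self.quarantined.insert(node_id.to_string());
                false
            }
        }
    }

    /// A node is schedulable only after a successful verification.
    fn is_schedulable(&self, node_id: &str) -> bool {
        self.trusted_nodes.contains(node_id)
    }
}
```

Note the default: a node that has never presented evidence is not schedulable, so the gate fails closed.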

Walkthrough (Implementation Sketch)

1. Register Nodes and Tenants

use grafos_scheduler::*;
// Register two nodes with their total capacity.
let mut ledger = CapacityLedger::new();
ledger.register_node("node-1", NodeCapacity {
    mem_bytes: 8_000_000_000,
    cpu_cores: 16,
    gpu_vram: 0,
    block_bytes: 0,
});
ledger.register_node("node-2", NodeCapacity {
    mem_bytes: 16_000_000_000,
    cpu_cores: 32,
    gpu_vram: 0,
    block_bytes: 0,
});
// Give each tenant a memory quota.
let mut quotas = QuotaManager::new();
let team_a = TenantId("team-alpha".into());
let team_b = TenantId("team-beta".into());
quotas.set_quota(team_a.clone(), Quota {
    limits: vec![ResourceLimit::mem_bytes(12_000_000_000)],
});
quotas.set_quota(team_b.clone(), Quota {
    limits: vec![ResourceLimit::mem_bytes(8_000_000_000)],
});

2. Gate Admission

let mut admission = AdmissionController::new(&ledger);
let request = PlacementRequest {
    mem_bytes: 4_000_000_000,
    cpu_cores: 8,
    gpu_vram: 0,
    block_bytes: 0,
    tenant: team_a.clone(),
    priority: Priority::Normal,
    tag: AccountingTag("batch-etl".into()),
};
// Check quota first
quotas.check(&team_a, &request)?;
// Then check admission
match admission.evaluate(&request) {
    AdmissionDecision::Approved(node_id) => {
        ledger.reserve(&node_id, &request);
    }
    AdmissionDecision::Denied(reason) => {
        // Try preemption or reject
    }
}

3. Score Placement

let scorer = PlacementScorer::new(Strategy::BestFit);
let candidates = scorer.score(&ledger, &request);
// candidates is sorted by fit score, least waste first.
// Indexing panics if no node fits; check candidates.is_empty() in real code.
let best = &candidates[0];

4. Preempt Lower-Priority Work

let mut preemption = PreemptionManager::new(PreemptionConfig::default());
// High-priority job that cannot be admitted
let urgent = PlacementRequest {
    mem_bytes: 12_000_000_000,
    priority: Priority::Critical,
    // Remaining fields (tenant, tag, ...) are taken from the earlier request.
    ..request
};
match preemption.find_victims(&ledger, &urgent) {
    PreemptionResult::Preemptable(victims) => {
        for victim in victims {
            // Revoke the victim's lease via the LeaseRevoker trait;
            // `revoker` is whichever LeaseRevoker implementation you supply.
            revoker.revoke(victim.lease_id)?;
            ledger.release(&victim.node_id, victim.resources);
        }
        // Now admit the urgent job
    }
    PreemptionResult::Impossible(reason) => {
        // Cannot free enough resources even with preemption
    }
}

4b. Seal the Preemption Into the Audit Chain

Every preemption decision should produce a typed audit-chain record naming the reason. Downstream consumers (Recipe 55’s collector, Recipe 60’s dashboard) read these records.

use cookbook_recipe_26_multi_tenant_preemption::{
    dev_audit_anchor, record_preemption, PreemptionDecision,
};
use grafos_core::{PreemptionReason, WorkloadIdentity};
let (mut anchor, signer) = dev_audit_anchor();
let decision = PreemptionDecision {
    session_id: victim.lease_id as u128,
    tenant_identity: WorkloadIdentity::tenant_only("research"),
    reason: PreemptionReason::PriorityPreemption,
};
let record = record_preemption(&mut anchor, &signer, sequence, now_secs, &decision);
// `record.kind == AuditEventKind::Preempted`
// `record.preemption_reason == Some(PreemptionReason::PriorityPreemption)`
// `record.current_event_hash` chains to subsequent preemption records.

5. Verify Attestation Before Scheduling

let verifier = Ed25519Verifier::new(trusted_keys);
let gate = AttestationGate::new(Box::new(verifier));
match gate.verify(&node_attestation) {
    AttestationResult::Trusted => { /* schedule here */ }
    AttestationResult::Untrusted(err) => {
        // Quarantine the node; do not schedule
    }
}

6. Mint Capability Tokens

let broker = CapBroker::new(Box::new(hmac_minter));
let mint_req = MintRequest {
    tenant: team_a.clone(),
    resource_id: lease_id,
    operations: vec!["MEM_READ", "MEM_WRITE"],
    ttl_secs: 300,
};
match broker.mint(&mint_req)? {
    MintOutcome::Issued(token) => {
        // Pass token to the workload for data-plane access
    }
    MintOutcome::Denied(reason) => { /* policy violation */ }
}

Failure Modes

  • Node departs mid-job: CapacityLedger removes the node. Jobs on that node fail with Disconnected. The batch executor retries on a different node using TaskGraph retry policy.
  • Quota exhausted: new jobs from that tenant are rejected immediately. Existing jobs continue until their leases expire.
  • Preemption cascade: a critical job preempts a normal job, which was itself preempting a low-priority job. Guard against this with PreemptionConfig::max_cascade_depth.
  • Attestation failure: node is quarantined. Existing leases on that node are revoked. This is intentionally disruptive — running on untrusted firmware is worse than a capacity reduction.

Observability

  • Per-tenant resource usage and quota headroom.
  • Admission approve/deny rates by tenant and priority.
  • Preemption events with victim lease IDs, freed resources, and typed PreemptionReason — SIEM rules key on the snake_case as_str() form (priority_preemption, quota_rebalance, burst_credit_exhausted, …) so each cause has its own alert channel.
  • Attestation verification results per node.
  • Placement score distributions (helps tune ScoreWeights).
  • Audit chain Preempted records: every preemption seals one record carrying the typed reason. The collector at crates/grafos-audit-collector ingests these for compliance and billing-attribution analysis (see Recipes 55 and 60).

Variations

  • Spot-style preemptible leases: tenants opt in to preemptible leases at lower cost, accepting that critical jobs may evict them
  • Cross-cluster federation: multiple CapacityLedger instances with a global admission controller
  • Metered billing: EventLog records lease events with AccountingTag for chargeback
  • Attestation refresh: periodically re-verify node attestation; quarantine nodes whose attestation becomes stale
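For the attestation-refresh variation, the staleness check itself is small. The sketch below assumes you keep a last-verified timestamp per node; `stale_nodes`, `max_age_secs`, and the tuple layout are hypothetical, and how re-verification is triggered is left to the caller.

```rust
/// Returns the ids of nodes whose last successful attestation is older than `max_age_secs`.
/// `last_verified` pairs each node id with the Unix time (seconds) of its last verification.
fn stale_nodes(last_verified: &[(String, u64)], now_secs: u64, max_age_secs: u64) -> Vec<String> {
    last_verified
        .iter()
        // saturating_sub treats a timestamp in the future as age zero.
        .filter(|(_, verified_at)| now_secs.saturating_sub(*verified_at) > max_age_secs)
        .map(|(node_id, _)| node_id.clone())
        .collect()
}
```

Running this on a timer and quarantining whatever it returns gives the periodic re-verification described in the last bullet above.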