
Scheduler Isolation/Exclusivity Policy

Status: design decision. Commits how grafos-scheduler reasons about per-lease CPU isolation and GPU exclusivity during placement.

Addendum to: docs/spec/resource-isolation-and-exclusivity.md §6.3. Builds on: docs/spec/cpu-isolation-wire-format.md and docs/spec/gpu-exclusivity-wire-format.md. Joint with: the scheduler affinity engine — this note commits the filter/score pipeline shape that affinity work will share.


1. Problem

§6.3 names the three tradeoff axes (density, fairness, predictability), but grafos-scheduler has zero isolation-awareness: the existing PlacementScorer considers capacity, spread/binpack strategy, cache locality, and affinity, and none of the isolation/exclusivity axes. The only existing exclusivity logic in the scheduler is KV-cache attach-exclusivity, which is unrelated (cache-lifetime ownership, not per-lease isolation policy).

This note commits the scheduler’s treatment of the new classes.

2. Decision: filter-then-score pipeline

The scheduler adopts a filter → score → adapt pipeline for placement, matching the shape the affinity engine uses. Isolation and affinity share the same pipeline machinery.

    candidate nodes
           │
┌──────────────────────┐
│ hard filters         │  isolation class must be supportable
│                      │  AND current load must permit it
│ + affinity required  │  (affinity reuses this stage)
│ + tenant admission   │
└──────────┬───────────┘
           │ eligible candidates
┌──────────────────────┐
│ scorers              │  existing spread / binpack / cache-aware
│ (existing + affinity │  + affinity score
│  score)              │  (isolation is NOT a score dimension; it
│                      │  is binary at the filter stage)
└──────────┬───────────┘
           │ ranked candidates
┌──────────────────────┐
│ adapt / retry        │  existing cell-retry path for transient
│                      │  rejections (unchanged)
└──────────────────────┘

2.1 Why isolation is filter-only, not scored

Per-lease isolation classes are binary: a node either can guarantee StrictIsolated for a new lease right now, or it cannot. There is no meaningful “75% isolated” ranking. Treating isolation as a scoring dimension would invite the exact silent-weakening behavior design note §6.2 forbids.

Affinity, by contrast, admits a gradient (preferred node vs preferred rack vs anywhere), so it is legitimately scored. That’s why affinity contributes to the scorer stage while isolation contributes only to the filter stage.

3. Filter rules

For each candidate node, the isolation filter checks:

3.1 CPU isolation (TLV_LEASE_CPU_ISOLATION)

| Class | Filter behavior |
|---|---|
| BestEffort | No filter (all nodes pass). |
| WholeCore | Node must advertise WholeCore support AND have ≥1 uncommitted whole core. |
| StrictIsolated | Node must advertise StrictIsolated support AND have a topology-isolable core available. |
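
The CPU rules above can be sketched as a single predicate. This is a minimal illustration, not the scheduler's implementation: the `NodeCpuView` struct, its field names, and `cpu_filter_passes` are hypothetical, since the node inventory format is not defined in this note. Only the class names and `CpuIsolationPolicy` come from the text.

```rust
/// Per-lease CPU isolation class, using the class names from this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuIsolationPolicy {
    BestEffort,
    WholeCore,
    StrictIsolated,
}

/// Hypothetical per-node view (assumption: the real advertisement
/// format is deferred and may look quite different).
#[derive(Debug, Clone, Default)]
pub struct NodeCpuView {
    pub supports_whole_core: bool,
    pub supports_strict_isolated: bool,
    pub uncommitted_whole_cores: u32,
    pub topology_isolable_cores: u32,
}

/// Returns true when `node` passes the CPU isolation filter for `class`.
/// Both conditions must hold: the class is advertised AND current load permits it.
pub fn cpu_filter_passes(class: CpuIsolationPolicy, node: &NodeCpuView) -> bool {
    match class {
        // BestEffort applies no filter.
        CpuIsolationPolicy::BestEffort => true,
        CpuIsolationPolicy::WholeCore => {
            node.supports_whole_core && node.uncommitted_whole_cores >= 1
        }
        CpuIsolationPolicy::StrictIsolated => {
            node.supports_strict_isolated && node.topology_isolable_cores >= 1
        }
    }
}
```

Note that the result is a plain `bool`: there is no partial pass, which matches the filter-only treatment in §2.1.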

3.2 GPU exclusivity (TLV_LEASE_GPU_EXCLUSIVITY)

| Class | Filter behavior |
|---|---|
| Shared | Node’s daemon mode must permit sharing. Under --gpu-share-mode exclusive this is rejected per the GPU exclusivity wire format §4. |
| SessionExclusive | Node must support v1 GpuSession AND have a device currently unclaimed by another session. |
| DeviceExclusive | Node must have a GPU with zero active leases. |
| PartitionExclusive | Node must expose partitionable GPU (future MIG-as-class; no candidates today). |
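
A sketch of the GPU rules in the same style. All type and field names here (`GpuExclusivityClass`, `GpuShareMode`, `GpuDeviceView`, `NodeGpuView`, `gpu_filter_passes`) are assumptions made for illustration; only the four class names and the `--gpu-share-mode exclusive` setting come from this note. The `SharingPermitted` variant stands in for whatever non-exclusive daemon modes exist.

```rust
/// Per-lease GPU exclusivity class, using the class names from this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuExclusivityClass {
    Shared,
    SessionExclusive,
    DeviceExclusive,
    PartitionExclusive,
}

/// Hypothetical model of the daemon's --gpu-share-mode setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuShareMode {
    Exclusive,
    SharingPermitted, // assumption: any mode other than `exclusive`
}

/// Hypothetical per-device state.
#[derive(Debug, Clone)]
pub struct GpuDeviceView {
    pub active_leases: u32,
    pub claimed_by_session: bool,
    pub partitionable: bool,
}

/// Hypothetical per-node GPU view.
#[derive(Debug, Clone)]
pub struct NodeGpuView {
    pub share_mode: GpuShareMode,
    pub supports_gpu_session_v1: bool,
    pub devices: Vec<GpuDeviceView>,
}

/// Returns true when `node` passes the GPU exclusivity filter for `class`.
pub fn gpu_filter_passes(class: GpuExclusivityClass, node: &NodeGpuView) -> bool {
    match class {
        GpuExclusivityClass::Shared => node.share_mode != GpuShareMode::Exclusive,
        GpuExclusivityClass::SessionExclusive => {
            node.supports_gpu_session_v1
                && node.devices.iter().any(|d| !d.claimed_by_session)
        }
        GpuExclusivityClass::DeviceExclusive => {
            node.devices.iter().any(|d| d.active_leases == 0)
        }
        // No node exposes partitionable GPUs today, so this is always false in practice.
        GpuExclusivityClass::PartitionExclusive => {
            node.devices.iter().any(|d| d.partitionable)
        }
    }
}
```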

3.3 Node advertisement dependency

The filter cannot run until nodes advertise which classes they honor. That inventory work (per-class CPU and GPU advertisement) is deferred. Until inventory advertisement lands, the scheduler MUST assume each node supports only BestEffort / Shared, and MUST fail-closed-reject any placement request with a stricter class. This is intentional: fail-closed discipline from §6.2 applies to the scheduler itself, not just the node-side handler.
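
The fail-closed default can be illustrated with a small sketch for the CPU side (the GPU side would treat Shared the same way). The function name `node_honors` and the use of `Option` to model a missing advertisement are assumptions, not part of the committed design.

```rust
/// Per-lease CPU isolation class, using the class names from this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuIsolationPolicy {
    BestEffort,
    WholeCore,
    StrictIsolated,
}

/// Hypothetical helper: does a node honor `requested`?
/// `advertised` is `None` when the node has not advertised any classes,
/// which is every node until inventory advertisement lands.
pub fn node_honors(
    requested: CpuIsolationPolicy,
    advertised: Option<&[CpuIsolationPolicy]>,
) -> bool {
    // BestEffort is always assumed to be supported.
    if requested == CpuIsolationPolicy::BestEffort {
        return true;
    }
    match advertised {
        Some(classes) => classes.contains(&requested),
        // Fail closed: no advertisement means no stricter class is honored.
        None => false,
    }
}
```

The important property is the `None` arm: absence of information is treated as absence of support, never as permission.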

4. Interaction with tenant priority

Per-lease isolation is orthogonal to tenant::Priority tiers (Guaranteed / BestEffort / Scavenger). A Scavenger tenant may request StrictIsolated; a Guaranteed tenant may request BestEffort.

The interaction rule:

  • Tenant priority sets admission order and eviction order under contention (existing behavior, unchanged).
  • Isolation filter sets which nodes are eligible for a given request (new behavior from this note).

A Scavenger tenant requesting StrictIsolated is eligible for placement if nodes support the class, but may be preempted by a Guaranteed tenant requesting any class on the same node. Preemption behavior matches the existing per-cell preemption model — this note does not change it.

Naming collision warning: CpuIsolationPolicy::BestEffort (per-lease CPU isolation class) is UNRELATED to tenant::Priority::BestEffort (tenant admission tier). The scheduler must not conflate them. Renaming is a deferred hygiene cleanup.
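
One way to picture the orthogonality is a request that carries both values as separate, differently typed fields. The sketch below is illustrative only: `PlacementRequest`, `eligible`, and `may_preempt` are hypothetical names, and the preemption check assumes a simple strict ordering of tiers rather than reproducing the existing per-cell preemption model.

```rust
pub mod tenant {
    /// Tenant admission tier. Variant order gives Scavenger < BestEffort < Guaranteed.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    pub enum Priority {
        Scavenger,
        BestEffort,
        Guaranteed,
    }
}

/// Per-lease CPU isolation class. Shares the name `BestEffort` with
/// `tenant::Priority`, but the two are distinct types and cannot be mixed up
/// by the compiler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuIsolationPolicy {
    BestEffort,
    WholeCore,
    StrictIsolated,
}

/// Hypothetical request shape carrying both axes independently.
#[derive(Debug, Clone, Copy)]
pub struct PlacementRequest {
    pub priority: tenant::Priority,
    pub cpu_isolation: CpuIsolationPolicy,
}

/// Eligibility depends only on the isolation class (simplified to one flag here).
pub fn eligible(req: &PlacementRequest, node_can_strict_isolate: bool) -> bool {
    req.cpu_isolation != CpuIsolationPolicy::StrictIsolated || node_can_strict_isolate
}

/// Preemption depends only on tenant priority (assumed strict ordering).
pub fn may_preempt(incoming: &PlacementRequest, resident: &PlacementRequest) -> bool {
    incoming.priority > resident.priority
}
```

With this shape, a Scavenger tenant holding a StrictIsolated lease is eligible wherever the class is supported, and a Guaranteed tenant requesting BestEffort isolation can still preempt it.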

5. Rejection reasons

Scheduler rejections for isolation-filter failures surface as structured events with distinct reason codes:

  • NoNodeSupportsClass — no candidate node advertises the requested class. Permanent (won’t retry without operator action).
  • NodesSupportButContended — nodes advertise the class but all are currently at capacity for that class. Transient (scheduler may retry via the existing cell-retry path).
  • ClassConflictsWithDaemonMode — e.g. Shared requested under --gpu-share-mode exclusive. Permanent from the client’s perspective; operator-action-required.
  • ClassConflictsWithResourceId — e.g. DeviceExclusive on a MIG sub-device. Permanent; client must correct the request.

Rejection events are emitted via the existing scheduler event stream. No new transport is added.
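
The reason codes and their retry semantics could be modelled as an enum along these lines. The variant names come from the list above; the enum name `IsolationRejection`, the `is_transient` method, and the `classify_rejection` helper are assumptions for illustration, and the helper covers only the two capacity-related reasons.

```rust
/// Reason codes for isolation-filter rejections, named as in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationRejection {
    NoNodeSupportsClass,
    NodesSupportButContended,
    ClassConflictsWithDaemonMode,
    ClassConflictsWithResourceId,
}

impl IsolationRejection {
    /// Only contention is transient; every other reason needs operator or
    /// client action, so retrying via the cell-retry path would not help.
    pub fn is_transient(self) -> bool {
        matches!(self, IsolationRejection::NodesSupportButContended)
    }
}

/// Hypothetical classifier over filter outcomes.
/// `advertising` counts nodes that advertise the requested class;
/// `with_capacity` counts those that can also admit the lease right now.
pub fn classify_rejection(advertising: usize, with_capacity: usize) -> Option<IsolationRejection> {
    if advertising == 0 {
        Some(IsolationRejection::NoNodeSupportsClass)
    } else if with_capacity == 0 {
        Some(IsolationRejection::NodesSupportButContended)
    } else {
        None // at least one eligible node: no rejection
    }
}
```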

6. Shared pipeline with affinity

The scheduler affinity engine uses the same filter/score/adapt pipeline for hard and soft affinity constraints. This note commits that isolation uses the same pipeline, not a parallel one. Concretely:

  • The filter stage is a single Vec<Box<dyn PlacementFilter>>. Isolation contributes an IsolationFilter; affinity contributes an AffinityRequiredFilter.
  • The score stage is a single Vec<Box<dyn PlacementScorer>> of weighted scorers. Affinity contributes an AffinityScorer; isolation contributes nothing (isolation is binary, not scored).
  • The adapt stage is the existing cell-retry path.

Implementation order: the isolation work lands the pipeline skeleton plus IsolationFilter; the affinity work slots in AffinityRequiredFilter and AffinityScorer on top. Neither owns the pipeline alone.
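
A minimal sketch of the pipeline skeleton under stated assumptions. The trait method signatures, the `Node` and `Request` shapes, the `Pipeline` struct, and `BinpackScorer` (a stand-in for an existing scorer) are all hypothetical; only the `Vec<Box<dyn PlacementFilter>>` / `Vec<Box<dyn PlacementScorer>>` shape and the `IsolationFilter` name come from this note. The adapt stage is omitted because it is the existing cell-retry path.

```rust
/// Hypothetical, simplified views used only in this sketch.
pub struct Node {
    pub name: &'static str,
    pub free_cores: u32,
    pub supports_strict_isolated: bool,
}

pub struct Request {
    pub wants_strict_isolated: bool,
}

/// Hard constraint: a node either passes or is dropped.
pub trait PlacementFilter {
    fn permits(&self, req: &Request, node: &Node) -> bool;
}

/// Soft preference: higher weighted scores rank earlier.
pub trait PlacementScorer {
    fn weight(&self) -> f64;
    fn score(&self, req: &Request, node: &Node) -> f64;
}

pub struct IsolationFilter;
impl PlacementFilter for IsolationFilter {
    fn permits(&self, req: &Request, node: &Node) -> bool {
        !req.wants_strict_isolated || node.supports_strict_isolated
    }
}

/// Stand-in for an existing scorer: prefers nodes with fewer free cores.
pub struct BinpackScorer;
impl PlacementScorer for BinpackScorer {
    fn weight(&self) -> f64 {
        1.0
    }
    fn score(&self, _req: &Request, node: &Node) -> f64 {
        -f64::from(node.free_cores)
    }
}

pub struct Pipeline {
    pub filters: Vec<Box<dyn PlacementFilter>>,
    pub scorers: Vec<Box<dyn PlacementScorer>>,
}

impl Pipeline {
    /// Filter stage first, then score and rank the survivors (best first).
    pub fn rank<'a>(&self, req: &Request, nodes: &'a [Node]) -> Vec<&'a Node> {
        let mut scored: Vec<(f64, &'a Node)> = nodes
            .iter()
            .filter(|node| self.filters.iter().all(|f| f.permits(req, node)))
            .map(|node| {
                let total: f64 = self
                    .scorers
                    .iter()
                    .map(|s| s.weight() * s.score(req, node))
                    .sum();
                (total, node)
            })
            .collect();
        scored.sort_by(|a, b| b.0.total_cmp(&a.0));
        scored.into_iter().map(|(_, node)| node).collect()
    }
}
```

In this shape, affinity work would push an `AffinityRequiredFilter` onto `filters` and an `AffinityScorer` onto `scorers` without touching `rank`, while isolation never appears in `scorers` at all.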

7. What this note does NOT commit to

  • Implementation. The IsolationFilter, pipeline refactor, rejection-reason event variants, and tests are deferred to a separate follow-on wave.
  • Affinity filter/scorer details. This note commits the pipeline shape; the affinity specifics belong to the affinity work.
  • Preemption policy changes. Existing per-cell preemption is unchanged.
  • Cross-cell placement based on isolation. The orchestrator continues to route by cell summaries without isolation-awareness; isolation is enforced at the cell scheduler. The orchestrator MAY grow isolation-aware cell summary fields in a later wave; out of scope here.
  • Rename of attach_exclusive / cache-exclusive vocabulary to disambiguate from per-lease exclusivity. Hygiene follow-on, not this note.

8. References

  • docs/spec/resource-isolation-and-exclusivity.md §6.3
  • docs/spec/cpu-isolation-wire-format.md
  • docs/spec/gpu-exclusivity-wire-format.md
  • docs/spec/hierarchical-scheduler.md: cell/orchestrator split; isolation is cell-local
  • The existing score_nodes entry point, which the filter stage precedes
  • The scheduler affinity engine, which shares this pipeline