Scheduler Isolation/Exclusivity Policy
Status: design decision. Commits how grafos-scheduler reasons about per-lease CPU isolation and GPU exclusivity during placement.
Addendum to: docs/spec/resource-isolation-and-exclusivity.md §6.3.
Builds on: docs/spec/cpu-isolation-wire-format.md and docs/spec/gpu-exclusivity-wire-format.md.
Joint with: the scheduler affinity engine. This note commits the filter/score pipeline shape that affinity work will share.
1. Problem
§6.3 names the three tradeoff axes (density, fairness, predictability),
but grafos-scheduler has no isolation awareness today: the existing
PlacementScorer considers capacity, spread/binpack strategy, cache
locality, and affinity, and none of the isolation/exclusivity axes. The
only existing mentions of isolation in the scheduler concern KV-cache
attach-exclusivity, which is unrelated (cache-lifetime ownership, not
per-lease isolation policy).
This note commits the scheduler’s treatment of the new classes.
2. Decision: filter-then-score pipeline
The scheduler adopts a filter → score → adapt pipeline for placement, matching the shape the affinity engine uses. Isolation and affinity share the same pipeline machinery.
    candidate nodes
           │
           ▼
    ┌──────────────────────┐
    │ hard filters         │  isolation class must be supportable
    │                      │  AND current load must permit it
    │ + affinity required  │  (affinity reuses this stage)
    │ + tenant admission   │
    └──────────┬───────────┘
               │ eligible candidates
               ▼
    ┌──────────────────────┐
    │ scorers              │  existing spread / binpack / cache-aware
    │ (existing + affinity │  + affinity score
    │ score)               │  (isolation is NOT a score dimension; it
    │                      │  is binary at the filter stage)
    └──────────┬───────────┘
               │ ranked candidates
               ▼
    ┌──────────────────────┐
    │ adapt / retry        │  existing cell-retry path for transient
    │                      │  rejections (unchanged)
    └──────────────────────┘

2.1 Why isolation is filter-only, not scored
Per-lease isolation classes are binary: a node either can
guarantee StrictIsolated for a new lease right now, or it cannot.
There is no meaningful “75% isolated” ranking. Treating isolation as a
scoring dimension would invite the exact silent-weakening behavior
design note §6.2 forbids.
Affinity, by contrast, admits a gradient (preferred node vs preferred rack vs anywhere), so it is legitimately scored. That’s why affinity contributes to the scorer stage while isolation contributes only to the filter stage.
3. Filter rules
For each candidate node, the isolation filter checks:
3.1 CPU isolation (TLV_LEASE_CPU_ISOLATION)
| Class | Filter behavior |
|---|---|
| BestEffort | No filter (all nodes pass). |
| WholeCore | Node must advertise WholeCore support AND have ≥1 uncommitted whole core. |
| StrictIsolated | Node must advertise StrictIsolated support AND have a topology-isolable core available. |
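
The CPU rules in the table above can be sketched as a single predicate. This is a minimal illustration, not the scheduler's implementation: `NodeCpuState` and all of its fields are hypothetical stand-ins for whatever per-node inventory the scheduler ends up holding, and `cpu_filter_passes` is an assumed name.

```rust
/// Per-lease CPU isolation class, named as in the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CpuIsolationPolicy {
    BestEffort,
    WholeCore,
    StrictIsolated,
}

/// Hypothetical per-node view; every field name here is an assumption.
#[derive(Clone, Debug, Default)]
pub struct NodeCpuState {
    pub advertises_whole_core: bool,
    pub advertises_strict_isolated: bool,
    pub uncommitted_whole_cores: u32,
    pub topology_isolable_cores: u32,
}

/// Returns true if the node passes the CPU isolation filter for `class`.
/// The result is binary by design: there is no partial pass.
pub fn cpu_filter_passes(class: CpuIsolationPolicy, node: &NodeCpuState) -> bool {
    match class {
        // BestEffort: no filter, all nodes pass.
        CpuIsolationPolicy::BestEffort => true,
        // WholeCore: class advertised AND at least one uncommitted whole core.
        CpuIsolationPolicy::WholeCore => {
            node.advertises_whole_core && node.uncommitted_whole_cores >= 1
        }
        // StrictIsolated: class advertised AND a topology-isolable core is free.
        CpuIsolationPolicy::StrictIsolated => {
            node.advertises_strict_isolated && node.topology_isolable_cores >= 1
        }
    }
}
```

Both conditions are checked together because support without current capacity is still a rejection for this request.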
3.2 GPU exclusivity (TLV_LEASE_GPU_EXCLUSIVITY)
| Class | Filter behavior |
|---|---|
| Shared | Node's daemon mode must permit sharing. Under --gpu-share-mode exclusive this is rejected per the GPU exclusivity wire format §4. |
| SessionExclusive | Node must support v1 GpuSession AND have a device currently unclaimed by another session. |
| DeviceExclusive | Node must have a GPU with zero active leases. |
| PartitionExclusive | Node must expose partitionable GPU (future MIG-as-class; no candidates today). |
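
The GPU rules can be sketched the same way. The types below are assumptions made for illustration: `NodeGpuState`, `GpuDevice`, and their fields are hypothetical, and the daemon mode is reduced to a single boolean because this note names only the `exclusive` value of `--gpu-share-mode`.

```rust
/// Per-lease GPU exclusivity class, named as in the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GpuExclusivity {
    Shared,
    SessionExclusive,
    DeviceExclusive,
    PartitionExclusive,
}

/// Hypothetical per-device state; field names are assumptions.
#[derive(Clone, Debug, Default)]
pub struct GpuDevice {
    pub active_leases: u32,
    pub claimed_by_session: bool,
    pub partitionable: bool,
}

/// Hypothetical per-node GPU view; field names are assumptions.
#[derive(Clone, Debug, Default)]
pub struct NodeGpuState {
    /// False when the daemon runs under `--gpu-share-mode exclusive`.
    pub daemon_permits_sharing: bool,
    pub supports_gpu_session_v1: bool,
    pub devices: Vec<GpuDevice>,
}

/// Returns true if the node passes the GPU exclusivity filter for `class`.
pub fn gpu_filter_passes(class: GpuExclusivity, node: &NodeGpuState) -> bool {
    match class {
        GpuExclusivity::Shared => node.daemon_permits_sharing,
        GpuExclusivity::SessionExclusive => {
            node.supports_gpu_session_v1
                && node.devices.iter().any(|d| !d.claimed_by_session)
        }
        GpuExclusivity::DeviceExclusive => {
            node.devices.iter().any(|d| d.active_leases == 0)
        }
        // No node exposes partitionable GPUs today, so this never passes in practice.
        GpuExclusivity::PartitionExclusive => {
            node.devices.iter().any(|d| d.partitionable)
        }
    }
}
```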
3.3 Node advertisement dependency
The filter cannot run until nodes advertise which classes they honor.
That inventory work (per-class CPU and GPU advertisement) is deferred.
Until inventory advertisement lands, the scheduler MUST assume each
node supports only BestEffort / Shared, and MUST fail-closed-reject
any placement request with a stricter class. This is intentional:
fail-closed discipline from §6.2 applies to the scheduler itself, not
just the node-side handler.
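
A minimal sketch of the fail-closed default, assuming the missing advertisement is modeled as `None`. `ClassAdvertisement` and `node_may_honor` are hypothetical names, and the sketch further assumes that a real advertisement, once it exists, lists every class the node honors, including the baseline ones.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CpuIsolationPolicy {
    BestEffort,
    WholeCore,
    StrictIsolated,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GpuExclusivity {
    Shared,
    SessionExclusive,
    DeviceExclusive,
    PartitionExclusive,
}

/// Hypothetical shape of the deferred per-node class advertisement.
pub struct ClassAdvertisement {
    pub cpu: Vec<CpuIsolationPolicy>,
    pub gpu: Vec<GpuExclusivity>,
}

/// Decides whether a node may be considered for the requested classes.
/// With no advertisement, only BestEffort / Shared is assumed; any
/// stricter class is rejected rather than silently weakened.
pub fn node_may_honor(
    advert: Option<&ClassAdvertisement>,
    cpu: CpuIsolationPolicy,
    gpu: GpuExclusivity,
) -> bool {
    match advert {
        None => cpu == CpuIsolationPolicy::BestEffort && gpu == GpuExclusivity::Shared,
        Some(a) => a.cpu.contains(&cpu) && a.gpu.contains(&gpu),
    }
}
```

The important property is the `None` arm: absence of information is treated as absence of support, never as permission.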
4. Interaction with tenant priority
Per-lease isolation is orthogonal to tenant::Priority tiers
(Guaranteed / BestEffort / Scavenger). A Scavenger tenant may request
StrictIsolated; a Guaranteed tenant may request BestEffort.
The interaction rule:
- Tenant priority sets admission order and eviction order under contention (existing behavior, unchanged).
- Isolation filter sets which nodes are eligible for a given request (new behavior from this note).
A Scavenger tenant requesting StrictIsolated is eligible for
placement if nodes support the class, but may be preempted by a
Guaranteed tenant requesting any class on the same node. Preemption
behavior matches the existing per-cell preemption model — this note
does not change it.
Naming collision warning: CpuIsolationPolicy::BestEffort (per-lease
CPU isolation class) is UNRELATED to tenant::Priority::BestEffort
(tenant admission tier). The scheduler must not conflate them. Renaming
is a deferred hygiene cleanup.
5. Rejection reasons
Scheduler rejections for isolation-filter failures surface as structured events with distinct reason codes:
- NoNodeSupportsClass: no candidate node advertises the requested class. Permanent (won't retry without operator action).
- NodesSupportButContended: nodes advertise the class but all are currently at capacity for that class. Transient (scheduler may retry via the existing cell-retry path).
- ClassConflictsWithDaemonMode: e.g. Shared requested under --gpu-share-mode exclusive. Permanent from the client's perspective; operator-action-required.
- ClassConflictsWithResourceId: e.g. DeviceExclusive on a MIG sub-device. Permanent; client must correct the request.
Rejection events are emitted via the existing scheduler event stream. No new transport is added.
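
One way to picture the reason codes is an enum with a transient/permanent predicate that the retry path could consult. The variant names come from this section; the enum name `IsolationRejection`, the `is_transient` method, and the `classify` helper with its two counts are hypothetical and cover only the first two reasons.

```rust
/// Reason codes for isolation-filter rejections, as listed in this section.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IsolationRejection {
    NoNodeSupportsClass,
    NodesSupportButContended,
    ClassConflictsWithDaemonMode,
    ClassConflictsWithResourceId,
}

impl IsolationRejection {
    /// Only contention is transient; every other reason needs operator
    /// or client action before a retry can succeed.
    pub fn is_transient(self) -> bool {
        matches!(self, IsolationRejection::NodesSupportButContended)
    }
}

/// Hypothetical helper: derive a reason from how many candidate nodes
/// advertise the class and how many of those currently have capacity.
/// Returns None when at least one node is eligible.
pub fn classify(
    supporting_nodes: usize,
    nodes_with_capacity: usize,
) -> Option<IsolationRejection> {
    if supporting_nodes == 0 {
        Some(IsolationRejection::NoNodeSupportsClass)
    } else if nodes_with_capacity == 0 {
        Some(IsolationRejection::NodesSupportButContended)
    } else {
        None
    }
}
```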
6. Shared pipeline with affinity
The scheduler affinity engine uses the same filter/score/adapt pipeline for hard and soft affinity constraints. This note commits that isolation uses the same pipeline, not a parallel one. Concretely:
- The filter stage is a single Vec<Box<dyn PlacementFilter>>. Isolation contributes an IsolationFilter; affinity contributes an AffinityRequiredFilter.
- The score stage is a single Vec<Box<dyn PlacementScorer>> of weighted scorers. Affinity contributes an AffinityScorer; isolation contributes nothing (isolation is binary, not scored).
- The adapt stage is the existing cell-retry path.
Implementation order: the isolation work lands the pipeline skeleton
plus IsolationFilter; the affinity work slots in
AffinityRequiredFilter and AffinityScorer on top. Neither owns
the pipeline alone.
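
A rough sketch of the shared skeleton, under stated assumptions. The trait names PlacementFilter and PlacementScorer and the IsolationFilter type come from this note, but their method signatures, the `Node` struct, the `Pipeline` struct, and `CapacityScorer` (a stand-in for the existing scorers) are all hypothetical. The adapt stage is omitted because it is the unchanged cell-retry path.

```rust
/// Hypothetical candidate node; fields are illustrative only.
pub struct Node {
    pub name: String,
    pub supports_requested_class: bool,
    pub free_capacity: f64,
}

/// Binary stage: a node is either eligible or not.
pub trait PlacementFilter {
    fn permits(&self, node: &Node) -> bool;
}

/// Graded stage: each scorer contributes weight * score.
pub trait PlacementScorer {
    fn weight(&self) -> f64;
    fn score(&self, node: &Node) -> f64;
}

pub struct IsolationFilter;
impl PlacementFilter for IsolationFilter {
    fn permits(&self, node: &Node) -> bool {
        node.supports_requested_class
    }
}

/// Stand-in for an existing scorer; not part of this note.
pub struct CapacityScorer;
impl PlacementScorer for CapacityScorer {
    fn weight(&self) -> f64 {
        1.0
    }
    fn score(&self, node: &Node) -> f64 {
        node.free_capacity
    }
}

pub struct Pipeline {
    pub filters: Vec<Box<dyn PlacementFilter>>,
    pub scorers: Vec<Box<dyn PlacementScorer>>,
}

impl Pipeline {
    /// Applies every filter, then ranks survivors by weighted score,
    /// highest first.
    pub fn rank<'a>(&self, candidates: &'a [Node]) -> Vec<&'a Node> {
        let mut eligible: Vec<(&'a Node, f64)> = candidates
            .iter()
            .filter(|n| self.filters.iter().all(|f| f.permits(n)))
            .map(|n| {
                let total: f64 = self.scorers.iter().map(|s| s.weight() * s.score(n)).sum();
                (n, total)
            })
            .collect();
        eligible.sort_by(|a, b| b.1.total_cmp(&a.1));
        eligible.into_iter().map(|(n, _)| n).collect()
    }
}
```

In this shape, the affinity work would add an AffinityRequiredFilter to `filters` and an AffinityScorer to `scorers` without touching `rank`, which is the sense in which neither side owns the pipeline alone.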
7. What this note does NOT commit to
- Implementation. The IsolationFilter, pipeline refactor, rejection-reason event variants, and tests are deferred to a separate follow-on wave.
- Affinity filter/scorer details. This note commits the pipeline shape; the affinity specifics belong to the affinity work.
- Preemption policy changes. Existing per-cell preemption is unchanged.
- Cross-cell placement based on isolation. The orchestrator continues to route by cell summaries without isolation-awareness; isolation is enforced at the cell scheduler. The orchestrator MAY grow isolation-aware cell summary fields in a later wave; out of scope here.
- Rename of attach_exclusive / cache-exclusive vocabulary to disambiguate from per-lease exclusivity. Hygiene follow-on, not this note.
8. Cross-links
- docs/spec/resource-isolation-and-exclusivity.md §6.3
- docs/spec/cpu-isolation-wire-format.md
- docs/spec/gpu-exclusivity-wire-format.md
- docs/spec/hierarchical-scheduler.md: cell/orchestrator split; isolation is cell-local
- The existing score_nodes entry point the filter stage precedes
- The scheduler affinity engine shares this pipeline