
Hierarchical Scheduler Architecture

Related reading: docs/spec/resource-isolation-and-exclusivity.md defines the three-axis resource model (capacity / execution mode / isolation-exclusivity) that cell schedulers will need to reason about as Phase 48.5 lands. This scheduler architecture document describes cell-level placement and cross-cell routing; the isolation-and-exclusivity note describes the per-resource policy classes that those schedulers will filter and score on. Today the scheduler is not yet isolation-aware (the design note’s §6.3 density / predictability / exclusivity tradeoff axes are not implemented in grafos-scheduler); that work is tracked as Phase 48.5 task #68.

Overview

The fabricBIOS scheduler architecture is hierarchical: multiple independent cell schedulers manage their own fleets, while a lightweight orchestrator routes lease requests to the best cell.

               ┌──────────────┐
               │ Orchestrator │  routes lease requests
               │ (stateless)  │  aggregates cell summaries
               └──────┬───────┘
       ┌──────────────┼─────────────┐
       │              │             │
┌──────┴──────┐  ┌────┴─────┐  ┌────┴─────┐
│   Cell A    │  │  Cell B  │  │  Cell C  │
│ scheduler-a │  │ sched-b  │  │ sched-c  │
│   epoch=3   │  │ epoch=1  │  │ epoch=2  │
└──────┬──────┘  └────┬─────┘  └────┬─────┘
       │              │             │
   ┌───┴───┐      ┌───┴───┐     ┌───┴───┐
   │ nodes │      │ nodes │     │ nodes │
   │ a1..aN│      │ b1..bN│     │ c1..cN│
   └───────┘      └───────┘     └───────┘

Cell Model

A cell is an independent scheduling domain:

  • Has its own leader epoch, WAL, and reconciliation loop
  • Manages a specific set of nodes (discovered via ANNOUNCE or static config)
  • Handles admission, placement, token minting, and preemption locally
  • Failures are isolated — one cell’s issues don’t affect others

When multiple cells share the same L2 segment (e.g. macvlan on a common NIC), each cell uses --node-subnet to filter ANNOUNCE discovery to its own IP range. Rules are CIDR-based, with a ! prefix for exclusion, and are evaluated in order (first match wins). This prevents cells from absorbing each other’s nodes.
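As an illustration, a first-match-wins evaluation of such rules could look like the sketch below. The type and function names are illustrative, not the actual grafos-scheduler code, and the default of ignoring ANNOUNCEs that match no rule is an assumption:

use std::net::Ipv4Addr;

/// One --node-subnet rule: an optional `!` prefix (exclude) followed by an IPv4 CIDR.
/// Illustrative types only; not the actual grafos-scheduler implementation.
struct SubnetRule {
    exclude: bool,
    network: u32, // network address in host byte order
    mask: u32,    // e.g. /24 -> 0xffff_ff00
}

impl SubnetRule {
    fn parse(rule: &str) -> Option<Self> {
        let (exclude, cidr) = match rule.strip_prefix('!') {
            Some(rest) => (true, rest),
            None => (false, rule),
        };
        let (addr, prefix) = cidr.split_once('/')?;
        let addr: Ipv4Addr = addr.parse().ok()?;
        let prefix: u32 = prefix.parse().ok().filter(|p| *p <= 32)?;
        let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
        Some(SubnetRule { exclude, network: u32::from(addr) & mask, mask })
    }

    fn matches(&self, ip: Ipv4Addr) -> bool {
        u32::from(ip) & self.mask == self.network
    }
}

/// Rules are evaluated in order; the first matching rule decides. An ANNOUNCE whose
/// source IP matches no rule is assumed to be ignored, so a cell never absorbs
/// another cell's nodes.
fn accept_announce(rules: &[SubnetRule], source: Ipv4Addr) -> bool {
    for rule in rules {
        if rule.matches(source) {
            return !rule.exclude;
        }
    }
    false
}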

What is cell-local

State                                   Cell-local?   Why
Leader epoch                            Yes           Each cell has independent election
Managed nodes                           Yes           Nodes belong to one cell
Pending/confirmed/unattributed leases   Yes           Leases are on cell-local nodes
WAL                                     Yes           Persists cell-local state
Promotion/election                      Yes           Cell-local leader
Preemption                              Yes           Only within the cell’s nodes
Reconciliation                          Yes           Polls cell-local nodes

What is global (at orchestrator)

State              Why global
Tenant registry    Tenants span cells
Global quotas      Quota budgets span cells
Cell summaries     Orchestrator reads these for routing

Cell Summary Contract

Each cell scheduler exposes GET /api/v1/cell/summary:

{
  "cell_id": 1,
  "role": "active",
  "leader_epoch": 3,
  "nodes": 4,
  "resources": [
    {"resource_type": 2, "total": 8589934592, "available": 6442450944},
    {"resource_type": 1, "total": 16, "available": 12}
  ],
  "pending_count": 2,
  "confirmed_count": 15,
  "unattributed_count": 0,
  "admissions": 142,
  "denials": 3,
  "healthy": true
}

The orchestrator polls this every 5-10 seconds. Stale or unreachable cells are excluded from placement.
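For reference, the payload maps onto a flat summary type on the orchestrator side. The serde-based sketch below is an illustration of that shape, not the actual grafos wire types; the freshness check is likewise an assumption about how staleness exclusion could be expressed:

use serde::Deserialize;
use std::time::Duration;

/// Shape of the GET /api/v1/cell/summary payload shown above (field types are assumptions).
#[derive(Debug, Deserialize)]
struct ResourceSummary {
    resource_type: u32,
    total: u64,
    available: u64,
}

#[derive(Debug, Deserialize)]
struct CellSummary {
    cell_id: u32,
    role: String, // "active" in the example; other roles are not shown here
    leader_epoch: u64,
    nodes: u32,
    resources: Vec<ResourceSummary>,
    pending_count: u64,
    confirmed_count: u64,
    unattributed_count: u64,
    admissions: u64,
    denials: u64,
    healthy: bool,
}

/// The orchestrator only routes to cells whose latest summary is fresh and healthy.
fn is_routable(summary: &CellSummary, age: Duration, max_age: Duration) -> bool {
    summary.healthy && age <= max_age
}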

Two-Level Placement

  1. Orchestrator receives POST /api/v1/orchestrate/lease
  2. Checks global tenant quota
  3. Selects best cell from summaries (spread strategy: most available)
  4. Forwards POST /api/v1/lease to chosen cell
  5. Cell scheduler performs real admission, token minting, placement
  6. Returns token + node to client (via orchestrator)

If the chosen cell denies (capacity changed since summary), the orchestrator retries the next-best cell (up to 3 attempts).
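Put together, the routing step might look like the following sketch, reusing the CellSummary type from the summary sketch above. The result types and the forward_lease callback are placeholders for the orchestrator's HTTP client, not its real API:

/// Placeholder result types for a lease request forwarded to a cell.
struct LeaseGrant { token: String, node: String }
enum LeaseDenied { NoCapacity, CellUnreachable }

/// Spread strategy sketch: rank healthy cells by available capacity for the requested
/// resource type and try them best-first, up to three attempts.
fn route_lease(
    summaries: &[CellSummary],
    resource_type: u32,
    forward_lease: impl Fn(u32) -> Result<LeaseGrant, LeaseDenied>,
) -> Result<LeaseGrant, LeaseDenied> {
    let mut candidates: Vec<&CellSummary> =
        summaries.iter().filter(|c| c.healthy).collect();

    // "Most available" spread: sort descending by the cell's available amount of the resource.
    candidates.sort_by_key(|c| {
        std::cmp::Reverse(
            c.resources
                .iter()
                .find(|r| r.resource_type == resource_type)
                .map(|r| r.available)
                .unwrap_or(0),
        )
    });

    let mut last = LeaseDenied::NoCapacity;
    for cell in candidates.iter().take(3) {
        match forward_lease(cell.cell_id) {
            Ok(grant) => return Ok(grant), // the cell admitted: token + node flow back via the orchestrator
            Err(e) => last = e,            // capacity changed since the summary; try the next-best cell
        }
    }
    Err(last)
}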

Quota Across Cells

Default: quota budgets are global; admission enforcement is cell-local.

The orchestrator maintains a global quota per tenant. Each cell’s summary reports its lease count and capacity usage. The orchestrator checks the global quota before routing. The cell scheduler does not enforce global quotas — it only enforces local capacity.

This is eventually consistent: a tenant could briefly exceed its global quota if two cells admit simultaneously. The orchestrator detects this on the next summary poll and stops routing new requests for that tenant until usage drops.
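A minimal sketch of that gate follows, assuming the orchestrator can attribute observed leases to tenants (for example from its own routing log, since the summary above is not broken down per tenant). Names are illustrative:

use std::collections::HashMap;

/// Orchestrator-side global quota gate (sketch). Usage is rebuilt from periodic
/// observations, so it is eventually consistent: two cells admitting at the same moment
/// can briefly push a tenant over budget, and the overshoot is only noticed on the next refresh.
struct GlobalQuota {
    budget: HashMap<String, u64>, // tenant -> maximum leases across all cells
    usage: HashMap<String, u64>,  // tenant -> leases observed at the last refresh
}

impl GlobalQuota {
    /// Checked before routing: tenants at or over budget get no new placements.
    fn allow(&self, tenant: &str) -> bool {
        let used = self.usage.get(tenant).copied().unwrap_or(0);
        let max = self.budget.get(tenant).copied().unwrap_or(0);
        used < max
    }

    /// Called after each summary poll with freshly aggregated per-tenant counts.
    fn refresh(&mut self, observed: HashMap<String, u64>) {
        self.usage = observed;
    }
}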

Preemption Policy

Preemption is strictly cell-local:

  • A cell scheduler can preempt lower-priority leases within its own nodes
  • The orchestrator never triggers cross-cell preemption
  • If a cell is full and preemption doesn’t free enough capacity, the orchestrator routes to a different cell

This is intentional: cross-cell preemption would require the orchestrator to understand individual node leases, breaking the abstraction. If cross-cell preemption is needed in the future, it requires a separate design with explicit coordination between cell schedulers.
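The boundary can be summarised in one function: preemption candidates only ever come from leases on the cell's own nodes. This is a sketch with illustrative types, not the scheduler's real data model:

/// A lease as seen by one cell scheduler (illustrative fields only).
struct LocalLease { lease_id: u64, node: String, priority: u8 }

/// Cell-local preemption: only leases on nodes this cell manages, and only those with a
/// lower priority than the incoming request, are ever candidates. The orchestrator never
/// invokes this across cells; if preemption cannot free enough capacity, the request is
/// simply routed to a different cell.
fn preemption_candidates<'a>(
    cell_leases: &'a [LocalLease],
    managed_nodes: &[String],
    incoming_priority: u8,
) -> Vec<&'a LocalLease> {
    cell_leases
        .iter()
        .filter(|l| managed_nodes.contains(&l.node))
        .filter(|l| l.priority < incoming_priority)
        .collect()
}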

Failure Isolation

  • One cell’s WAL corruption → only that cell replays/recovers
  • One cell’s election failure → only that cell is excluded from routing
  • One cell’s node reboot → only that cell reconciles
  • Orchestrator failure → cells continue operating independently (clients can connect directly to cell schedulers if they know the cell URL)

Deployment

Each cell runs one grafos-scheduler-service binary with --cell-id:

# Cell A
grafos-scheduler-service --bind 0.0.0.0:9100 --cell-id 1 \
  --election-node 10.10.0.169:5701 --state-dir /var/lib/scheduler-a

# Cell B
grafos-scheduler-service --bind 0.0.0.0:9100 --cell-id 2 \
  --election-node 10.10.0.180:5701 --state-dir /var/lib/scheduler-b

The orchestrator connects to all cells:

grafos-orchestrator --bind 0.0.0.0:9200 \
  --cells http://10.10.0.20:9100,http://10.10.0.30:9100

Scale Envelope

Dimension                 Target
Nodes per cell            100-1000
Leases per cell           10,000+
Cells per orchestrator    10-100
Orchestrator fan-out      < 100 ms per cell poll
Cell summary size         < 1 KiB