Hierarchical Scheduler Architecture
Related reading:
docs/spec/resource-isolation-and-exclusivity.md defines the three-axis resource model (capacity / execution mode / isolation-exclusivity) that cell schedulers will need to reason about as Phase 48.5 lands. This scheduler architecture document describes cell-level placement and cross-cell routing; the isolation-and-exclusivity note describes the per-resource policy classes that those schedulers will filter and score on. Today the scheduler is not yet isolation-aware (the design note's §6.3 density / predictability / exclusivity tradeoff axes are not implemented in grafos-scheduler); that work is tracked as Phase 48.5 task #68.
Overview
The fabricBIOS scheduler architecture is hierarchical: multiple independent cell schedulers manage their own fleets, while a lightweight orchestrator routes lease requests to the best cell.
```
              ┌──────────────┐
              │ Orchestrator │  routes lease requests
              │ (stateless)  │  aggregates cell summaries
              └──────┬───────┘
                     │
       ┌─────────────┼─────────────┐
       │             │             │
┌──────┴──────┐  ┌───┴───────┐  ┌──┴───────┐
│   Cell A    │  │  Cell B   │  │ Cell C   │
│ scheduler-a │  │  sched-b  │  │ sched-c  │
│   epoch=3   │  │  epoch=1  │  │ epoch=2  │
└──────┬──────┘  └───┬───────┘  └──┬───────┘
       │             │             │
   ┌───┴───┐     ┌───┴───┐     ┌───┴───┐
   │ nodes │     │ nodes │     │ nodes │
   │a1..aN │     │b1..bN │     │c1..cN │
   └───────┘     └───────┘     └───────┘
```
Cell Model
A cell is an independent scheduling domain:
- Has its own leader epoch, WAL, and reconciliation loop
- Manages a specific set of nodes (discovered via ANNOUNCE or static config)
- Handles admission, placement, token minting, and preemption locally
- Failures are isolated — one cell’s issues don’t affect others
When multiple cells share the same L2 segment (e.g. macvlan on a common
NIC), each cell uses --node-subnet to filter ANNOUNCE discovery to its
own IP range. Rules are CIDR-based with ! prefix for exclusion, evaluated
in order (first match wins). This prevents cells from absorbing each
other’s nodes.
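As an illustration, here is a minimal Go sketch of first-match-wins rule evaluation, assuming --node-subnet takes an ordered list of CIDR rules with ! marking exclusion. The rule separator, the parsing helper names, and the "reject when no rule matches" default are assumptions, not documented behavior.

```go
// Minimal sketch of first-match-wins ANNOUNCE filtering. The rule syntax
// (ordered CIDRs, "!" prefix = exclude) follows the description above; the
// "reject when no rule matches" default is an assumption.
package subnetfilter

import (
	"net/netip"
	"strings"
)

type Rule struct {
	Prefix  netip.Prefix
	Exclude bool
}

// ParseRules parses an ordered rule list, e.g. ["!10.10.0.64/26", "10.10.0.0/24"].
func ParseRules(specs []string) ([]Rule, error) {
	rules := make([]Rule, 0, len(specs))
	for _, s := range specs {
		exclude := strings.HasPrefix(s, "!")
		p, err := netip.ParsePrefix(strings.TrimPrefix(s, "!"))
		if err != nil {
			return nil, err
		}
		rules = append(rules, Rule{Prefix: p, Exclude: exclude})
	}
	return rules, nil
}

// Accept reports whether an ANNOUNCE from addr should be absorbed by this cell.
// Rules are evaluated in order; the first prefix that contains addr decides.
func Accept(rules []Rule, addr netip.Addr) bool {
	for _, r := range rules {
		if r.Prefix.Contains(addr) {
			return !r.Exclude
		}
	}
	return false // assumed default: ignore ANNOUNCEs that match no rule
}
```

With rules ["!10.10.0.64/26", "10.10.0.0/24"], an ANNOUNCE from 10.10.0.70 is rejected (the first rule matches and excludes), while 10.10.0.20 falls through to the second rule and is accepted.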
What is cell-local
| State | Cell-local? | Why |
|---|---|---|
| Leader epoch | Yes | Each cell has independent election |
| Managed nodes | Yes | Nodes belong to one cell |
| Pending/confirmed/unattributed leases | Yes | Leases are on cell-local nodes |
| WAL | Yes | Persists cell-local state |
| Promotion/election | Yes | Cell-local leader |
| Preemption | Yes | Only within cell’s nodes |
| Reconciliation | Yes | Polls cell-local nodes |
What is global (at orchestrator)
| State | Why global |
|---|---|
| Tenant registry | Tenants span cells |
| Global quotas | Quota budgets span cells |
| Cell summaries | Orchestrator reads these for routing |
Cell Summary Contract
Each cell scheduler exposes GET /api/v1/cell/summary:
{ "cell_id": 1, "role": "active", "leader_epoch": 3, "nodes": 4, "resources": [ {"resource_type": 2, "total": 8589934592, "available": 6442450944}, {"resource_type": 1, "total": 16, "available": 12} ], "pending_count": 2, "confirmed_count": 15, "unattributed_count": 0, "admissions": 142, "denials": 3, "healthy": true}The orchestrator polls this every 5-10 seconds. Stale or unreachable cells are excluded from placement.
Two-Level Placement
- Orchestrator receives POST /api/v1/orchestrate/lease
- Checks global tenant quota
- Selects best cell from summaries (spread strategy: most available)
- Forwards POST /api/v1/lease to chosen cell
- Cell scheduler performs real admission, token minting, placement
- Returns token + node to client (via orchestrator)
If the chosen cell denies (capacity changed since summary), the orchestrator retries the next-best cell (up to 3 attempts).
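A sketch of the spread strategy with retry, in Go, reusing the CellSummary type from the summary sketch above; forwardLease is a hypothetical stand-in for the POST /api/v1/lease forward to the chosen cell.

```go
// Sketch of spread placement with retry (same assumed package as the summary
// sketch above). forwardLease is a hypothetical stand-in for forwarding the
// lease request to a specific cell.
package orchestrator

import (
	"errors"
	"sort"
)

func placeWithRetry(cells []CellSummary, resourceType int, amount uint64,
	forwardLease func(cellID int) error) error {

	// Rank candidate cells by available capacity for the requested resource.
	type candidate struct {
		cellID    int
		available uint64
	}
	var candidates []candidate
	for _, c := range cells {
		for _, r := range c.Resources {
			if r.ResourceType == resourceType && r.Available >= amount {
				candidates = append(candidates, candidate{c.CellID, r.Available})
			}
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].available > candidates[j].available
	})

	// The chosen cell may still deny if capacity changed since its summary,
	// so fall through to the next-best cell, up to 3 attempts in total.
	lastErr := errors.New("no eligible cell")
	for i, c := range candidates {
		if i == 3 {
			break
		}
		if err := forwardLease(c.cellID); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}
```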
Quota Across Cells
Default: globally budgeted, cell-local enforcement.
The orchestrator maintains a global quota per tenant. Each cell’s summary reports its lease count and capacity usage. The orchestrator checks the global quota before routing. The cell scheduler does not enforce global quotas — it only enforces local capacity.
This is eventually consistent: a tenant could briefly exceed their global quota if two cells admit simultaneously. The orchestrator detects this on the next summary poll and stops routing to that tenant until usage drops.
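A sketch of what the orchestrator-side bookkeeping could look like, in Go. It assumes per-tenant lease counts can be aggregated from cell summaries on each poll cycle (the summary payload above reports only aggregate counts, so per-tenant attribution and all names here are illustrative, not actual grafos-orchestrator APIs).

```go
// Sketch of globally budgeted, cell-locally enforced quota bookkeeping.
// Hypothetical names; assumes per-tenant lease counts are derived from
// cell summaries on each poll cycle.
package orchestrator

import "sync"

type QuotaTracker struct {
	mu     sync.Mutex
	quota  map[string]uint64 // tenant -> global lease budget
	usage  map[string]uint64 // tenant -> leases observed across all cells
	frozen map[string]bool   // tenants over budget; routing is paused for them
}

func NewQuotaTracker(quota map[string]uint64) *QuotaTracker {
	return &QuotaTracker{
		quota:  quota,
		usage:  make(map[string]uint64),
		frozen: make(map[string]bool),
	}
}

// AdmitRouting is consulted before forwarding a lease request to any cell.
func (q *QuotaTracker) AdmitRouting(tenant string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return !q.frozen[tenant] && q.usage[tenant] < q.quota[tenant]
}

// ObserveSummaries runs after each poll cycle. Because cells admit
// independently, usage can briefly exceed the budget; routing for that
// tenant stops until observed usage drops back under it.
func (q *QuotaTracker) ObserveSummaries(perTenantLeases map[string]uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for tenant, n := range perTenantLeases {
		q.usage[tenant] = n
		q.frozen[tenant] = n > q.quota[tenant]
	}
}
```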
Preemption Policy
Preemption is strictly cell-local:
- A cell scheduler can preempt lower-priority leases within its own nodes
- The orchestrator never triggers cross-cell preemption
- If a cell is full and preemption doesn’t free enough capacity, the orchestrator routes to a different cell
This is intentional: cross-cell preemption would require the orchestrator to understand individual node leases, breaking the abstraction. If cross-cell preemption is needed in the future, it requires a separate design with explicit coordination between cell schedulers.
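For concreteness, here is a sketch of what cell-local victim selection might look like, assuming each confirmed lease carries a priority and a capacity amount. The field names and the greedy lowest-priority-first ordering are illustrative assumptions, not the actual grafos-scheduler policy.

```go
// Sketch of cell-local victim selection: greedily preempt the lowest-priority
// confirmed leases until the request fits. Lease fields and the priority
// convention are assumed for illustration.
package cell

import "sort"

type Lease struct {
	ID       string
	Priority int    // lower value = lower priority (assumed convention)
	Amount   uint64 // capacity held by this lease
}

// selectVictims returns leases to preempt so that `needed` capacity is freed
// for a request at reqPriority, or nil if preemption cannot free enough
// (in which case the cell denies and the orchestrator routes elsewhere).
func selectVictims(confirmed []Lease, needed uint64, reqPriority int) []Lease {
	// Only strictly lower-priority leases are eligible victims.
	var candidates []Lease
	for _, l := range confirmed {
		if l.Priority < reqPriority {
			candidates = append(candidates, l)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})

	var victims []Lease
	var freed uint64
	for _, l := range candidates {
		if freed >= needed {
			break
		}
		victims = append(victims, l)
		freed += l.Amount
	}
	if freed < needed {
		return nil
	}
	return victims
}
```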
Failure Isolation
- One cell’s WAL corruption → only that cell replays/recovers
- One cell’s election failure → only that cell is excluded from routing
- One cell’s node reboot → only that cell reconciles
- Orchestrator failure → cells continue operating independently (clients can connect directly to cell schedulers if they know the cell URL)
Deployment
Each cell runs one grafos-scheduler-service binary with --cell-id:
```
# Cell A
grafos-scheduler-service --bind 0.0.0.0:9100 --cell-id 1 \
  --election-node 10.10.0.169:5701 --state-dir /var/lib/scheduler-a

# Cell B
grafos-scheduler-service --bind 0.0.0.0:9100 --cell-id 2 \
  --election-node 10.10.0.180:5701 --state-dir /var/lib/scheduler-b
```
The orchestrator connects to all cells:
```
grafos-orchestrator --bind 0.0.0.0:9200 \
  --cells http://10.10.0.20:9100,http://10.10.0.30:9100
```
Scale Envelope
| Dimension | Target |
|---|---|
| Nodes per cell | 100-1000 |
| Leases per cell | 10,000+ |
| Cells per orchestrator | 10-100 |
| Orchestrator fan-out | < 100ms per cell poll |
| Cell summary size | < 1 KiB |