Providers and cells
A grafOS cluster is a small set of providers, each with one or more cells. Programs deploy to a provider; the scheduler picks a cell within that provider that has free capacity and a healthy heartbeat; the cell holds the leases your tasklets run under.
If you’re used to Kubernetes, the rough analogy is provider ≈ cluster, cell ≈ node. The model differs in that cells actively advertise their capabilities and identity to the scheduler rather than being passively scheduled onto.
Providers
A provider is a cloud you can host cells on:
| Provider | What |
|---|---|
| `aws` | AWS EC2: customer-owned cloud connector or Tenura-managed via the same scheduler. |
| `gcp` | GCP Compute Engine: same model. |
| `azure` | Azure VMs: same model. |
Providers expose a `runtime_readiness` state that the scheduler uses to admit deploys. A provider with no healthy cells refuses lease requests with a typed `runtime_not_ready` error, so the program never enters the placed state on a provider with no cell.
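The admission gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler's implementation: the `Provider` class, its `healthy_cells` field, the `RuntimeNotReady` exception, and `request_lease` are hypothetical names. Only the `runtime_not_ready` error kind and the placed state come from the text.

```python
from dataclasses import dataclass


class RuntimeNotReady(Exception):
    """Hypothetical typed error carrying the runtime_not_ready kind."""

    kind = "runtime_not_ready"


@dataclass
class Provider:
    name: str
    healthy_cells: int  # assumed summary of the provider's readiness state


def request_lease(provider: Provider) -> str:
    """Return the program state after a lease request, or raise a typed refusal."""
    if provider.healthy_cells == 0:
        # Refuse before placement: the program never reaches "placed".
        raise RuntimeNotReady(f"provider {provider.name} has no healthy cell")
    return "placed"
```

Because the refusal is an exception with a fixed `kind`, a caller can branch on the kind rather than parse an error message.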
You can see provider state directly via `grafos provider list --json`, or live on the admin console at `/admin/stats#infrastructure`.
Cells
A cell is a process (typically grafos-cell-agent) running on a host that has registered with the scheduler. A cell:
- Holds an identity. Bound at first-boot via a Tenura-issued cert; verified on every register.
- Reports inventory. Tells the scheduler what resources it can host (CPU cores, memory, blocks, NICs, GPUs).
- Sends a heartbeat. Last-seen timestamp the scheduler uses for health.
- Receives admissions. When the scheduler places a lease here, the cell’s CapBroker mints the token and creates the data-plane binding.
- Tears down on revoke. Lease expiry / explicit revoke → cell tears down the binding → typed `LeaseRevoked` event flows back.
Cell state has six readiness requirements (visible via `grafos provider status <p> --json`):
- `code_enabled`: provider is enabled in scheduler config
- `durable_provider_cell_record`: cell is in the scheduler's persistent state
- `identity_bound`: cell has a verified identity cert
- `healthy_cell`: recent heartbeat, no stuck state
- `inventory_reported`: cell has told us what it can host
- `outbound_registered`: cell can reach the scheduler outbound

All six must be true before the scheduler admits a lease to that cell.
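As a rough sketch, the all-six rule amounts to a conjunction over named flags. The example below assumes the status is available as a flat mapping from requirement name to boolean. That shape is an assumption for illustration and is not the documented output of the status command. The helper names `unmet_requirements` and `is_admissible` are hypothetical.

```python
# The six requirement names, in the order listed above.
READINESS_REQUIREMENTS = (
    "code_enabled",
    "durable_provider_cell_record",
    "identity_bound",
    "healthy_cell",
    "inventory_reported",
    "outbound_registered",
)


def unmet_requirements(status: dict) -> list:
    """Return the requirements that are not explicitly True, in listed order."""
    # A missing key counts as unmet: absence is never treated as readiness.
    return [name for name in READINESS_REQUIREMENTS if status.get(name) is not True]


def is_admissible(status: dict) -> bool:
    """A cell can receive leases only when every requirement is met."""
    return not unmet_requirements(status)
```

Returning the list of unmet names, rather than a bare boolean, lets a caller report which requirement is holding a cell back.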
How a cell joins a provider
- Bootstrap. Operator runs `grafos cloud bootstrap-cell --provider <p> --cell-id <id>` against a provider-bootstrap JWT minted by the Tenura account API.
- Identity bind. The cell receives a Tenura-issued client cert tied to its `cell_id`. This is the credential it will present on every subsequent register.
- Outbound register. The cell contacts `https://scheduler.grafos.tenura.systems/api/v1/cells/register` and announces itself. Scheduler creates a durable record.
- Inventory report. Cell sends an inventory: `{ total_cpu, total_mem, total_block, gpu_count, ... }`.
- Heartbeat. Cell starts emitting periodic heartbeats. After ~30s of consistent heartbeats and inventory, the cell is `healthy_cell: true`.
- Conformance (optional). The scheduler can run a conformance suite against the cell that exercises the data-plane bindings end-to-end. Required for production cells.
When all six readiness signals are green, the cell is ready and the scheduler admits leases.
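The heartbeat step can be illustrated with a small health check. This sketch assumes a cell counts as healthy once an unbroken run of heartbeats spans a 30-second window and inventory has been reported. The `max_gap` tolerance of 10 seconds, the function name `is_healthy`, and the use of plain float timestamps are hypothetical choices. The scheduler's actual rule is not specified here beyond "~30s of consistent heartbeats and inventory".

```python
def is_healthy(heartbeats, now, inventory_reported, window=30.0, max_gap=10.0):
    """Decide health from ascending heartbeat timestamps (seconds).

    window  -- how long the unbroken run of heartbeats must span
    max_gap -- largest silence tolerated between consecutive heartbeats (assumed)
    """
    if not inventory_reported or not heartbeats:
        return False
    if now - heartbeats[-1] > max_gap:
        return False  # the most recent heartbeat is already stale
    # Walk backwards from the newest heartbeat while the gaps stay small.
    streak_start = heartbeats[-1]
    for earlier in reversed(heartbeats[:-1]):
        if streak_start - earlier > max_gap:
            break
        streak_start = earlier
    return now - streak_start >= window
```

A single long silence resets the run, so a cell that drops off and returns has to re-earn the full window before it is considered healthy again.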
How a deploy gets placed
- Plan. `grafos deploy plan` resolves the project’s tasklets, computes their resource requirements, picks a provider (from `--provider`), and asks the scheduler for an admission decision.
- Admit. The scheduler picks the highest-scored healthy cell within that provider. Scoring is locality-aware (prefer cells with free local memory rather than cross-rack memory) and quota-aware.
- Lease. The scheduler places leases on the chosen cell — one per declared resource. The cell mints capability tokens and returns connection coordinates.
- Run. The cell loads the tasklet WASM module under the leases, runs it, persists output artifacts, finalizes the run record.
- Teardown. When the run finishes (or fails, or hits its TTL), the leases tear down. The cell’s binding state returns to free.
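The admit step can be sketched as a score-and-pick over candidate cells. The scoring function here is an assumption chosen to illustrate locality preference: it scores a cell by the fraction of the requested memory it can serve locally. The real scoring formula is not documented in this section, and quota checks are left out. `Cell`, `score`, and `pick_cell` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Cell:
    cell_id: str
    healthy: bool
    free_local_mem: int       # bytes free on the cell's own host
    free_cross_rack_mem: int  # bytes reachable only across racks


def score(cell: Cell, mem_needed: int) -> Optional[float]:
    """Fraction of the request served from local memory, or None if it cannot fit."""
    if mem_needed <= 0:
        raise ValueError("mem_needed must be positive")
    if not cell.healthy:
        return None
    if cell.free_local_mem + cell.free_cross_rack_mem < mem_needed:
        return None
    return min(cell.free_local_mem, mem_needed) / mem_needed


def pick_cell(cells: Iterable[Cell], mem_needed: int) -> Optional[Cell]:
    """Return the highest-scored candidate, or None when no cell qualifies."""
    best, best_score = None, -1.0
    for cell in cells:
        s = score(cell, mem_needed)
        if s is not None and s > best_score:
            best, best_score = cell, s
    return best
```

With this scoring, a cell that can satisfy the whole request from local memory always beats one that would spill across racks, even if the latter has more memory in total.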
The scheduler refusal codes worth knowing:
- `no_eligible_provider_cells`: the provider has no cell that meets all readiness requirements.
- `quota_exceeded`: your tenant is at quota for one of the requested resources.
- `placement_failed`: every candidate cell was vetoed by isolation / locality / attestation requirements.
These are typed: programs see the refusal kind, log it, and surface it to the operator.
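One way a program might consume typed refusals is to map each code onto an enum and branch on it. This is a sketch under assumptions: the `RefusalKind` enum, the `handle_refusal` helper, the `deploy` logger name, and the idea that the code arrives as a plain string are all hypothetical. The three code strings and their meanings are taken from the list above.

```python
import logging
from enum import Enum


class RefusalKind(Enum):
    NO_ELIGIBLE_PROVIDER_CELLS = "no_eligible_provider_cells"
    QUOTA_EXCEEDED = "quota_exceeded"
    PLACEMENT_FAILED = "placement_failed"


# Operator-facing explanations, paraphrasing the list above.
EXPLANATIONS = {
    RefusalKind.NO_ELIGIBLE_PROVIDER_CELLS: "no cell in the provider meets all readiness requirements",
    RefusalKind.QUOTA_EXCEEDED: "the tenant is at quota for a requested resource",
    RefusalKind.PLACEMENT_FAILED: "every candidate cell was vetoed by isolation, locality, or attestation requirements",
}


def handle_refusal(code: str) -> str:
    """Log a refusal and return a message suitable for the operator."""
    kind = RefusalKind(code)  # raises ValueError for an unrecognized code
    logging.getLogger("deploy").warning("scheduler refused: %s", kind.value)
    return f"{kind.value}: {EXPLANATIONS[kind]}"
```

Converting the string to an enum at the boundary means an unrecognized code fails immediately instead of being silently treated as a known case.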
Operating a cell
You almost never operate a cell directly as a program author. The scheduler manages cells; the operator uses the `grafos provider` and `grafos fabric` subcommands. But for context:
- `grafos fabric cells --json`: full cell list across providers.
- `grafos provider revoke-cell <provider> <cell_id>`: operator-initiated cell revocation. Drains leases, marks the cell unhealthy.
- The admin console at `/admin/stats#infrastructure` shows fleet shape live, including readiness reasons for each unhealthy cell.
Where to next
- Trust model: what each cell signs and what the scheduler verifies.
- `/spec/control-plane-tls-plan`: the staged TLS integration story for cell ↔ scheduler.
- The CLI reference under provider and fabric for operator commands.