
Providers and cells

A grafOS cluster is a small set of providers, each with one or more cells. Programs deploy to a provider; the scheduler picks a cell within that provider that has free capacity and a healthy heartbeat; the cell holds the leases your tasklets run under.

If you’re used to Kubernetes, the rough analogy is provider ≈ cluster, cell ≈ node. The model differs in that cells actively advertise their capabilities and identity to the scheduler rather than passively receiving scheduled work.

Providers

A provider is a cloud you can host cells on:

Provider    What
aws         AWS EC2: customer-owned cloud connector, or Tenura-managed via the same scheduler.
gcp         GCP Compute Engine: same model.
azure       Azure VMs: same model.

Providers expose a runtime_readiness state that the scheduler uses to admit deploys. A provider with no healthy cells refuses lease requests with a typed runtime_not_ready error — the program never enters the placed state on a provider with no cell.

You can inspect provider state directly with grafos provider list --json, or watch it live on the admin console at /admin/stats#infrastructure.

Cells

A cell is a process (typically grafos-cell-agent) running on a host that has registered with the scheduler. A cell:

  • Holds an identity. Bound at first-boot via a Tenura-issued cert; verified on every register.
  • Reports inventory. Tells the scheduler what resources it can host (CPU cores, memory, blocks, NICs, GPUs).
  • Sends a heartbeat. Last-seen timestamp the scheduler uses for health.
  • Receives admissions. When the scheduler places a lease here, the cell’s CapBroker mints the token and creates the data-plane binding.
  • Tears down on revoke. Lease expiry / explicit revoke → cell tears down the binding → typed LeaseRevoked event flows back.

Cell state has six readiness requirements (visible via grafos provider status <p> --json):

code_enabled — provider is enabled in scheduler config
durable_provider_cell_record — cell is in the scheduler's persistent state
identity_bound — cell has a verified identity cert
healthy_cell — recent heartbeat, no stuck state
inventory_reported — cell has told us what it can host
outbound_registered — cell can reach the scheduler outbound

All six must be true before the scheduler admits a lease to that cell.
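The all-six rule can be sketched as a small check. This is a minimal illustration, not the scheduler's implementation: it assumes the readiness requirements arrive as a flat mapping of requirement name to boolean, which may not match the actual shape of the grafos provider status <p> --json output. The function names are hypothetical.

```python
# The six readiness requirement names are taken from the list above.
READINESS_REQUIREMENTS = (
    "code_enabled",
    "durable_provider_cell_record",
    "identity_bound",
    "healthy_cell",
    "inventory_reported",
    "outbound_registered",
)


def unmet_requirements(status: dict) -> list:
    """Return the requirements that are missing or not strictly True.

    Assumption: `status` is a flat {name: bool} mapping. The real JSON
    layout is not documented here and may be nested differently.
    """
    return [name for name in READINESS_REQUIREMENTS if status.get(name) is not True]


def is_admissible(status: dict) -> bool:
    """A cell can take leases only when every requirement is met."""
    return not unmet_requirements(status)
```

Reporting the unmet names, rather than a bare boolean, mirrors how the admin console shows readiness reasons for each unhealthy cell.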

How a cell joins a provider

  1. Bootstrap. Operator runs grafos cloud bootstrap-cell --provider <p> --cell-id <id> against a provider-bootstrap JWT minted by the Tenura account API.
  2. Identity bind. The cell receives a Tenura-issued client cert tied to its cell_id. This is the credential it will present on every subsequent register.
  3. Outbound register. The cell contacts https://scheduler.grafos.tenura.systems/api/v1/cells/register and announces itself. Scheduler creates a durable record.
  4. Inventory report. Cell sends an inventory: { total_cpu, total_mem, total_block, gpu_count, ... }.
  5. Heartbeat. Cell starts emitting periodic heartbeats. After ~30s of consistent heartbeats and a reported inventory, the scheduler marks the cell healthy_cell: true.
  6. Conformance (optional outside production). The scheduler can run a conformance suite against the cell that exercises the data-plane bindings end-to-end. This step is required for production cells.

When all six readiness signals are green, the cell is ready and the scheduler admits leases.
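One way to picture the heartbeat step is a tracker that only reports a cell healthy after a sustained run of heartbeats. The sketch below is an assumption-heavy illustration: the 30-second stability window comes from the text above, but the maximum allowed gap between heartbeats (10 seconds here) and the class itself are hypothetical, and the scheduler's real health logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatTracker:
    stable_after_s: float = 30.0     # from the text: ~30s of consistent heartbeats
    max_gap_s: float = 10.0          # hypothetical: longest tolerated silence
    inventory_reported: bool = False
    streak_start: Optional[float] = None
    last_seen: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        """Record a heartbeat; a gap longer than max_gap_s restarts the streak."""
        if self.last_seen is None or now - self.last_seen > self.max_gap_s:
            self.streak_start = now
        self.last_seen = now

    def is_healthy(self, now: float) -> bool:
        """Healthy means: inventory reported, a recent heartbeat, and a long enough streak."""
        if not self.inventory_reported or self.last_seen is None:
            return False
        if now - self.last_seen > self.max_gap_s:
            return False
        return self.last_seen - self.streak_start >= self.stable_after_s
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock.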

How a deploy gets placed

  1. Plan. grafos deploy plan resolves the project’s tasklets, computes their resource requirements, picks a provider (from --provider), and asks the scheduler for an admission decision.
  2. Admit. The scheduler picks the highest-scored healthy cell within that provider. Scoring is locality-aware (prefer cells with free local memory rather than cross-rack memory) and quota-aware.
  3. Lease. The scheduler places leases on the chosen cell — one per declared resource. The cell mints capability tokens and returns connection coordinates.
  4. Run. The cell loads the tasklet WASM module under the leases, runs it, persists output artifacts, and finalizes the run record.
  5. Teardown. When the run finishes (or fails, or hits its TTL), the leases tear down. The cell’s binding state returns to free.
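The Admit step above can be sketched as a filter-then-score pass over the provider's cells. This is only an illustration under stated assumptions: the Cell fields, the memory-only scoring, and the two-tier locality preference are hypothetical stand-ins, since the scheduler's actual scoring function is not described here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Cell:
    cell_id: str
    ready: bool            # all six readiness requirements met
    free_local_mem: int    # hypothetical inventory field, in MiB
    free_remote_mem: int   # hypothetical cross-rack memory, in MiB


def score(cell: Cell, requested_mem: int) -> Optional[Tuple[int, int]]:
    """Return a sortable score, or None if the cell cannot fit the request.

    Locality-aware: a cell that fits the request in local memory always
    outranks one that would have to borrow cross-rack memory.
    """
    if cell.free_local_mem >= requested_mem:
        return (2, cell.free_local_mem)
    if cell.free_local_mem + cell.free_remote_mem >= requested_mem:
        return (1, cell.free_local_mem)
    return None


def pick_cell(cells: Sequence[Cell], requested_mem: int,
              quota_remaining: int) -> Optional[Cell]:
    """Pick the highest-scored ready cell, or None if nothing is admissible."""
    if requested_mem > quota_remaining:   # quota-aware: refuse before scoring
        return None
    candidates = []
    for cell in cells:
        if not cell.ready:
            continue
        s = score(cell, requested_mem)
        if s is not None:
            candidates.append((s, cell))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

A real scheduler would return a typed refusal instead of None; the sketch keeps the return type simple to focus on the selection order.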

The scheduler refusal codes worth knowing:

  • no_eligible_provider_cells — the provider has no cell that meets all readiness requirements.
  • quota_exceeded — your tenant is at quota for one of the requested resources.
  • placement_failed — every candidate cell was vetoed by isolation / locality / attestation requirements.

These are typed: programs see the refusal kind, log it, and surface it to the operator.
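A program-side handler for these refusals might look like the sketch below. The three code strings come from the list above; how a refusal is actually delivered to a program (here, a plain string) and the hint wording are assumptions for illustration.

```python
from enum import Enum


class Refusal(Enum):
    NO_ELIGIBLE_PROVIDER_CELLS = "no_eligible_provider_cells"
    QUOTA_EXCEEDED = "quota_exceeded"
    PLACEMENT_FAILED = "placement_failed"


# Operator-facing hints, paraphrased from the descriptions above.
HINTS = {
    Refusal.NO_ELIGIBLE_PROVIDER_CELLS: "no cell in the provider meets all readiness requirements",
    Refusal.QUOTA_EXCEEDED: "tenant is at quota for a requested resource",
    Refusal.PLACEMENT_FAILED: "every candidate cell was vetoed by isolation, locality, or attestation requirements",
}


def describe_refusal(code: str) -> str:
    """Map a refusal code to a log line; unknown codes are passed through."""
    try:
        kind = Refusal(code)
    except ValueError:
        return f"unrecognized refusal code: {code}"
    return f"{kind.value}: {HINTS[kind]}"
```

Matching on an enum rather than on raw strings means a newly introduced refusal code fails loudly in one place instead of being silently ignored.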

Operating a cell

As a program author, you almost never operate a cell directly. The scheduler manages cells; the operator uses grafos provider and grafos fabric subcommands. But for context:

  • grafos fabric cells --json — full cell list across providers.
  • grafos provider revoke-cell <provider> <cell_id> — operator-initiated cell revocation. Drains leases, marks the cell unhealthy.
  • The admin console at /admin/stats#infrastructure shows fleet shape live, including readiness reasons for each unhealthy cell.

Where to next