Token-Gating Policy

This document defines which fabricBIOS control-plane operations require capability tokens and what verification steps each target must perform.

Design Principles

  1. Every state-mutating operation that creates or extends resource access must be token-gated. This includes LEASE_ALLOC, LEASE_RENEW, TASKLET_SUBMIT, and TASKLET_LEASE_ALLOC.

  2. Cleanup operations (LEASE_FREE) are intentionally ungated. The lease_id itself is the authorization — any holder of a lease_id can free it. This ensures cleanup always works, even when tokens have expired or the minting node is unreachable.

  3. Read-only operations are ungated. PING, GET_IDENTITY, GET_INVENTORY, GET_BUILD_INFO, GET_THERMAL, LEASE_QUERY, and LEASE_LIST_ACTIVE do not require tokens. LEASE_LIST_ACTIVE reveals allocation patterns but is needed for scheduler reconciliation — gating it would break the scheduler’s ability to detect drift.

  4. Nodes are the token authority. Each node mints and verifies its own tokens using its own signing key. The scheduler does not hold node keys and cannot forge tokens — it obtains them by calling CAP_REQUEST on the target node, which decides whether to mint based on fencing, epoch, and local policy. This is the fabricBIOS principle: the node is mechanism (issues tokens, enforces leases), the scheduler is policy (decides who gets what). A compromised scheduler can only obtain tokens from nodes it can reach, and the node’s fencing/epoch checks constrain even that.

    The scheduler always obtains tokens via CAP_REQUEST on the target node; it never mints them itself. After TIME_SYNC, all targets use Unix time, so tokens are technically portable across nodes. However, the design intentionally keeps token minting node-local: no key distribution, no PKI, no trust in external signers. The node remains the sole authority over its own resources.

Operation Token Requirements

| Operation | Token Required | Rationale |
| --- | --- | --- |
| LEASE_ALLOC (0x0200) | Yes | Creates a new resource lease; must prove authorization |
| LEASE_RENEW (0x0202) | Yes | Extends lease lifetime; state mutation equivalent to creation |
| LEASE_FREE (0x0201) | No | Cleanup; lease_id is the auth; must always work |
| LEASE_QUERY (0x0203) | No | Read-only query of lease status |
| LEASE_LIST_ACTIVE (0x0208) | No | Read-only; needed by scheduler reconciliation |
| TASKLET_SUBMIT (0x0500) | Yes | Submits code for execution on a CPU lease |
| TASKLET_STATUS (0x0501) | No | Read-only query of tasklet status |
| TASKLET_FETCH_RESULT (0x0502) | No | Read-only fetch of tasklet output |
| TASKLET_CANCEL (0x0503) | No | Cleanup equivalent; tasklet_id is the auth |
| TASKLET_LEASE_ALLOC (0x0800) | Yes | Creates composite CPU+MEM lease |
| TASKLET_LEASE_FREE (0x0801) | No | Delegates to LEASE_FREE |
| GPU_SUBMIT (0x0600) | No | GPU lease already required a token to create; GPU_SUBMIT operates within that lease. The lease-exists check is sufficient. |
| CAP_REQUEST (0x0100) | No | Token minting; this IS the authorization primitive |
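
The table above can be encoded as a single predicate on the opcode. The following is a minimal sketch, not the fabricBIOS source: the opcode values come from the table, while the `Opcode` enum and the `requires_token` method are hypothetical names chosen for illustration.

```rust
/// Control-plane opcodes, with values copied from the table above.
/// The enum itself is an illustrative assumption, not the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum Opcode {
    CapRequest = 0x0100,
    LeaseAlloc = 0x0200,
    LeaseFree = 0x0201,
    LeaseRenew = 0x0202,
    LeaseQuery = 0x0203,
    LeaseListActive = 0x0208,
    TaskletSubmit = 0x0500,
    TaskletStatus = 0x0501,
    TaskletFetchResult = 0x0502,
    TaskletCancel = 0x0503,
    GpuSubmit = 0x0600,
    TaskletLeaseAlloc = 0x0800,
    TaskletLeaseFree = 0x0801,
}

impl Opcode {
    /// True only for operations that create or extend resource access
    /// (the "Yes" rows of the table). Everything else is ungated.
    pub fn requires_token(self) -> bool {
        matches!(
            self,
            Opcode::LeaseAlloc
                | Opcode::LeaseRenew
                | Opcode::TaskletSubmit
                | Opcode::TaskletLeaseAlloc
        )
    }
}
```

Keeping the policy in one `matches!` expression means a new opcode is ungated unless it is explicitly added to the list, so whether that default is acceptable is a design decision to make deliberately.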

Token Verification Steps (required for all token-gated ops)

Both bare-metal and fabricbiosd must perform these steps in order:

  1. Empty check: reject if req.token.is_empty()
  2. Decode: CapabilityToken::decode(&req.token) → reject on parse failure
  3. Signature verification: token.verify_signature(&verify_key) → reject if invalid
  4. Time bounds: token.verify_time_bounds(now, max_ttl) → reject if expired or future
  5. Audience: token.verify_audience(presenter) → reject if audience doesn’t match (audience=0 is wildcard, accepted by any presenter)
  6. Permissions: token.permissions must include the required permission for the op (WRITE for LEASE_ALLOC/RENEW, WRITE for TASKLET_SUBMIT)
  7. Revocation: check token_id against revocation cache → reject if revoked
  8. Caveats: verify any caveats attached to the token (source IP, time, range, etc.)
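
The ordering of the eight steps can be sketched as one function that returns on the first failure. This is an illustration under stated assumptions, not the real implementation: `Token`, `Ctx`, `Reject`, `PERM_WRITE`, and `verify_token` are hypothetical, the signature and caveat checks are reduced to boolean placeholders, and treating `max_ttl` as a cap on `expires_at - issued_at` is one plausible reading of `verify_time_bounds`.

```rust
use std::collections::HashSet;

/// Hypothetical WRITE permission bit; the real bit layout is not specified here.
pub const PERM_WRITE: u32 = 1 << 1;

/// One rejection reason per verification step.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    Empty,
    Malformed,
    BadSignature,
    TimeBounds,
    Audience,
    Permission,
    Revoked,
    Caveat,
}

/// Stand-in for a decoded capability token (the field set is an assumption).
#[derive(Debug, Clone)]
pub struct Token {
    pub token_id: u64,
    pub issued_at: u64,
    pub expires_at: u64,
    pub audience: u64,
    pub permissions: u32,
    /// Placeholder for the outcome of a real cryptographic signature check.
    pub signature_valid: bool,
    /// Placeholder for the outcome of caveat evaluation.
    pub caveats_satisfied: bool,
}

/// Per-request verification context.
pub struct Ctx<'a> {
    pub now: u64,
    pub max_ttl: u64,
    pub presenter: u64,
    pub required: u32,
    pub revoked: &'a HashSet<u64>,
}

/// Runs steps 1 to 8 in order and stops at the first failure.
pub fn verify_token(
    raw: &[u8],
    decode: impl Fn(&[u8]) -> Option<Token>,
    ctx: &Ctx<'_>,
) -> Result<Token, Reject> {
    // 1. Empty check.
    if raw.is_empty() {
        return Err(Reject::Empty);
    }
    // 2. Decode.
    let token = decode(raw).ok_or(Reject::Malformed)?;
    // 3. Signature (placeholder flag instead of real crypto).
    if !token.signature_valid {
        return Err(Reject::BadSignature);
    }
    // 4. Time bounds: not yet valid, expired, or lifetime above max_ttl.
    let lifetime = token.expires_at.saturating_sub(token.issued_at);
    if token.issued_at > ctx.now || ctx.now >= token.expires_at || lifetime > ctx.max_ttl {
        return Err(Reject::TimeBounds);
    }
    // 5. Audience: 0 is the wildcard.
    if token.audience != 0 && token.audience != ctx.presenter {
        return Err(Reject::Audience);
    }
    // 6. Permissions: every required bit must be present.
    if token.permissions & ctx.required != ctx.required {
        return Err(Reject::Permission);
    }
    // 7. Revocation cache lookup.
    if ctx.revoked.contains(&token.token_id) {
        return Err(Reject::Revoked);
    }
    // 8. Caveats (placeholder flag).
    if !token.caveats_satisfied {
        return Err(Reject::Caveat);
    }
    Ok(token)
}
```

Running the cheap structural checks (empty, decode) before the signature check, and the signature check before any field is trusted, means no token content influences a decision until its authenticity is established.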

GPU_SUBMIT Rationale

GPU_SUBMIT is intentionally ungated by a capability token. The reasoning:

  • Creating a GPU lease (via LEASE_ALLOC) requires a token. The lease-exists check in GPU_SUBMIT confirms the caller has an active GPU lease.
  • GPU_SUBMIT is analogous to FBMU WRITE — it operates within the scope of an existing lease, not creating new resource access.
  • Adding a token requirement to GPU_SUBMIT would require the caller to hold both a lease AND a separate submission token, which adds complexity without meaningfully improving security (the lease already proves authorization).
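
The lease-exists check described above can be pictured as a lookup in the node's lease table. The sketch below is hypothetical: `ResourceKind`, `Lease`, and `gpu_submit_allowed` are illustrative names, and treating "active" as "present, of GPU kind, and not yet expired" is an assumption about what the real check covers.

```rust
use std::collections::HashMap;

/// Illustrative resource kinds; the real set is not listed in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind {
    Cpu,
    Mem,
    Gpu,
}

/// Minimal lease record (assumed fields).
pub struct Lease {
    pub kind: ResourceKind,
    pub expires_at: u64,
}

/// GPU_SUBMIT needs no token: an active GPU lease is the proof of
/// authorization, because creating that lease already required one.
pub fn gpu_submit_allowed(leases: &HashMap<u64, Lease>, lease_id: u64, now: u64) -> bool {
    match leases.get(&lease_id) {
        Some(lease) => lease.kind == ResourceKind::Gpu && now < lease.expires_at,
        None => false,
    }
}
```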

Clock Semantics

  • fabricbiosd: issued_at and expires_at are Unix timestamps (seconds since 1970-01-01). verify_time_bounds() uses SystemTime::now(). TIME_SYNC is accepted (no-op, returns current time).

  • Bare-metal (after TIME_SYNC): issued_at and expires_at are Unix timestamps, derived from monotonic_ticks + offset where offset was set by the first TIME_SYNC from a trusted peer. Tokens are compatible with fabricbiosd tokens.

  • Bare-metal (before TIME_SYNC): CAP_REQUEST and the token-gated operations (LEASE_ALLOC, LEASE_RENEW, TASKLET_SUBMIT) return TimeNotSynced (0x000B). Read-only operations (PING, GET_INVENTORY, GET_THERMAL) work without time sync. The node is honest: “I don’t know what time it is.”

  • Lease durations: Use lease_now() which returns Unix time after TIME_SYNC. Lease expires_at is on the same time base as tokens.

  • Token portability: After TIME_SYNC, all targets use Unix time. Tokens are technically portable across nodes. However, the design intentionally keeps minting node-local — the node is the token authority (see Design Principle 4).
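
The pre-sync behavior on bare-metal amounts to a guard in front of the dispatcher. A minimal sketch follows, assuming hypothetical helper names (`needs_synced_clock`, `precheck`); the opcode values and the 0x000B status come from this document, and only the operations it names as returning TimeNotSynced are listed.

```rust
/// Status code for TimeNotSynced, as given in this document.
pub const TIME_NOT_SYNCED: u16 = 0x000B;

/// Opcodes this document names as unavailable before TIME_SYNC:
/// CAP_REQUEST, LEASE_ALLOC, LEASE_RENEW, TASKLET_SUBMIT.
pub fn needs_synced_clock(opcode: u16) -> bool {
    matches!(opcode, 0x0100 | 0x0200 | 0x0202 | 0x0500)
}

/// Returns Err(TIME_NOT_SYNCED) when a time-dependent op arrives before
/// the clock has been set; all other opcodes pass through unchanged.
pub fn precheck(opcode: u16, clock_synced: bool) -> Result<(), u16> {
    if needs_synced_clock(opcode) && !clock_synced {
        Err(TIME_NOT_SYNCED)
    } else {
        Ok(())
    }
}
```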

Bare-metal clock source chain

On Raspberry Pi 5 bare-metal:

  1. Boot: ARM generic timer starts counting from 0.
  2. First QUIC client sends TIME_SYNC with its Unix time.
  3. set_unix_time(peer_secs): computes offset = peer_secs - monotonic.
  4. now_unix_secs(): returns monotonic + offset (real Unix time).
  5. Before TIME_SYNC: now_unix_secs() returns Err(Unsupported), token-gated ops return TimeNotSynced (0x000B).
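
The chain above reduces to a stored offset that is absent until the first TIME_SYNC. The sketch below is an assumption-laden illustration rather than the actual code: the `Clock` struct and `ClockError` enum are hypothetical, the monotonic reading is passed in as a parameter instead of being read from the ARM generic timer, and ignoring later TIME_SYNC calls reflects the "first TIME_SYNC" wording used in this document.

```rust
/// Error returned while the clock has no offset yet.
#[derive(Debug, PartialEq, Eq)]
pub enum ClockError {
    Unsupported,
}

/// Unix time derived from a monotonic counter plus a peer-supplied offset.
pub struct Clock {
    /// None until the first TIME_SYNC.
    offset: Option<u64>,
}

impl Clock {
    pub const fn new() -> Self {
        Clock { offset: None }
    }

    /// Stores offset = peer_secs - monotonic on the first call only.
    /// `monotonic_secs` is seconds since boot (assumed to come from the timer).
    pub fn set_unix_time(&mut self, peer_secs: u64, monotonic_secs: u64) {
        if self.offset.is_none() {
            self.offset = Some(peer_secs.saturating_sub(monotonic_secs));
        }
    }

    /// Returns monotonic + offset, or Err(Unsupported) before TIME_SYNC.
    pub fn now_unix_secs(&self, monotonic_secs: u64) -> Result<u64, ClockError> {
        self.offset
            .map(|off| monotonic_secs.saturating_add(off))
            .ok_or(ClockError::Unsupported)
    }
}
```

Because the offset is fixed once, subsequent readings advance at the rate of the monotonic counter and cannot jump backwards, which keeps lease and token expiry comparisons stable.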

The Pi5 has an RTC in the DA9091 PMIC, but it is not accessible from bare-metal code (no firmware mailbox tag exposed). TIME_SYNC provides the same function — a trusted peer sets the clock on first connection.

Why not scheduler-minted tokens?

The scheduler could theoretically sign tokens itself (saving a QUIC round-trip per lease). This is not done because:

  • Key distribution: the scheduler would need each node’s signing key, or a separate CA that nodes trust. Both add complexity and attack surface.
  • Separation of mechanism and policy: the node is the authority over its own resources. The scheduler proves authorization by successfully calling CAP_REQUEST — the node decides whether to mint.
  • Compromise containment: a compromised scheduler can only get tokens from nodes it can reach, and the node’s fencing/epoch checks limit even that. If the scheduler held signing keys, it could forge unlimited tokens.

This aligns with the fabricBIOS principle: nodes expose resources and enforce access control; policy lives above.