
Recipe 15: Observing Everything (The Meta-Recipe)

Situation

Production failures are rarely “one thing”. In systems with multiple resource types, you usually need:

  • CPU / memory / storage / network metrics;
  • Causality (what happened around the spike?);
  • A durable, tamper-evident record (what was the actual state at the moment the operator paged us?).

In a traditional stack these come from different libraries and different namespaces. In grafOS the lease API is the choke point for resource use, and there are three correlated observability layers above it. Instrument the lease layer once and you observe everything that uses leases.

What You Build

A unified observability surface across three layers, all keyed on the same lease lifecycle events:

| Layer | API | Loss profile | Best for |
| --- | --- | --- | --- |
| Metrics | grafos_observe::FabricMetrics | aggregate-only | “is this happening a lot?” |
| Events | grafos_observe::EventRingBuffer + FabricEvent | lossy (bounded ring) | “what was happening around the spike?” |
| Audit chain | grafos_audit::assemble_record + AuditRecord | tamper-evident, durable when anchor persists | “exactly what was sealed at that moment?” |

Every lease lifecycle event hits all three. Recipes 55 (Consuming the Audit Chain) and 60 (Tenant Audit Dashboard) consume the chain this recipe produces.

The compiled recipe lives in cookbook/recipe-15-observing-everything.

Building Blocks

  • grafos_observe::{FabricMetrics, EventRingBuffer, FabricEvent, OpType, ResourceType}
  • grafos_observe::prometheus::PrometheusExporter (feature-gated)
  • grafos_observe::json_log::JsonEventSink (feature-gated)
  • grafos_audit::{assemble_record, AnchorStore, AuditInput, AuditRecord, Signer, NullSigner, MemoryAnchorStore}
  • grafos_core::{AuditEventKind, WorkloadIdentity}

See:

Design

Layer 1: Metrics

Aggregate counters and histograms: fast, cheap to scrape, and aggregate-only, so per-event detail is lost.

At minimum:

  • active leases by resource type
  • acquire/drop counts
  • operation latency histograms (read / write / submit)
  • error counts by FabricError variant
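To make the metric shapes above concrete, here is a minimal std-only sketch of lease counters and a bucketed latency histogram. `SketchMetrics` and all of its methods are hypothetical names for illustration; this is not the `grafos_observe::FabricMetrics` API, and it does not reproduce that type's overflow behaviour.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a metrics surface (NOT grafos_observe::FabricMetrics).
pub struct SketchMetrics {
    leases_acquired: AtomicU64,
    leases_dropped: AtomicU64,
    /// Inclusive upper bounds of each latency bucket in microseconds, ascending.
    bounds_us: Vec<u64>,
    /// One counter per bound, plus a final bucket for samples above every bound.
    buckets: Vec<AtomicU64>,
}

impl SketchMetrics {
    pub fn new(bounds_us: Vec<u64>) -> Self {
        let buckets = (0..=bounds_us.len()).map(|_| AtomicU64::new(0)).collect();
        Self {
            leases_acquired: AtomicU64::new(0),
            leases_dropped: AtomicU64::new(0),
            bounds_us,
            buckets,
        }
    }

    pub fn lease_acquired(&self) {
        self.leases_acquired.fetch_add(1, Ordering::Relaxed);
    }

    pub fn lease_dropped(&self) {
        self.leases_dropped.fetch_add(1, Ordering::Relaxed);
    }

    /// Active leases = acquired - dropped; saturating so a racy read never underflows.
    pub fn active_leases(&self) -> u64 {
        self.leases_acquired
            .load(Ordering::Relaxed)
            .saturating_sub(self.leases_dropped.load(Ordering::Relaxed))
    }

    /// Count one latency sample in the first bucket whose bound is >= the sample.
    pub fn observe_latency_us(&self, us: u64) {
        let idx = self
            .bounds_us
            .iter()
            .position(|&bound| us <= bound)
            .unwrap_or(self.bounds_us.len());
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
    }

    pub fn bucket_counts(&self) -> Vec<u64> {
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).collect()
    }
}
```

The sketch keeps only totals per bucket, which is what makes this layer cheap to scrape and also why it cannot answer per-lease questions on its own.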

Layer 2: Events

Typed FabricEvent records in a bounded ring buffer. The shape:

FabricEvent::LeaseAcquired { resource_type, lease_id, node, bytes, trace_id }
FabricEvent::LeaseExpired { resource_type, lease_id, node }
FabricEvent::Disconnected { node, reason }
... (closed-set enum)

Lossy by design — when the ring wraps, the oldest events drop. The ring is for short-window in-process correlation: “what happened in the 60 seconds around the spike?”.
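The wrap behaviour can be sketched with a bounded queue that evicts the oldest entry on overflow. `SketchRing` is a hypothetical type built on `std::collections::VecDeque`; it only illustrates the drop-oldest policy and makes no claim about how `grafos_observe::EventRingBuffer` is implemented.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded ring (NOT grafos_observe::EventRingBuffer).
pub struct SketchRing<T> {
    capacity: usize,
    items: VecDeque<T>,
    dropped: u64,
}

impl<T> SketchRing<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            capacity,
            items: VecDeque::with_capacity(capacity),
            dropped: 0,
        }
    }

    /// Push an event; when the ring is full, evict the oldest event first.
    pub fn push(&mut self, item: T) {
        if self.items.len() == self.capacity {
            self.items.pop_front();
            self.dropped += 1;
        }
        self.items.push_back(item);
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// Number of events lost to wrap-around since creation.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }

    /// Iterate from oldest to newest surviving event.
    pub fn iter(&self) -> impl Iterator<Item = &T> {
        self.items.iter()
    }
}
```

Tracking a `dropped` count is a choice made for this sketch: it lets an operator tell whether the correlation window they care about has already been overwritten.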

Layer 3: Audit chain

grafos_audit::assemble_record produces an AuditRecord with a typed AuditEventKind (lease_allocated, lease_expired, preempted, edge_rewritten, …) and the prior chain head sealed into a SHA-256 hash. Tamper-evident: any byte-level change after seal breaks verify_chain. The AnchorStore persists the chain head atomically; on restart the consumer resumes from the persisted anchor.

Three properties the audit chain has that the metrics / events layers do not:

  • Hash linkage: no record can be added, removed, reordered, or modified after seal without detection.
  • Typed payloads: AuditEventData::{LeasePreempted, RevokeStateTransition, BundleAdmissionDecided, EdgeRewritten, ...} carry the structured fields downstream consumers pattern-match on.
  • Cross-process readable: serialize to JSONL via grafos_audit::jsonl::write_record, ingest via the reference collector at crates/grafos-audit-collector.
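The hash-linkage property can be illustrated with a toy chain. This sketch is an assumption-laden stand-in: it uses the standard library's non-cryptographic `DefaultHasher` in place of SHA-256 so that it needs no external crates, and `SketchRecord`, `seal`, and `verify` are hypothetical names, not the `grafos_audit` API. It shows only the linkage idea: each record seals the previous digest, and verification reports the index of the first record that fails.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical record (NOT grafos_audit::AuditRecord).
#[derive(Clone, Debug)]
pub struct SketchRecord {
    pub sequence: u64,
    pub payload: String,
    /// Digest of the previous record, or the genesis value for the first one.
    pub prev: u64,
    /// Digest over (prev, sequence, payload).
    pub digest: u64,
}

/// Toy digest. DefaultHasher is NOT cryptographic; a real chain would use SHA-256.
fn digest(prev: u64, sequence: u64, payload: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (prev, sequence, payload).hash(&mut hasher);
    hasher.finish()
}

/// Seal a new record onto the chain head `prev`.
pub fn seal(prev: u64, sequence: u64, payload: &str) -> SketchRecord {
    SketchRecord {
        sequence,
        payload: payload.to_string(),
        prev,
        digest: digest(prev, sequence, payload),
    }
}

/// Walk the chain from `genesis`; return Err(index) for the first record whose
/// back-link or digest does not match.
pub fn verify(records: &[SketchRecord], genesis: u64) -> Result<(), usize> {
    let mut expected_prev = genesis;
    for (index, record) in records.iter().enumerate() {
        let recomputed = digest(record.prev, record.sequence, &record.payload);
        if record.prev != expected_prev || record.digest != recomputed {
            return Err(index);
        }
        expected_prev = record.digest;
    }
    Ok(())
}
```

Because every digest covers the previous digest, editing, removing, or reordering any record changes what the next record should have sealed, so the walk fails at the first affected index.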

Correlation

Every event in every layer should carry, where applicable:

  • request_id / trace_id (W3C traceparent, see docs/operations/scheduler-features.md § “Trace context propagation”)
  • lease_id
  • node_id

A SIEM filter or operator query then walks across all three layers on the same key.
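As an illustration of walking layers on a shared key, the sketch below merges entries from the events and audit layers into one timeline for a single `lease_id`. The `Entry` and `Layer` types and the `timeline_for_lease` function are hypothetical; they assume each layer's records have already been exported into a common shape carrying the correlation fields listed above.

```rust
/// Which layer an entry came from (hypothetical; for illustration only).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Layer {
    Events,
    Audit,
}

/// Hypothetical common shape for exported records from either layer.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Entry {
    pub layer: Layer,
    pub timestamp: u64,
    pub lease_id: u64,
    pub node_id: String,
    pub trace_id: Option<String>,
    pub what: String,
}

/// Select every entry for `lease_id` and order it by timestamp.
/// The sort is stable, so entries with equal timestamps keep their input order.
pub fn timeline_for_lease(entries: &[Entry], lease_id: u64) -> Vec<&Entry> {
    let mut hits: Vec<&Entry> = entries
        .iter()
        .filter(|entry| entry.lease_id == lease_id)
        .collect();
    hits.sort_by_key(|entry| entry.timestamp);
    hits
}
```

The same filter works with `trace_id` or `node_id` as the key; the point is that one key selects matching records regardless of which layer produced them.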

Program

use cookbook_recipe_15_observing_everything::{
    dev_observability, record_lease_acquired, record_lease_expired, record_operation,
};
use grafos_audit::verify_chain;
use grafos_core::WorkloadIdentity;
use grafos_observe::{OpType, ResourceType};

let (metrics, mut events, mut anchor, signer) = dev_observability();
let identity = WorkloadIdentity::tenant_only("acme");

let r1 = record_lease_acquired(
    &metrics, &mut events, &mut anchor, &signer,
    /*sequence*/ 1, /*timestamp*/ 1_700_000_000,
    &identity, ResourceType::Mem, /*lease_id*/ 7, "node-a", /*bytes*/ 4096,
);

// ... workload runs ...
record_operation(&metrics, OpType::Read, 120, 4096);
record_operation(&metrics, OpType::Write, 240, 8192);

// Lease expires later in the day.
let r2 = record_lease_expired(
    &metrics, &mut events, &mut anchor, &signer,
    /*sequence*/ 2, /*timestamp*/ 1_700_000_300,
    &identity, ResourceType::Mem, /*lease_id*/ 7, "node-a",
);

// All three layers reflect the same lifecycle.
assert_eq!(metrics.leases_total.get(), 1);
assert_eq!(events.len(), 2);
verify_chain(&[r1, r2], [0u8; grafos_audit::HASH_LEN]).expect("chain verifies");

Debugging Example: Periodic Latency Spike

Symptom:

  • p99 jumps every ~300 seconds.

Investigation across the three layers:

  1. Metrics layer: leases_expired shows bursts at the same interval. Confirms it’s a lease lifecycle event, not a network glitch.
  2. Events layer: ring buffer shows LeaseExpired → LeaseAcquired pairs at the same lease_id every 300s. Confirms the renewal pattern, not new lease churn.
  3. Audit chain: grafos admin audit-query --kind lease_expired --since <t> (or Recipe 55’s collector) returns the sealed records with sequence numbers and timestamps. Confirms what was durably recorded at the producer.

Root cause: application renews leases at 99% of TTL — when the network adds even a small RTT, the renewal lands after expiry.

Fix: renew at 60-80% of TTL.
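The arithmetic behind the fix can be sketched as follows. The 300-second TTL used in the usage note is an assumption inferred from the ~300-second spike interval (the recipe does not state the TTL), and `renewal_margin` and `renewal_is_safe` are hypothetical helper names.

```rust
use std::time::Duration;

/// Time remaining between sending the renewal and lease expiry when the
/// renewal is issued at `renew_at_percent` of the TTL.
pub fn renewal_margin(ttl: Duration, renew_at_percent: u32) -> Duration {
    assert!(renew_at_percent <= 100, "percentage must be 0..=100");
    ttl * (100 - renew_at_percent) / 100
}

/// A renewal is safe only if the margin exceeds the worst round trip we expect.
pub fn renewal_is_safe(ttl: Duration, renew_at_percent: u32, worst_case_rtt: Duration) -> bool {
    renewal_margin(ttl, renew_at_percent) > worst_case_rtt
}
```

With an assumed 300 s TTL, renewing at 99% leaves a 3 s margin, so any stall longer than that lets the lease expire; renewing at 70% leaves 90 s.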

Verification:

  • leases_expired rate falls to ~0 in steady state on the metrics layer;
  • ring buffer shows no LeaseExpired between checkpoints;
  • audit chain shows only LeaseAllocated + LeaseRenewed for long-running workloads.

Failure Modes

  • Metrics layer overflow: counters saturate at u64::MAX; histograms drop outliers above the configured ceiling. Both are expected.
  • Events layer ring wrap: oldest events drop when the buffer is full. Sizing matters — set the ring to cover the longest correlation window you expect.
  • Audit chain tamper: verify_chain returns the failing record index. The reference collector at crates/grafos-audit-collector increments chain_verification_failures and refuses to advance the anchor; production callers route this to an alert.
  • Anchor corruption: if the persisted anchor file is missing or corrupt, FileAnchorStore::load_or_unanchored returns a fresh sentinel — producers continue but the chain restarts. Production callers persist the anchor to a side store before the producer process exits.
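Two of the failure modes above reduce to simple arithmetic, sketched here under stated assumptions. `ring_capacity` is a hypothetical sizing helper (peak event rate times correlation window, plus headroom); the rate, window, and headroom values are inputs you would measure or choose, not figures from grafOS. `bump` shows the saturating-counter behaviour using the standard `u64::saturating_add`.

```rust
/// Hypothetical sizing rule: enough slots for `window_secs` of events at the
/// peak rate, plus `headroom_percent` extra. Saturates instead of overflowing.
pub fn ring_capacity(peak_events_per_sec: u64, window_secs: u64, headroom_percent: u64) -> u64 {
    let base = peak_events_per_sec.saturating_mul(window_secs);
    let headroom = base.saturating_mul(headroom_percent) / 100;
    base.saturating_add(headroom)
}

/// Increment a counter that sticks at u64::MAX rather than wrapping to zero.
pub fn bump(counter: u64) -> u64 {
    counter.saturating_add(1)
}
```

For example, 50 events per second over a 60-second window with 20% headroom gives 3600 slots. A counter that sticks at the maximum is easier to alert on than one that silently wraps back to a small value.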

Tests

Run it with:

cargo test -p cookbook-recipe-15-observing-everything

Two tests cover this recipe. The first drives the full lease lifecycle through all three layers: metrics counters move, the events ring records two entries, and the audit chain produces two hash-linked records that verify_chain accepts. The second exercises the data-plane operation path, which goes only to metrics (per-byte traffic does not enter the chain).

Adaptation Notes

  • Production signer: replace NullSigner with grafos_audit::Ed25519Signer once Phase 220 wires the custody-matrix-named process. Until then NullSigner produces unsigned records; verify_chain accepts them.
  • Production anchor: replace MemoryAnchorStore with FileAnchorStore::load_or_unanchored(path) so the chain resumes from a known head after a process restart.
  • Alert rules: build alerts on the SIEM-stable counters per docs/operations/siem-vocabulary-cookbook.md — e.g. “grafos_audit_records_lease_expired_total rate over 5m > N” fires when the renewal pattern breaks.
  • Per-lease cost tracking: pair with Recipe 59 (Cost Attribution with Accounting Tags) — every lease event carries a typed AccountingTag in canonical bytes, so cost rollups use the same surface this recipe produces.
  • Tenant-side dashboard: pair with Recipe 60 (Tenant Audit Dashboard) — consumes the chain this recipe produces and projects it into operator-readable views.

Variations

  • Trace context propagation: thread a W3C traceparent into every lease event so all three layers carry the same trace_id. The FabricEvent::LeaseAcquired { trace_id, .. } field is already wired for this; the audit-chain layer carries the trace context inside the typed EdgeRecord payload for edge_rewritten records.
  • Replay from chain: with the audit chain persisted, an operator can reconstruct lease-lifecycle history without trusting the metrics or events surface — verify_chain proves the records weren’t tampered with after sealing.
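For the trace-context variation, the trace_id has to be extracted from the W3C traceparent header before it can be threaded into lease events. The sketch below parses the header format defined by the W3C Trace Context specification (`version-traceid-parentid-flags`, lowercase hex). The function name `trace_id_from_traceparent` is hypothetical, and how the resulting id is handed to the grafOS recording helpers is not shown because the source does not specify it.

```rust
/// Extract the 32-hex-digit trace-id from a W3C traceparent header value.
/// Returns None when the header is malformed.
pub fn trace_id_from_traceparent(header: &str) -> Option<&str> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Each field is fixed-width lowercase hexadecimal.
    let is_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    let all_zero = |s: &str| s.bytes().all(|b| b == b'0');

    // Version "ff" is forbidden by the specification.
    if !is_hex(version, 2) || version == "ff" {
        return None;
    }
    // All-zero trace-id and parent-id values are invalid.
    if !is_hex(trace_id, 32) || all_zero(trace_id) {
        return None;
    }
    if !is_hex(parent_id, 16) || all_zero(parent_id) {
        return None;
    }
    if !is_hex(flags, 2) {
        return None;
    }
    // Version 00 has exactly four fields; later versions may append more.
    if version == "00" && parts.next().is_some() {
        return None;
    }
    Some(trace_id)
}
```

Rejecting malformed headers up front means every layer either carries a valid trace_id or none at all, which keeps cross-layer joins on that key unambiguous.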

See also:

  • Recipe 55 (Consuming the Audit Chain) — the downstream collector pattern.
  • Recipe 60 (Tenant Audit Dashboard) — the operator-facing projection.
  • Recipe 59 (Cost Attribution with Accounting Tags) — cost rollups keyed on the same lease events.