Skip to content

Recipe 2: Computation That Survives a Power Outage

Situation

You have a long-running computation (Monte Carlo, ETL, model training, search indexing) that can run for minutes to hours. If the process or node dies near the end, you lose everything unless you checkpoint.

In conventional systems, checkpointing is often an afterthought:

  • Ad-hoc file writes to local disk.
  • A bespoke snapshot format.
  • A fragile set of “resume” flags.

In the grafOS model, you explicitly lease:

  • Fast working memory.
  • Persistent block storage.

The goal: build computation where failure cost is bounded by your checkpoint interval, not your total run time.

What You Build

A loop that:

  • Maintains a mutable state T.
  • Periodically serializes it into a durable block lease.
  • On startup, checks whether a checkpoint exists and resumes from it.

Building Blocks

  • grafos_collections::durable::Durable<T> for checkpoint/restore.
  • grafos_std::block::BlockBuilder for acquiring block storage leases.
  • serde (already used by Durable) for serializing state.

See also:

Design

Checkpoint Granularity

Checkpointing every iteration is safe but expensive. Choose an interval:

  • Every N iterations.
  • Every M seconds (requires a clock; in mocks you can control time).
  • Every time you finish a “work unit” that is naturally idempotent.

State Shape

Keep the checkpoint state compact and deterministic:

  • Random seeds
  • Current iteration index
  • Accumulators / partial results
  • Minimal metadata needed to resume

Avoid including large derived data you can recompute cheaply.

Crash Consistency

A checkpoint write can be interrupted. You want “all or nothing” restore semantics.

Durable<T> generally addresses this by writing a full serialized snapshot; in more advanced designs, you can add:

  • versioned header
  • length prefix
  • checksum / hash

If you later want multi-version rollback, keep two alternating slots (A/B) and write the new checkpoint to the inactive slot, then flip a pointer.

Walkthrough (Implementation Sketch)

1. Define Your State

use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Serialize, Deserialize)]
struct SimState {
iter: u64,
rng_seed: u64,
sum: f64,
count: u64,
}

2. Acquire Block Storage Lease and Restore-or-Initialize

use grafos_collections::durable::Durable;
use grafos_std::block::BlockBuilder;
let block_lease = BlockBuilder::new().min_blocks(64).acquire()?;
// Either restore from an existing checkpoint lease, or create a new one.
// In practice you'll store the lease id somewhere stable; for the recipe we
// just demonstrate the restore API shape.
let mut d: Durable<SimState> = Durable::new(
SimState { iter: 0, rng_seed: 123, sum: 0.0, count: 0 },
block_lease,
);

To model “resume after crash”, see the example in crates/grafos-collections/examples/checkpoint_restore.rs: drop the Durable, acquire a new block lease handle (to the same underlying resource in a real system), and call Durable::restore.

3. Run Loop With Periodic Checkpoints

let checkpoint_every = 10_000u64;
let total_iters = 5_000_000u64;
while d.iter < total_iters {
// Do one work step (update d.inner_mut()).
// ...
d.inner_mut().iter += 1;
if d.iter % checkpoint_every == 0 {
d.checkpoint()?;
}
}
// Final checkpoint for good measure.
d.checkpoint()?;

4. Recovery

After a crash:

  1. Lease fresh working memory somewhere else (if needed).
  2. Restore state from block storage.
  3. Continue from state.iter.

The important conceptual split:

  • Working memory lease failure does not imply durable state failure.
  • Durable state can outlive any one node.

Failure Modes and Recovery

  • FabricError::Disconnected on checkpoint: treat as a hard error; you cannot prove durability.
    • Retry with backoff.
    • Optionally acquire a new block lease and continue with a new checkpoint target.
  • FabricError::LeaseExpired on block lease: the checkpoint resource is gone.
    • This is the equivalent of “your disk died.” Decide if the computation should abort or restart from scratch.

Observability

Track:

  • checkpoint_bytes_written
  • checkpoint_duration_ms
  • checkpoint_success_total / checkpoint_fail_total
  • resume_count_total

Also log checkpoint metadata: iteration number and checkpoint epoch.

Variations

  • Incremental checkpoints: log deltas instead of full snapshots, periodically compact.
  • Fan-out compute: multiple workers checkpoint to separate leases; a coordinator aggregates.
  • Deterministic replay: store only seeds and inputs, recompute everything deterministically on resume.

Testing

Run against a live dev fabric or a test cell with a durable FBBU lease.

  • Build a small loop.
  • Force a tasklet restart after checkpointing.
  • Restore from the surviving block lease and assert the iteration counter and accumulators match expected values.