Recipe 2: Computation That Survives a Power Outage
Situation
You have a long-running computation (Monte Carlo, ETL, model training, search indexing) that can run for minutes to hours. If the process or node dies near the end, you lose everything unless you checkpoint.
In conventional systems, checkpointing is often an afterthought:
- Ad-hoc file writes to local disk.
- A bespoke snapshot format.
- A fragile set of “resume” flags.
In the grafOS model, you explicitly lease:
- Fast working memory.
- Persistent block storage.
The goal: build computation where failure cost is bounded by your checkpoint interval, not your total run time.
What You Build
A loop that:
- Maintains a mutable state
T. - Periodically serializes it into a durable block lease.
- On startup, checks whether a checkpoint exists and resumes from it.
Building Blocks
grafos_collections::durable::Durable<T>for checkpoint/restore.grafos_std::block::BlockBuilderfor acquiring block storage leases.serde(already used byDurable) for serializing state.
See also:
- Durable checkpoint/restore example
- grafos-collections guide (Durable)
- grafos-std block API (source)
- grafos-collections README
Design
Checkpoint Granularity
Checkpointing every iteration is safe but expensive. Choose an interval:
- Every N iterations.
- Every M seconds (requires a clock; in mocks you can control time).
- Every time you finish a “work unit” that is naturally idempotent.
State Shape
Keep the checkpoint state compact and deterministic:
- Random seeds
- Current iteration index
- Accumulators / partial results
- Minimal metadata needed to resume
Avoid including large derived data you can recompute cheaply.
Crash Consistency
A checkpoint write can be interrupted. You want “all or nothing” restore semantics.
Durable<T> generally addresses this by writing a full serialized snapshot; in more advanced designs, you can add:
- versioned header
- length prefix
- checksum / hash
If you later want multi-version rollback, keep two alternating slots (A/B) and write the new checkpoint to the inactive slot, then flip a pointer.
Walkthrough (Implementation Sketch)
1. Define Your State
use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Serialize, Deserialize)]struct SimState { iter: u64, rng_seed: u64, sum: f64, count: u64,}2. Acquire Block Storage Lease and Restore-or-Initialize
use grafos_collections::durable::Durable;use grafos_std::block::BlockBuilder;
let block_lease = BlockBuilder::new().min_blocks(64).acquire()?;
// Either restore from an existing checkpoint lease, or create a new one.// In practice you'll store the lease id somewhere stable; for the recipe we// just demonstrate the restore API shape.let mut d: Durable<SimState> = Durable::new( SimState { iter: 0, rng_seed: 123, sum: 0.0, count: 0 }, block_lease,);To model “resume after crash”, see the example in crates/grafos-collections/examples/checkpoint_restore.rs:
drop the Durable, acquire a new block lease handle (to the same underlying resource in a real system),
and call Durable::restore.
3. Run Loop With Periodic Checkpoints
let checkpoint_every = 10_000u64;let total_iters = 5_000_000u64;
while d.iter < total_iters { // Do one work step (update d.inner_mut()). // ...
d.inner_mut().iter += 1;
if d.iter % checkpoint_every == 0 { d.checkpoint()?; }}
// Final checkpoint for good measure.d.checkpoint()?;4. Recovery
After a crash:
- Lease fresh working memory somewhere else (if needed).
- Restore state from block storage.
- Continue from
state.iter.
The important conceptual split:
- Working memory lease failure does not imply durable state failure.
- Durable state can outlive any one node.
Failure Modes and Recovery
FabricError::Disconnectedon checkpoint: treat as a hard error; you cannot prove durability.- Retry with backoff.
- Optionally acquire a new block lease and continue with a new checkpoint target.
FabricError::LeaseExpiredon block lease: the checkpoint resource is gone.- This is the equivalent of “your disk died.” Decide if the computation should abort or restart from scratch.
Observability
Track:
checkpoint_bytes_writtencheckpoint_duration_mscheckpoint_success_total/checkpoint_fail_totalresume_count_total
Also log checkpoint metadata: iteration number and checkpoint epoch.
Variations
- Incremental checkpoints: log deltas instead of full snapshots, periodically compact.
- Fan-out compute: multiple workers checkpoint to separate leases; a coordinator aggregates.
- Deterministic replay: store only seeds and inputs, recompute everything deterministically on resume.
Testing
Run against a live dev fabric or a test cell with a durable FBBU lease.
- Build a small loop.
- Force a tasklet restart after checkpointing.
- Restore from the surviving block lease and assert the iteration counter and accumulators match expected values.