Recipe 16: Pop-Up Supercomputer in 30 Seconds

Situation

You occasionally need huge parallel compute:

  • a one-off simulation
  • a deadline-driven batch job
  • a “rerun everything” recovery operation

Provisioning a cluster for this can cost more time than the compute itself.

The lease model lets you treat the fabric as a pool:

  • acquire lots of CPU quickly
  • run
  • disappear

This recipe is deliberately about the “unexpectedly cool” version of burst computing: the cluster exists only as long as you hold leases. There is no pre-provisioned worker fleet and no cleanup tail.

What You Build

A coordinator that:

  • acquires many CPU leases (short TTL)
  • dispatches tasklets for work chunks
  • retries chunks on failure via JobCoordinator
  • renews leases via RenewalManager
  • releases all resources at the end

Building Blocks

  • grafos_jobs::{JobCoordinator, RetryPolicy}
  • grafos_leasekit::{RenewalManager, RenewalPolicy}
  • CpuBuilder leases
  • BlockBuilder for shared outputs
  • grafos_observe for lease lifecycle visibility

Design

Idempotent Work Units

Each chunk should be safe to run multiple times. Implement WorkChunk for your chunk type with a stable chunk_id() derived from inputs:

  • index range (start, end)
  • partition id
  • input file hash + shard id

JobCoordinator uses the chunk ID to deduplicate results and ensure idempotent output capture.
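
A minimal sketch of an input-derived chunk ID and the deduplication it enables. RangeChunk, record_once, and the FNV-1a hash are illustrative assumptions; this does not implement the real WorkChunk trait, whose chunk_id() signature may differ.

```rust
use std::collections::HashMap;

/// Hypothetical chunk type: a half-open index range over the input.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RangeChunk {
    pub start: u64,
    pub end: u64,
}

impl RangeChunk {
    /// Stable ID derived only from the inputs (FNV-1a over both bounds).
    /// The same range yields the same ID on every run and every machine.
    pub fn chunk_id(&self) -> u64 {
        let start = self.start.to_le_bytes();
        let end = self.end.to_le_bytes();
        let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
        for byte in start.iter().chain(end.iter()) {
            hash ^= u64::from(*byte);
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a 64-bit prime
        }
        hash
    }
}

/// Records the first output seen for a chunk ID; later duplicates are ignored.
/// Returns true if the output was stored.
pub fn record_once(
    results: &mut HashMap<u64, Vec<u8>>,
    chunk: &RangeChunk,
    output: Vec<u8>,
) -> bool {
    let id = chunk.chunk_id();
    if results.contains_key(&id) {
        return false;
    }
    results.insert(id, output);
    true
}
```

A hand-rolled hash is used here because the output of Rust's default hasher is not guaranteed to be stable across releases, which would defeat the purpose of a stable ID.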

Resource Elasticity

Acquire until:

  • you hit the target parallelism, or
  • acquisition fails (capacity pressure)

The coordinator is itself small; it can run anywhere.

Treat acquisition failure as a normal operating condition:

  • if you can only acquire 30 cores, the job still runs, just slower
  • if you can acquire more later, you can add leases mid-run
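
The acquire-until-target-or-failure rule can be sketched as a small loop. Lease and acquire_up_to are hypothetical names; the try_acquire closure stands in for whatever call actually requests a CPU lease.

```rust
/// Hypothetical stand-in for a CPU lease handle.
#[derive(Debug)]
pub struct Lease {
    pub id: u32,
}

/// Calls `try_acquire` until `target` leases are held or an attempt fails.
/// A failed attempt is treated as capacity pressure, not as a job failure.
pub fn acquire_up_to<E>(
    target: usize,
    mut try_acquire: impl FnMut() -> Result<Lease, E>,
) -> Vec<Lease> {
    let mut leases = Vec::with_capacity(target);
    while leases.len() < target {
        match try_acquire() {
            Ok(lease) => leases.push(lease),
            Err(_) => break, // run with whatever was acquired so far
        }
    }
    leases
}
```

The caller sizes its parallelism from the length of the returned vector rather than from the requested target, so a partial acquisition still produces a running job.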

Lease Lifetimes and Renewal

Use RenewalManager to keep CPU leases alive during execution:

use grafos_leasekit::{RenewalManager, RenewalPolicy};
let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy {
    renew_at_fraction: 0.70, // renew at 70% elapsed (30% remaining)
    ..RenewalPolicy::default()
};
// Register each CPU lease
for (lease_id, expiry) in &cpu_leases {
    renewal_mgr.register(*lease_id, *expiry, policy);
}
// Tick periodically during execution
let summary = renewal_mgr.tick(now);
// summary.renewed, summary.failed, summary.expired

If renewal fails, mark the worker as lost and requeue its chunks.
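
A minimal sketch of the two pieces of logic involved: deciding when a renewal is due, and requeueing the chunks of a lost worker. It does not call RenewalManager; renewal_due, requeue_lost, and the u64 lease and chunk IDs are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// True once `fraction` of the lease lifetime has elapsed.
/// Times are in seconds; `fraction` plays the role of `renew_at_fraction`.
pub fn renewal_due(granted_at: u64, ttl_secs: u64, fraction: f64, now: u64) -> bool {
    let elapsed = now.saturating_sub(granted_at) as f64;
    elapsed >= fraction * ttl_secs as f64
}

/// Moves the in-flight chunks of every lost lease back onto the work queue.
/// `in_flight` maps a lease ID to the chunk IDs currently running on it.
pub fn requeue_lost(
    failed_leases: &[u64],
    in_flight: &mut HashMap<u64, Vec<u64>>,
    queue: &mut VecDeque<u64>,
) {
    for lease_id in failed_leases {
        if let Some(chunks) = in_flight.remove(lease_id) {
            queue.extend(chunks);
        }
    }
}
```

With a 60-second TTL and a fraction of 0.70, renewal becomes due roughly 42 seconds after the lease was granted, which leaves about 18 seconds to retry before expiry.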

Retry Policy

RetryPolicy classifies errors: Disconnected and LeaseExpired are transient and are retried; everything else is permanent and fails closed. The policy also sets the retry count and the backoff bounds:

use grafos_jobs::RetryPolicy;
let policy = RetryPolicy {
    max_retries: 3,
    initial_backoff_secs: 1,
    max_backoff_secs: 16,
};
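
One plausible reading of the two backoff fields is a delay that doubles per retry and is capped at the maximum. The doubling schedule is an assumption of this sketch, not something the recipe states, and backoff_secs is a hypothetical helper.

```rust
/// Delay in seconds before retry number `attempt` (1-based).
/// Starts at `initial_secs`, doubles on each retry, and is capped at `max_secs`.
pub fn backoff_secs(attempt: u32, initial_secs: u64, max_secs: u64) -> u64 {
    // Clamp the exponent so the shift below cannot overflow a u64.
    let exponent = attempt.saturating_sub(1).min(63);
    initial_secs
        .saturating_mul(1u64 << exponent)
        .min(max_secs)
}
```

Under this assumption, the values above (initial 1, max 16) give delays of 1, 2, and 4 seconds for the three permitted retries; the 16-second cap would only apply if max_retries were raised to 5 or more.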

Output Storage

If chunk outputs are small, you can return them in the tasklet output buffer. For larger outputs:

  • allocate a block lease for job outputs
  • write chunk_id -> output_locator into a small index (could be FabricHashMap or a coordinator-local map)

This decouples worker lifetimes from output lifetimes.
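
A minimal sketch of the coordinator-local variant of that index. OutputLocator and OutputIndex are hypothetical; the locator is modeled as an offset and length inside a single output region, which is one possible layout rather than the library's.

```rust
use std::collections::HashMap;

/// Hypothetical locator: where a chunk's output lives in the shared output region.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OutputLocator {
    pub offset: u64,
    pub len: u64,
}

/// Coordinator-local index from chunk ID to output location.
/// Space is handed out append-only; nothing is ever moved or freed.
#[derive(Debug, Default)]
pub struct OutputIndex {
    next_offset: u64,
    entries: HashMap<u64, OutputLocator>,
}

impl OutputIndex {
    /// Reserves `len` bytes for `chunk_id`. A repeated reservation for the
    /// same chunk returns the existing locator, so retries reuse one slot.
    pub fn reserve(&mut self, chunk_id: u64, len: u64) -> OutputLocator {
        if let Some(existing) = self.entries.get(&chunk_id) {
            return *existing;
        }
        let locator = OutputLocator { offset: self.next_offset, len };
        self.next_offset += len;
        self.entries.insert(chunk_id, locator);
        locator
    }

    /// Looks up where a chunk's output was placed, if it has a slot.
    pub fn lookup(&self, chunk_id: u64) -> Option<OutputLocator> {
        self.entries.get(&chunk_id).copied()
    }
}
```

Because a retried chunk gets the same slot back, rerunning it overwrites its own output instead of consuming new space.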

Walkthrough (Implementation Sketch)

1. Choose Chunking

Pick chunk boundaries that give you:

  • roughly uniform runtime per chunk
  • deterministic inputs
  • bounded output size
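
For index-range chunking, the simplest scheme is fixed-size splitting, sketched below. split_range is a hypothetical helper, and it gives uniform runtime only when the per-item cost is roughly constant.

```rust
/// Splits `0..total` into half-open ranges of at most `chunk_size` items.
/// The result depends only on the two arguments, so reruns see identical chunks.
pub fn split_range(total: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut chunks = Vec::new();
    let mut start = 0u64;
    while start < total {
        let end = start.saturating_add(chunk_size).min(total);
        chunks.push((start, end));
        start = end;
    }
    chunks
}
```

Only the last chunk can be shorter than chunk_size, which also bounds the output size when output grows with the number of items.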

2. Acquire CPU Leases

Acquire in a loop until you hit your target:

  • CpuBuilder::new().cores(n).lease_secs(ttl).acquire()?;

3. Submit Tasklets

For each available lease:

  • pop a chunk from the queue
  • submit tasklet WASM with a fuel limit
  • pass chunk descriptor as input()
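
The pairing of leases with queued chunks can be sketched as follows. assign_chunks and the u64 IDs are assumptions; the actual tasklet submission and the fuel limit are left out because their API is not shown in this recipe.

```rust
use std::collections::VecDeque;

/// Pairs each idle lease with the next queued chunk until one side runs out.
/// Returns (lease_id, chunk_id) assignments; unassigned leases stay in `idle`
/// and unassigned chunks stay in `queue`.
pub fn assign_chunks(idle: &mut Vec<u64>, queue: &mut VecDeque<u64>) -> Vec<(u64, u64)> {
    let mut assignments = Vec::new();
    while !idle.is_empty() && !queue.is_empty() {
        let lease_id = idle.pop().expect("idle is non-empty");
        let chunk_id = queue.pop_front().expect("queue is non-empty");
        assignments.push((lease_id, chunk_id));
    }
    assignments
}
```

Calling this again whenever a tasklet completes or a new lease arrives keeps every lease busy as long as work remains.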

4. Handle Completion with JobCoordinator

Use JobCoordinator to orchestrate execution with automatic retry:

use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};
let mut store = MemoryOutputStore::new();
let policy = RetryPolicy::default();
let mut coord = JobCoordinator::new(policy);
let result = coord.run(
    &chunks,
    &mut store,
    |chunk_bytes| execute_chunk(chunk_bytes), // exec_fn
    |outputs| aggregate_results(outputs),     // agg_fn
)?;
// result.aggregate contains the final aggregated output
// result.chunks_succeeded, result.chunks_failed for stats

5. Drop Everything

Drop CPU leases and intermediate leases. Lease count should fall back to baseline immediately.
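
A sketch of why the count falls immediately, assuming leases are released when their handles are dropped. The wording of this step suggests that, but the recipe does not spell it out, and LeaseGuard is a hypothetical stand-in for a real lease handle.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical lease guard: counts itself as active until it is dropped.
pub struct LeaseGuard {
    active: Rc<Cell<usize>>,
}

impl LeaseGuard {
    /// Increments the shared active-lease counter and returns a guard for it.
    pub fn acquire(active: &Rc<Cell<usize>>) -> Self {
        active.set(active.get() + 1);
        LeaseGuard { active: Rc::clone(active) }
    }
}

impl Drop for LeaseGuard {
    /// Releasing is tied to scope exit, so there is no separate cleanup pass.
    fn drop(&mut self) {
        self.active.set(self.active.get() - 1);
    }
}
```

Holding all guards in one Vec means a single drop of that Vec, or simply letting it go out of scope, releases the whole burst at once.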

Observability

Your graph should show:

  • rapid ramp to many leases
  • flat during compute
  • immediate drop back to baseline at completion

Add specific signals:

  • chunks_total, chunks_done, chunks_retry_total
  • cpu_leases_active
  • tasklet_duration_ms histogram
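
These signals can be kept in a plain struct on the coordinator before being exported. The sketch below does not use grafos_observe; BurstMetrics, its methods, and the histogram bucket boundaries are assumptions.

```rust
/// Hypothetical in-memory holder for the signals listed above.
#[derive(Debug, Default)]
pub struct BurstMetrics {
    pub chunks_total: u64,
    pub chunks_done: u64,
    pub chunks_retry_total: u64,
    pub cpu_leases_active: u64,
    /// Histogram buckets: <10 ms, <100 ms, <1000 ms, >=1000 ms.
    pub tasklet_duration_ms: [u64; 4],
}

impl BurstMetrics {
    /// Records one successfully completed chunk and its tasklet duration.
    pub fn record_done(&mut self, duration_ms: u64) {
        let bucket = match duration_ms {
            0..=9 => 0,
            10..=99 => 1,
            100..=999 => 2,
            _ => 3,
        };
        self.tasklet_duration_ms[bucket] += 1;
        self.chunks_done += 1;
    }

    /// Records one retry of any chunk.
    pub fn record_retry(&mut self) {
        self.chunks_retry_total += 1;
    }

    /// True once every chunk has completed.
    pub fn is_complete(&self) -> bool {
        self.chunks_done == self.chunks_total
    }
}
```

A rising chunks_retry_total while cpu_leases_active falls is the pattern to watch for, since it indicates that leases are being lost mid-run.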

Failure Modes

  • Disconnected: classified as transient by RetryPolicy; chunk is retried.
  • LeaseExpired: also transient; requeue chunk.
  • Fenced / CapacityExceeded: permanent errors; fail closed.
  • Coordinator crash: leases expire and are reclaimed automatically; rerun the job from scratch or from durable checkpoints.
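
The per-error handling above reduces to a small decision function. ChunkError and Action are hypothetical enums named after the failure modes in the list, and treating an exhausted retry budget as fail-closed is an assumption of this sketch.

```rust
/// Hypothetical error kinds, named after the failure modes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChunkError {
    Disconnected,
    LeaseExpired,
    Fenced,
    CapacityExceeded,
}

/// What the coordinator does with a chunk after an error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Requeue,
    FailClosed,
}

/// Transient errors are the ones worth retrying.
pub fn is_transient(err: ChunkError) -> bool {
    matches!(err, ChunkError::Disconnected | ChunkError::LeaseExpired)
}

/// Requeues transient errors while retries remain; everything else fails closed.
pub fn next_action(err: ChunkError, retries_so_far: u32, max_retries: u32) -> Action {
    if is_transient(err) && retries_so_far < max_retries {
        Action::Requeue
    } else {
        Action::FailClosed
    }
}
```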

Variations

  • multi-phase jobs: map -> reduce -> finalize
  • heterogeneous leasing: request more cores for heavy chunks
  • checkpoint coordinator state to block lease for resumable bursts