Recipe 16: Pop-Up Supercomputer in 30 Seconds

Situation

You occasionally need huge parallel compute:

a one-off simulation
a deadline-driven batch job
a “rerun everything” recovery operation

Provisioning a cluster for this can take longer than the compute itself.

The lease model lets you treat the fabric as a pool:

acquire lots of CPU quickly
run
disappear

This recipe is deliberately about the “unexpectedly cool” version of burst computing: the cluster exists only as long as you hold leases. There is no pre-provisioned worker fleet and no cleanup tail.

What You Build

A coordinator that:

acquires many CPU leases (short TTL)
dispatches tasklets for work chunks
retries chunks on failure via JobCoordinator
renews leases via RenewalManager
releases all resources at the end

Building Blocks

grafos_jobs::{JobCoordinator, RetryPolicy}
grafos_leasekit::{RenewalManager, RenewalPolicy}
CpuBuilder leases
BlockBuilder for shared outputs
grafos_observe for lease lifecycle visibility

Design

Idempotent Work Units

Each chunk should be safe to run multiple times. Implement WorkChunk for your chunk type with a stable chunk_id() derived from inputs:

index range (start, end)
partition id
input file hash + shard id

JobCoordinator uses the chunk ID to deduplicate results and ensure idempotent output capture.
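
As a minimal sketch of a stable ID, the example below builds the ID as a string from the chunk's inputs. The `WorkChunk` trait shown here is a local stand-in with an assumed shape (the real trait in grafos_jobs may differ), and `RangeChunk` and its fields are hypothetical.

```rust
/// Hypothetical stand-in for the recipe's WorkChunk trait; the real
/// trait in grafos_jobs may have a different shape.
trait WorkChunk {
    fn chunk_id(&self) -> String;
}

/// A chunk described purely by its inputs: an input hash plus an index range.
struct RangeChunk {
    input_hash: u64,
    start: u64,
    end: u64,
}

impl WorkChunk for RangeChunk {
    /// The ID depends only on the inputs, so a retried chunk gets the same ID.
    fn chunk_id(&self) -> String {
        format!("{:016x}:{}-{}", self.input_hash, self.start, self.end)
    }
}
```

Formatting the inputs directly avoids relying on a hash function whose output could change between compiler or library versions.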

Resource Elasticity

Acquire until:

you hit the target parallelism, or
acquisition fails (capacity pressure)

The coordinator is itself small; it can run anywhere.

Treat acquisition failure as a normal operating condition:

if you can only acquire 30 cores, the job still runs, just slower
if you can acquire more later, you can add leases mid-run
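
The acquire-until-target-or-failure rule can be sketched as a small loop. `CpuLease` and the `try_acquire` closure are placeholders for the real lease type and acquisition call, which are not shown here.

```rust
/// Hypothetical lease handle; stands in for whatever the real
/// acquisition call returns.
#[derive(Debug)]
struct CpuLease {
    id: u32,
}

/// Acquire up to `target` leases, stopping at the first failure.
/// `try_acquire` is a placeholder for the real acquisition call.
fn acquire_up_to<F>(target: usize, mut try_acquire: F) -> Vec<CpuLease>
where
    F: FnMut() -> Result<CpuLease, String>,
{
    let mut leases = Vec::with_capacity(target);
    while leases.len() < target {
        match try_acquire() {
            Ok(lease) => leases.push(lease),
            // Capacity pressure is expected: run with what we have.
            Err(_) => break,
        }
    }
    leases
}
```

The function returns a possibly short list rather than an error, which matches treating acquisition failure as a normal condition. Calling it again later with the remaining shortfall is one way to add leases mid-run.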

Lease Lifetimes and Renewal

Use RenewalManager to keep CPU leases alive during execution:

use grafos_leasekit::{RenewalManager, RenewalPolicy};

let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy {
    renew_at_fraction: 0.70,  // renew at 70% elapsed (30% remaining)
    ..RenewalPolicy::default()
};

// Register each CPU lease
for (lease_id, expiry) in &cpu_leases {
    renewal_mgr.register(*lease_id, *expiry, policy);
}

// Tick periodically during execution
let summary = renewal_mgr.tick(now);
// summary.renewed, summary.failed, summary.expired

If renewal fails, mark the worker as lost and requeue its chunks.
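
One way to implement "mark the worker as lost and requeue its chunks" is to track which chunks are in flight on each lease. The `Dispatcher` type below is hypothetical coordinator-side bookkeeping, not part of grafos_leasekit, and it uses plain integers for lease and chunk IDs.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical coordinator-side bookkeeping; lease and chunk IDs are
/// plain integers here purely for illustration.
struct Dispatcher {
    pending: VecDeque<u64>,
    in_flight: HashMap<u64, Vec<u64>>, // lease_id -> chunk IDs
}

impl Dispatcher {
    /// Called for each lease whose renewal failed or which expired.
    /// Returns how many chunks were put back on the queue.
    fn mark_lost(&mut self, lease_id: u64) -> usize {
        match self.in_flight.remove(&lease_id) {
            Some(chunks) => {
                let n = chunks.len();
                self.pending.extend(chunks);
                n
            }
            None => 0,
        }
    }
}
```

Because chunks are idempotent, requeueing a chunk that may have partially run on the lost worker is safe.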

Retry Policy

RetryPolicy classifies errors: Disconnected and LeaseExpired are transient and are retried; everything else is permanent and fails closed:

use grafos_jobs::RetryPolicy;

let policy = RetryPolicy {
    max_retries: 3,
    initial_backoff_secs: 1,
    max_backoff_secs: 16,
};
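
To make the classification and the backoff fields concrete, here is a self-contained sketch. The `ChunkError` enum is a local stand-in that mirrors the error names used in this recipe, and the doubling schedule is an assumption: the field names suggest exponential backoff, but the actual schedule used by RetryPolicy is not specified here.

```rust
/// Hypothetical error type mirroring the names used in this recipe.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChunkError {
    Disconnected,
    LeaseExpired,
    Fenced,
    CapacityExceeded,
}

/// Transient errors are retried; everything else fails closed.
fn is_transient(err: ChunkError) -> bool {
    matches!(err, ChunkError::Disconnected | ChunkError::LeaseExpired)
}

/// Delay before retry number `attempt` (0-based), ASSUMING the backoff
/// doubles each time and is capped at `max_secs`.
fn backoff_secs(attempt: u32, initial_secs: u64, max_secs: u64) -> u64 {
    // checked_shl avoids overflow for very large attempt numbers.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    initial_secs.saturating_mul(factor).min(max_secs)
}
```

Under that doubling assumption, the values above (3 retries, 1 second initial, 16 seconds maximum) would give delays of 1, 2, and 4 seconds, so the cap is never reached.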

Output Storage

If chunk outputs are small, you can return them in the tasklet output buffer. For anything non-trivial:

allocate a block lease for job outputs
write chunk_id -> output_locator into a small index (could be FabricHashMap or a coordinator-local map)

This decouples worker lifetimes from output lifetimes.
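
The index idea can be sketched with an in-memory stand-in. In the example below, a `Vec<u8>` plays the role of the block lease and a `HashMap` plays the role of the index; `OutputIndex` and `OutputLocator` are hypothetical names, and a real implementation would write to the leased block instead.

```rust
use std::collections::HashMap;

/// Where one chunk's output lives inside the shared output region.
#[derive(Debug, Clone, Copy, PartialEq)]
struct OutputLocator {
    offset: usize,
    len: usize,
}

/// In-memory stand-in: `data` plays the role of the block lease and
/// `index` plays the role of the chunk_id -> locator map.
#[derive(Default)]
struct OutputIndex {
    data: Vec<u8>,
    index: HashMap<String, OutputLocator>,
}

impl OutputIndex {
    /// Record an output once; a duplicate chunk_id (from a retry) keeps
    /// the first result and writes nothing.
    fn put(&mut self, chunk_id: &str, bytes: &[u8]) -> OutputLocator {
        if let Some(loc) = self.index.get(chunk_id) {
            return *loc;
        }
        let loc = OutputLocator { offset: self.data.len(), len: bytes.len() };
        self.data.extend_from_slice(bytes);
        self.index.insert(chunk_id.to_string(), loc);
        loc
    }

    /// Look up a chunk's output bytes, if that chunk has been recorded.
    fn get(&self, chunk_id: &str) -> Option<&[u8]> {
        let loc = self.index.get(chunk_id)?;
        Some(&self.data[loc.offset..loc.offset + loc.len])
    }
}
```

Keying on the chunk ID means a retried chunk that finishes twice does not produce a second copy of its output.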

Walkthrough (Implementation Sketch)

1. Choose Chunking

Pick chunk boundaries that give you:

roughly uniform runtime per chunk
deterministic inputs
bounded output size
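
For index-range chunking, a simple starting point is to split the range into contiguous pieces of near-equal length. The helper below is a hypothetical sketch; equal item counts only approximate uniform runtime when per-item cost is similar.

```rust
/// Split `0..total` into at most `parts` contiguous half-open ranges whose
/// lengths differ by at most one. Returns an empty list if either is zero.
fn split_range(total: u64, parts: u64) -> Vec<(u64, u64)> {
    if total == 0 || parts == 0 {
        return Vec::new();
    }
    // Never produce more chunks than there are items.
    let parts = parts.min(total);
    let base = total / parts;
    let extra = total % parts; // the first `extra` chunks get one more item
    let mut out = Vec::with_capacity(parts as usize);
    let mut start = 0;
    for i in 0..parts {
        let len = base + if i < extra { 1 } else { 0 };
        out.push((start, start + len));
        start += len;
    }
    out
}
```

The boundaries depend only on `total` and `parts`, so rerunning the job with the same parameters yields the same chunks and therefore the same chunk IDs.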

2. Acquire CPU Leases

Acquire in a loop until you hit your target:

CpuBuilder::new().cores(n).lease_secs(ttl).acquire()?;

3. Submit Tasklets

For each available lease:

pop a chunk from the queue
submit tasklet WASM with a fuel limit
pass chunk descriptor as input()
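
The pairing of idle leases with queued chunks can be sketched without any fabric calls. `Assignment` and `assign` are hypothetical; the actual tasklet submission would consume each assignment.

```rust
use std::collections::VecDeque;

/// Hypothetical descriptor for one piece of dispatched work.
#[derive(Debug, Clone, PartialEq)]
struct Assignment {
    lease_id: u32,
    chunk_id: u64,
    fuel_limit: u64,
}

/// Give each idle lease one chunk from the front of the queue.
/// Leases left over once the queue is empty are simply not assigned.
fn assign(idle: &[u32], queue: &mut VecDeque<u64>, fuel_limit: u64) -> Vec<Assignment> {
    let mut out = Vec::new();
    for &lease_id in idle {
        match queue.pop_front() {
            Some(chunk_id) => out.push(Assignment { lease_id, chunk_id, fuel_limit }),
            None => break,
        }
    }
    out
}
```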

4. Handle Completion with JobCoordinator

Use JobCoordinator to orchestrate execution with automatic retry:

use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};

let mut store = MemoryOutputStore::new();
let policy = RetryPolicy::default();
let mut coord = JobCoordinator::new(policy);

let result = coord.run(
    &chunks,
    &mut store,
    |chunk_bytes| execute_chunk(chunk_bytes),  // exec_fn
    |outputs| aggregate_results(outputs),       // agg_fn
)?;

// result.aggregate contains the final aggregated output
// result.chunks_succeeded, result.chunks_failed for stats

5. Drop Everything

Drop the CPU leases and any intermediate leases. The lease count should fall back to its pre-job baseline immediately.
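
If lease handles release themselves when they go out of scope, which is an assumption about the lease types here, the cleanup step is just letting the collection of leases drop. The sketch below illustrates that pattern with a hypothetical `Lease` type and a shared counter standing in for the fabric's view of active leases.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical lease handle that releases itself when dropped.
/// The shared counter stands in for the fabric's view of active leases.
struct Lease {
    active: Rc<Cell<usize>>,
}

impl Lease {
    /// Take a lease and bump the active count.
    fn acquire(active: &Rc<Cell<usize>>) -> Lease {
        active.set(active.get() + 1);
        Lease { active: Rc::clone(active) }
    }
}

impl Drop for Lease {
    fn drop(&mut self) {
        // Release happens here, so no explicit cleanup pass is needed.
        self.active.set(self.active.get() - 1);
    }
}
```

Holding all leases in one `Vec<Lease>` owned by the coordinator means an early return or error path drops them too.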

Observability

Your graph should show:

a rapid ramp to many leases
a flat plateau during compute
an immediate drop back to baseline at completion

Add specific signals:

chunks_total, chunks_done, chunks_retry_total
cpu_leases_active
tasklet_duration_ms histogram
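
A coordinator-local struct is enough to start collecting these signals before wiring them into grafos_observe. `BurstMetrics` is a hypothetical sketch and does not use any grafos_observe API; durations are kept as raw samples rather than histogram buckets.

```rust
/// Hypothetical counters named after the signals listed above.
#[derive(Debug, Default)]
struct BurstMetrics {
    chunks_total: u64,
    chunks_done: u64,
    chunks_retry_total: u64,
    cpu_leases_active: u64,
    tasklet_duration_ms: Vec<u64>, // raw samples; a real histogram would bucket these
}

impl BurstMetrics {
    /// Count a finished chunk and record how long its tasklet ran.
    fn record_success(&mut self, duration_ms: u64) {
        self.chunks_done += 1;
        self.tasklet_duration_ms.push(duration_ms);
    }

    /// Count one retry attempt.
    fn record_retry(&mut self) {
        self.chunks_retry_total += 1;
    }

    /// Fraction of chunks finished, in [0.0, 1.0]; 0.0 when there is no work.
    fn progress(&self) -> f64 {
        if self.chunks_total == 0 {
            0.0
        } else {
            self.chunks_done as f64 / self.chunks_total as f64
        }
    }
}
```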

Failure Modes

Disconnected: classified as transient by RetryPolicy; chunk is retried.
LeaseExpired: also transient; requeue chunk.
Fenced / CapacityExceeded: permanent errors; fail closed.
Coordinator crash: leases expire and are reclaimed automatically; rerun the job from scratch or from durable checkpoints.

Variations

multi-phase jobs: map -> reduce -> finalize
heterogeneous leasing: request more cores for heavy chunks
checkpoint coordinator state to block lease for resumable bursts