Recipe 16: Pop-Up Supercomputer in 30 Seconds
Situation
You occasionally need huge parallel compute:
- a one-off simulation
- a deadline-driven batch job
- a “rerun everything” recovery operation
Provisioning a cluster for this can cost more time than the compute itself.
The lease model lets you treat the fabric as a pool:
- acquire lots of CPU quickly
- run
- disappear
This recipe is deliberately about the “unexpectedly cool” version of burst computing: the cluster exists only as long as you hold leases. There is no pre-provisioned worker fleet and no cleanup tail.
What You Build
A coordinator that:
- acquires many CPU leases (short TTL)
- dispatches tasklets for work chunks
- retries chunks on failure via JobCoordinator
- renews leases via RenewalManager
- releases all resources at the end
Building Blocks
- grafos_jobs::{JobCoordinator, RetryPolicy}
- grafos_leasekit::{RenewalManager, RenewalPolicy}
- CpuBuilder leases
- BlockBuilder for shared outputs
- grafos_observe for lease lifecycle visibility
Design
Idempotent Work Units
Each chunk should be safe to run multiple times. Implement WorkChunk for your chunk type with a
stable chunk_id() derived from inputs:
- index range (start, end)
- partition id
- input file hash + shard id
JobCoordinator uses the chunk ID to deduplicate results and ensure idempotent output capture.
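A minimal sketch of a stable chunk ID, assuming a simplified local WorkChunk trait (the real trait's signature in grafos_jobs is not shown in this recipe, so the shape below is a stand-in). The RangeChunk type and the choice of FNV-1a are illustrative. A fixed hash algorithm is used because std's DefaultHasher does not promise the same output across Rust releases.

```rust
/// Local stand-in for the recipe's WorkChunk trait (assumed shape).
trait WorkChunk {
    fn chunk_id(&self) -> u64;
}

/// Hypothetical chunk type: input file hash + shard id + index range.
struct RangeChunk {
    input_hash: u64,
    shard: u32,
    start: u64,
    end: u64,
}

/// FNV-1a over a byte slice, continuing from a running hash value.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

impl WorkChunk for RangeChunk {
    /// Derived only from the chunk's inputs, so a retry of the same chunk
    /// always reports the same ID.
    fn chunk_id(&self) -> u64 {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
        h = fnv1a(&self.input_hash.to_le_bytes(), h);
        h = fnv1a(&self.shard.to_le_bytes(), h);
        h = fnv1a(&self.start.to_le_bytes(), h);
        fnv1a(&self.end.to_le_bytes(), h)
    }
}
```

Anything that varies between attempts (timestamps, lease IDs, worker identity) should stay out of the hash, or retries will look like new chunks.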
Resource Elasticity
Acquire until:
- you hit the target parallelism, or
- acquisition fails (capacity pressure)
The coordinator is itself small; it can run anywhere.
Treat acquisition failure as a normal operating condition:
- if you can only acquire 30 cores, the job still runs, just slower
- if you can acquire more later, you can add leases mid-run
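The acquire-until-target-or-failure rule can be sketched as a small generic loop. The acquire_one closure is a hypothetical stand-in for whatever call actually acquires a lease (for example the CpuBuilder chain shown later in this recipe); nothing here depends on the real grafos types.

```rust
/// Acquire leases until `target` is reached or the first acquisition fails.
/// `acquire_one` is a hypothetical stand-in for the real acquisition call.
fn acquire_up_to<L, E>(
    target: usize,
    mut acquire_one: impl FnMut() -> Result<L, E>,
) -> Vec<L> {
    let mut leases = Vec::with_capacity(target);
    while leases.len() < target {
        match acquire_one() {
            Ok(lease) => leases.push(lease),
            // Capacity pressure is a normal condition: stop asking and
            // run with whatever was granted.
            Err(_) => break,
        }
    }
    leases
}
```

Calling the same function again later with the remaining deficit as the target is one way to add leases mid-run.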
Lease Lifetimes and Renewal
Use RenewalManager to keep CPU leases alive during execution:
use grafos_leasekit::{RenewalManager, RenewalPolicy};
let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy {
    renew_at_fraction: 0.70, // renew at 70% elapsed (30% remaining)
    ..RenewalPolicy::default()
};
// Register each CPU lease
for (lease_id, expiry) in &cpu_leases {
    renewal_mgr.register(*lease_id, *expiry, policy);
}
// Tick periodically during execution
let summary = renewal_mgr.tick(now);
// summary.renewed, summary.failed, summary.expired
If renewal fails, mark the worker as lost and requeue its chunks.
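To make the timing concrete, here is a hedged sketch of the arithmetic behind renew_at_fraction and of requeueing a lost worker's chunks. Both functions are illustrative helpers, not part of grafos_leasekit, and they assume the fraction applies to the lease's full lifetime measured in seconds.

```rust
use std::collections::VecDeque;

/// Time (in seconds) at which a lease becomes due for renewal, given when
/// it was granted, when it expires, and the fraction of its lifetime that
/// may elapse first. Hypothetical helper, not a grafos API.
fn renew_at_secs(granted_at: u64, expiry: u64, renew_at_fraction: f64) -> u64 {
    let ttl = expiry.saturating_sub(granted_at);
    granted_at + (ttl as f64 * renew_at_fraction).round() as u64
}

/// Put the chunks a lost worker was holding back at the front of the
/// queue, preserving their original order.
fn requeue_lost(queue: &mut VecDeque<u64>, in_flight: Vec<u64>) {
    for chunk_id in in_flight.into_iter().rev() {
        queue.push_front(chunk_id);
    }
}
```

With a 60-second TTL and a fraction of 0.70, the lease becomes due for renewal 42 seconds after it was granted, which leaves 18 seconds for the renewal round trip and any retry.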
Retry Policy
RetryPolicy classifies errors: Disconnected and LeaseExpired are transient (retried),
everything else is permanent (fail closed):
use grafos_jobs::RetryPolicy;
let policy = RetryPolicy {
    max_retries: 3,
    initial_backoff_secs: 1,
    max_backoff_secs: 16,
};
Output Storage
If chunk outputs are small, you can return them in the tasklet output buffer. For anything non-trivial:
- allocate a block lease for job outputs
- write chunk_id -> output_locator into a small index (could be FabricHashMap or a coordinator-local map)
This decouples worker lifetimes from output lifetimes.
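A coordinator-local version of the chunk_id -> output_locator index might look like the following sketch. OutputLocator and OutputIndex are hypothetical names, and the offset/length layout inside the output block lease is an assumption; the recipe does not specify how outputs are laid out.

```rust
use std::collections::HashMap;

/// Where a chunk's output lives inside the job's output block lease.
/// Field names are illustrative, not taken from the grafos API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OutputLocator {
    offset: u64,
    len: u64,
}

/// Coordinator-local index: chunk_id -> output_locator.
#[derive(Default)]
struct OutputIndex {
    next_offset: u64,
    by_chunk: HashMap<u64, OutputLocator>,
}

impl OutputIndex {
    /// Reserve space for a chunk's output. A retried chunk gets the locator
    /// recorded the first time, so duplicates do not consume more space.
    fn record(&mut self, chunk_id: u64, len: u64) -> OutputLocator {
        if let Some(existing) = self.by_chunk.get(&chunk_id) {
            return *existing;
        }
        let loc = OutputLocator { offset: self.next_offset, len };
        self.next_offset += len;
        self.by_chunk.insert(chunk_id, loc);
        loc
    }
}
```

Because the index is keyed by chunk ID rather than by worker or lease, it stays valid after every CPU lease has been dropped.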
Walkthrough (Implementation Sketch)
1. Choose Chunking
Pick chunk boundaries that give you:
- roughly uniform runtime per chunk
- deterministic inputs
- bounded output size
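As one simple instance of deterministic chunking, the sketch below splits an index range into fixed-size pieces. The function name is illustrative. Equal item counts only give roughly uniform runtime when per-item cost is similar, so treat the chunk size as a tuning knob.

```rust
/// Split `0..total` into half-open `(start, end)` ranges of at most
/// `chunk_size` items. The same inputs always yield the same boundaries.
fn chunk_ranges(total: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total {
        // The last chunk may be shorter than chunk_size.
        let end = (start + chunk_size).min(total);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```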
2. Acquire CPU Leases
Acquire in a loop until you hit your target:
CpuBuilder::new().cores(n).lease_secs(ttl).acquire()?;
3. Submit Tasklets
For each available lease:
- pop a chunk from the queue
- submit tasklet WASM with a fuel limit
- pass the chunk descriptor as input()
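The pairing of leases with queued chunks can be sketched without any fabric calls. Assignment and assign_chunks are hypothetical names; the actual tasklet submission (WASM module, fuel limit, input()) would happen where each Assignment is consumed and is not shown, since its API is not given in this recipe.

```rust
use std::collections::VecDeque;

/// One unit of dispatch: which lease runs which chunk, under what fuel cap.
#[derive(Debug, PartialEq, Eq)]
struct Assignment {
    lease_id: u64,
    chunk_id: u64,
    fuel_limit: u64,
}

/// Pair each idle lease with the next queued chunk. Leases beyond the
/// queue length stay idle; chunks beyond the lease count stay queued.
fn assign_chunks(
    idle_leases: &[u64],
    queue: &mut VecDeque<u64>,
    fuel_limit: u64,
) -> Vec<Assignment> {
    let mut out = Vec::new();
    for &lease_id in idle_leases {
        match queue.pop_front() {
            Some(chunk_id) => out.push(Assignment { lease_id, chunk_id, fuel_limit }),
            None => break, // queue drained
        }
    }
    out
}
```

Running this again whenever a tasklet completes keeps every held lease busy until the queue is empty.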
4. Handle Completion with JobCoordinator
Use JobCoordinator to orchestrate execution with automatic retry:
use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};
let mut store = MemoryOutputStore::new();
let policy = RetryPolicy::default();
let mut coord = JobCoordinator::new(policy);
let result = coord.run(
    &chunks,
    &mut store,
    |chunk_bytes| execute_chunk(chunk_bytes), // exec_fn
    |outputs| aggregate_results(outputs), // agg_fn
)?;
// result.aggregate contains the final aggregated output
// result.chunks_succeeded, result.chunks_failed for stats
5. Drop Everything
Drop CPU leases and intermediate leases. Lease count should fall back to baseline immediately.
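If lease handles release on drop (an assumption here; the recipe implies it but does not show the handle type), the "fall back to baseline" behaviour follows from ordinary Rust scoping. The LeaseGuard type below is a local stand-in that only counts active handles, not a grafos type.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for a lease handle that releases itself when dropped.
struct LeaseGuard {
    active: Rc<Cell<usize>>,
}

impl LeaseGuard {
    /// "Acquire" a lease by bumping the shared active-lease counter.
    fn acquire(active: &Rc<Cell<usize>>) -> Self {
        active.set(active.get() + 1);
        LeaseGuard { active: Rc::clone(active) }
    }
}

impl Drop for LeaseGuard {
    /// Releasing is tied to the handle's lifetime, so there is no separate
    /// cleanup step to forget.
    fn drop(&mut self) {
        self.active.set(self.active.get() - 1);
    }
}
```

Holding all guards in one Vec owned by the job means the count returns to baseline when that Vec goes out of scope, including on early return via `?`.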
Observability
Your graph should show:
- rapid ramp to many leases
- flat during compute
- immediate drop to zero at completion
Add specific signals:
- chunks_total, chunks_done, chunks_retry_total
- cpu_leases_active
- tasklet_duration_ms histogram
Failure Modes
- Disconnected: classified as transient by RetryPolicy; chunk is retried.
- LeaseExpired: also transient; requeue chunk.
- Fenced / CapacityExceeded: permanent errors; fail closed.
- Coordinator crash: leases expire and reclaim automatically; rerun job from scratch or from durable checkpoints.
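The classification above, combined with the RetryPolicy fields shown earlier, can be sketched as a single decision function. The ChunkError and Decision enums are local stand-ins, and the doubling backoff capped at max_backoff_secs is an assumption about how initial_backoff_secs and max_backoff_secs interact; the recipe does not state the exact schedule.

```rust
/// Error kinds named in this recipe. The enum itself is a local stand-in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkError {
    Disconnected,
    LeaseExpired,
    Fenced,
    CapacityExceeded,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    RetryAfterSecs(u64),
    FailClosed,
}

/// Disconnected and LeaseExpired are transient; everything else is permanent.
fn is_transient(err: ChunkError) -> bool {
    matches!(err, ChunkError::Disconnected | ChunkError::LeaseExpired)
}

/// Decide what to do when a chunk fails after `attempt` earlier retries.
/// Assumes the backoff doubles per attempt, capped at `max_backoff_secs`.
fn decide(
    err: ChunkError,
    attempt: u32,
    max_retries: u32,
    initial_backoff_secs: u64,
    max_backoff_secs: u64,
) -> Decision {
    if !is_transient(err) || attempt >= max_retries {
        return Decision::FailClosed;
    }
    // 2^attempt, saturating instead of overflowing for large attempts.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let delay = initial_backoff_secs.saturating_mul(factor).min(max_backoff_secs);
    Decision::RetryAfterSecs(delay)
}
```

Under these assumptions, max_retries: 3 with a 1-second initial backoff yields waits of 1, 2, and 4 seconds before the chunk fails closed.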
Variations
- multi-phase jobs: map -> reduce -> finalize
- heterogeneous leasing: request more cores for heavy chunks
- checkpoint coordinator state to block lease for resumable bursts