Recipe 6: A Build System That Scales to 100 Cores in 200 Milliseconds
Situation
You have a build (or test) workload that is embarrassingly parallel:
- compile units
- code generation
- lint / formatting
- test shards
Traditional “burst capacity” approaches (VMs, containers, k8s) have non-trivial startup and teardown:
- Seconds to minutes of overhead.
- A cleanup problem when coordinators die mid-run.
In a lease-based system, you want to:
- acquire CPU capacity quickly
- run work in a sandbox
- return all resources automatically on drop or TTL expiry
The goal: make the “build farm” a transient resource graph rather than a fleet.
What You Build
A coordinator that:
- Leases CPU resources across multiple nodes (short TTL).
- Dispatches compilation/test work as WASM tasklets with fuel limits.
- Writes intermediate artifacts (object files, test logs) to leased block storage.
- Drops all leases at the end, leaving no cleanup tail.
Building Blocks
- grafos_std::cpu::CpuBuilder and CpuLease
- WASM tasklets (the payload you submit)
- grafos_std::block::BlockBuilder for shared artifact storage
- grafos_jobs::{JobCoordinator, RetryPolicy} for idempotent retry and dispatch
- grafos_leasekit::RenewalManager for CPU lease renewal
- grafos_observe for measuring lease churn and throughput
Design
Work Decomposition
You need deterministic chunking:
- For compilation: per-crate or per-module units.
- For tests: shard the list of tests by hash.
Each chunk must be:
- idempotent (safe to retry)
- bounded in time (fuel / TTL)
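The hash-based sharding above can be sketched in plain Rust. This is a minimal illustration, not part of the grafos API: `fnv1a64`, `shard_for`, and `shard_tests` are hypothetical helper names, and FNV-1a is chosen here only because it is stable across runs and toolchains (the standard library's default hasher does not promise that).

```rust
/// FNV-1a 64-bit hash: tiny, dependency-free, and stable across runs and
/// toolchains, which is what deterministic chunking needs.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: map a test name to one of `shard_count` shards.
fn shard_for(test_name: &str, shard_count: usize) -> usize {
    assert!(shard_count > 0, "need at least one shard");
    (fnv1a64(test_name.as_bytes()) % shard_count as u64) as usize
}

/// Split a test list into shards. Each shard is sorted so the result does
/// not depend on the order in which tests were discovered.
fn shard_tests(tests: &[&str], shard_count: usize) -> Vec<Vec<String>> {
    let mut shards: Vec<Vec<String>> = vec![Vec::new(); shard_count];
    for t in tests {
        shards[shard_for(t, shard_count)].push(t.to_string());
    }
    for shard in &mut shards {
        shard.sort();
    }
    shards
}
```

Because a test's shard depends only on its name and the shard count, a retried chunk contains exactly the same tests as the original attempt, which is what makes the retry idempotent.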
Isolation Model
Each task runs as WASM:
- fuel-limited to bound CPU usage
- bounded output size
This gives you container-like isolation without a container runtime.
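To make the two bounds concrete, here is a toy model of fuel and output accounting. It is only a sketch: real WASM runtimes meter fuel per executed instruction inside the engine, and `Budget`, `Sandbox`, `TaskError`, and `run_task` are hypothetical names, not grafos types.

```rust
#[derive(Debug, PartialEq)]
enum TaskError {
    OutOfFuel,
    OutputTooLarge,
}

/// Hypothetical per-task budget: a fuel allowance and an output cap.
struct Budget {
    fuel: u64,
    max_output: usize,
}

struct Sandbox {
    fuel_left: u64,
    max_output: usize,
    output: Vec<u8>,
}

impl Sandbox {
    fn new(budget: Budget) -> Self {
        Sandbox { fuel_left: budget.fuel, max_output: budget.max_output, output: Vec::new() }
    }

    /// Charge `cost` units before doing work; fail once the budget is spent.
    fn charge(&mut self, cost: u64) -> Result<(), TaskError> {
        if cost > self.fuel_left {
            self.fuel_left = 0;
            return Err(TaskError::OutOfFuel);
        }
        self.fuel_left -= cost;
        Ok(())
    }

    /// Append to the output buffer, refusing to grow past `max_output`.
    fn emit(&mut self, bytes: &[u8]) -> Result<(), TaskError> {
        if self.output.len() + bytes.len() > self.max_output {
            return Err(TaskError::OutputTooLarge);
        }
        self.output.extend_from_slice(bytes);
        Ok(())
    }
}

/// Stand-in workload: sums the input bytes, charging one fuel unit per byte,
/// then emits the sum as 8 little-endian bytes.
fn run_task(input: &[u8], budget: Budget) -> Result<Vec<u8>, TaskError> {
    let mut sandbox = Sandbox::new(budget);
    let mut sum: u64 = 0;
    for &b in input {
        sandbox.charge(1)?;
        sum += u64::from(b);
    }
    sandbox.emit(&sum.to_le_bytes())?;
    Ok(sandbox.output)
}
```

The point of the model is that both limits fail the task deterministically, so the coordinator can treat an over-budget chunk as an ordinary error rather than a hung worker.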
Artifact Storage
Intermediate artifacts must outlive any single worker:
- store in block leases
- coordinator can read them back and link/aggregate
In more advanced designs, you might use a content-addressed store. For the recipe, a simple “object file per chunk” layout works.
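One way to read "object file per chunk" is a fixed-size slot per chunk inside the block lease. The sketch below assumes a 4096-byte block size and a fixed number of blocks per chunk; neither figure comes from the recipe, and `Slot`, `slot_for`, and `fits` are hypothetical names.

```rust
/// Assumed block size in bytes; the recipe does not state one.
const BLOCK_SIZE: u64 = 4096;

/// A contiguous range of blocks reserved for one chunk's artifact.
#[derive(Debug, PartialEq)]
struct Slot {
    first_block: u64,
    block_count: u64,
}

/// Fixed-size slot layout: chunk `i` owns blocks [i * n, (i + 1) * n).
/// Returns None if the slot would fall outside the lease or on overflow.
fn slot_for(chunk_index: u64, blocks_per_chunk: u64, total_blocks: u64) -> Option<Slot> {
    if blocks_per_chunk == 0 {
        return None;
    }
    let first = chunk_index.checked_mul(blocks_per_chunk)?;
    let end = first.checked_add(blocks_per_chunk)?;
    if end > total_blocks {
        return None;
    }
    Some(Slot { first_block: first, block_count: blocks_per_chunk })
}

/// Check whether an artifact of `artifact_len` bytes fits in its slot.
fn fits(artifact_len: u64, slot: &Slot) -> bool {
    artifact_len <= slot.block_count.saturating_mul(BLOCK_SIZE)
}
```

With a 4096-block lease and 8 blocks per chunk, this layout holds 512 chunks of up to 32 KiB each. Since each chunk always maps to the same slot, a retried chunk overwrites its own earlier output instead of leaking space.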
Walkthrough (Implementation Sketch)
1. Coordinator Acquires CPU Leases
use grafos_std::cpu::CpuBuilder;
let mut workers = Vec::new();
for _ in 0..50 {
    let lease = CpuBuilder::new().cores(2).lease_secs(120).acquire()?;
    workers.push(lease);
}
You now have 100 cores worth of leased capacity, but you do not have a fleet to manage. You have 50 lease handles.
2. Coordinator Acquires Block Lease for Artifacts
use grafos_std::block::BlockBuilder;
let artifacts = BlockBuilder::new().min_blocks(4096).lease_secs(600).acquire()?;
3. Dispatch Work as Tasklets
Each worker launches a WASM tasklet:
let result = workers[i].cpu()
    .submit(tasklet_wasm_bytes)
    .fuel(5_000_000)
    .input(chunk_descriptor_bytes)
    .launch()?;
The tasklet writes its output to block storage (or returns it in the output buffer, if small).
4. Retry and Failure Handling
Use JobCoordinator with a RetryPolicy to handle worker failures declaratively:
use grafos_jobs::{JobCoordinator, RetryPolicy, Backoff};
let policy = RetryPolicy::default()
    .with_max_retries(3)
    .with_backoff(Backoff::exponential(100, 5000));
let mut coordinator = JobCoordinator::new(policy);
let result = coordinator.run(
    chunks,    // work items
    artifacts, // shared block store for outputs
    |chunk, store| {
        // exec_fn: run one chunk
        let lease = CpuBuilder::new().cores(2).lease_secs(120).acquire()?;
        let r = lease.cpu().submit(tasklet_wasm_bytes).fuel(5_000_000)
            .input(chunk).launch()?;
        store.write(chunk.id, &r.output)?;
        Ok(r)
    },
    |results| {
        // agg_fn: combine outputs
        Ok(results)
    },
)?;
JobCoordinator retries failed chunks with exponential backoff. If a worker lease expires or disconnects, the chunk is resubmitted on a new lease automatically.
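For intuition about the retry timing, the following sketch computes a capped exponential backoff schedule. It assumes that the two arguments to Backoff::exponential(100, 5000) are a base delay and a cap in milliseconds, and that the delay doubles per retry; the recipe does not confirm either, and `backoff_delay` and `schedule` are hypothetical helpers rather than grafos_jobs functions.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped
/// at `cap_ms`. Saturating arithmetic keeps large attempt counts safe.
fn backoff_delay(base_ms: u64, cap_ms: u64, attempt: u32) -> Duration {
    // 2^attempt, or u64::MAX once the shift would overflow 64 bits.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_millis(base_ms.saturating_mul(factor).min(cap_ms))
}

/// Full delay schedule for a policy that allows `max_retries` retries.
fn schedule(base_ms: u64, cap_ms: u64, max_retries: u32) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| backoff_delay(base_ms, cap_ms, attempt))
        .collect()
}
```

Under these assumptions, three retries wait 100 ms, 200 ms, and 400 ms, so a chunk that fails every attempt adds well under a second of waiting before the job reports the failure. The cap only matters for policies with more retries: the seventh retry would otherwise wait 6400 ms.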
5. Lease Renewal
For long-running builds, use RenewalManager to keep CPU leases alive:
use grafos_leasekit::RenewalManager;
let mut renewals = RenewalManager::new();
for (i, worker) in workers.iter().enumerate() {
    renewals.register(i as u64, worker.expiry(), Default::default());
}
// In your event loop:
let summary = renewals.tick(now);
// summary tells you which leases were renewed and which failed.
6. Teardown
When done:
- drop CPU leases
- drop the artifact lease (or keep it if you want cache/reuse)
If the coordinator crashes, TTL expiry tears down leases automatically.
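The "drop means released" behavior is the ordinary Rust RAII pattern. The sketch below uses a mock handle to show it; `MockLease` and `run_job` are hypothetical and stand in for the real lease types, whose internals the recipe does not describe. A shared counter plays the role of an active-lease gauge.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Mock lease handle (not the real CpuLease): bumps a shared counter on
/// acquire and decrements it when the handle is dropped.
struct MockLease {
    active: Arc<AtomicUsize>,
}

impl MockLease {
    fn acquire(active: &Arc<AtomicUsize>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        MockLease { active: Arc::clone(active) }
    }
}

impl Drop for MockLease {
    fn drop(&mut self) {
        // Runs on normal scope exit, early return, and panic unwinding.
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Holds `n` leases for the duration of a "job". When `fail` is true the
/// function returns early, and the leases are still released.
fn run_job(active: &Arc<AtomicUsize>, n: usize, fail: bool) -> Result<usize, String> {
    let leases: Vec<MockLease> = (0..n).map(|_| MockLease::acquire(active)).collect();
    if fail {
        return Err("chunk failed".to_string()); // `leases` is dropped here
    }
    Ok(leases.len())
}
```

Drop covers every exit path inside a live process, including early returns through `?`. It cannot cover a coordinator that is killed outright, which is the case TTL expiry handles.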
Failure Modes
- FabricError::Disconnected: worker node unreachable; retry elsewhere.
- FabricError::LeaseExpired: TTL hit; this is a bug if common; increase TTL or renew earlier.
- Output too large: enforce max_output and store outputs in block storage.
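The list above amounts to a small decision table, which could be written as a match. This is a sketch under assumptions: `TaskFailure` is an illustrative stand-in (the real FabricError may have different variants or payloads), `Action` and `next_action` are hypothetical, and doubling the TTL is just one possible reading of "increase TTL".

```rust
/// Illustrative stand-in for the failure cases listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskFailure {
    Disconnected,
    LeaseExpired,
    OutputTooLarge,
}

/// What the coordinator does next with the failed chunk.
#[derive(Debug, PartialEq)]
enum Action {
    RetryElsewhere,
    RetryWithLongerTtl { lease_secs: u64 },
    Fail,
}

/// Pick the next step for a chunk that failed on attempt number `attempt`
/// (0-based), given the retry limit and the TTL used for that attempt.
fn next_action(failure: TaskFailure, attempt: u32, max_retries: u32, lease_secs: u64) -> Action {
    if attempt >= max_retries {
        return Action::Fail; // retry budget exhausted
    }
    match failure {
        // Node unreachable: the chunk is idempotent, so run it on another node.
        TaskFailure::Disconnected => Action::RetryElsewhere,
        // TTL hit: retry with a longer lease (doubling is an assumption).
        TaskFailure::LeaseExpired => Action::RetryWithLongerTtl {
            lease_secs: lease_secs.saturating_mul(2),
        },
        // Re-running the same chunk would overflow again; fix the task instead.
        TaskFailure::OutputTooLarge => Action::Fail,
    }
}
```

Treating an oversized output as non-retryable keeps the retry budget for transient faults; the durable fix is to have the tasklet write large outputs to block storage.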
Observability
Track:
- cpu_leases_active
- tasklets_launched_total
- tasklet_duration_ms histogram
- artifact_bytes_written
- retries_total
The “wow” metric is: lease count returns to zero immediately after job completion.
Variations
- Warm pool: keep a few CPU leases alive for near-zero-latency bursts.
- Cache: store compilation outputs in block storage keyed by hash.
- Heterogeneous workers: pick nodes with AVX512 / big cores for heavy chunks.