Recipe 32: 1000 GPUs for One Second

Situation

You’re a researcher evaluating 1000 hyperparameter configurations for a model. Today’s options:

SLURM queue: Wait days for 8 GPUs, then run sequentially for hours. Wall-clock: days.
Rent a cluster: Provision 1000 GPUs. Provisioning takes longer than compute. You pay for the provisioning time. Teardown is another project.
Spot instances: Chase preemptions across regions. Maybe get 200 GPUs. Write retry logic. Still hours.

The compute itself is embarrassingly parallel. Each configuration is independent — run a kernel, read a number, move on. If you could get 1000 GPUs simultaneously, the total wall-clock time would be seconds.

In the fabricBIOS model, GPU compute is a pool of VRAM slices, not a pool of whole machines. You lease 1000 slices, submit one kernel to each, collect results, and drop everything. The VRAM returns to the pool immediately. No provisioning, no teardown, no orphaned resources.

What You Build

A coordinator that:

Defines 1000 work chunks (each = one hyperparameter configuration).
Acquires GPU leases in a loop — as many as the fabric provides.
Fans out: each chunk → GPU_SUBMIT to a leased VRAM slice.
Collects results via JobCoordinator with automatic retry on transient failures.
Finds the best configuration.
Drops all leases — VRAM returns to the pool.

Building Blocks

grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu} — GPU leasing and submission — source
grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore, WorkChunk, ChunkId} — idempotent burst compute — source
grafos_leasekit::{RenewalManager, RenewalPolicy} — lease renewal during execution — source

GPU_SUBMIT wire format (source)
Recipe 16: Pop-Up Supercomputer — same burst pattern with CPU tasklets
Recipe 29: CUDA Kernel on Leased GPU — single GPU_SUBMIT walkthrough

Design

GPU Capacity as a Pool

A 48 GB GPU can serve six 8 GB leases or forty-eight 1 GB leases simultaneously. A fabric with 100 nodes, each with four 48 GB GPUs, exposes up to 19,200 one-GB slices (100 × 4 × 48). You don’t rent machines; you lease VRAM.

Your coordinator doesn’t know or care which physical cards it gets. It asks for VRAM, gets lease handles, submits kernels, reads results.
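The slice arithmetic above can be checked with a few lines of Rust. This is a minimal sketch, not part of any fabricBIOS API: the function name and parameters are hypothetical, and it assumes every GPU is carved into equal slices and that a slice cannot span two cards.

```rust
/// Upper bound on equal-sized slices a fabric can expose.
/// Hypothetical helper for illustration; assumes no other tenants hold
/// leases and that a slice never spans two physical GPUs.
fn max_slices(nodes: u64, gpus_per_node: u64, vram_gb_per_gpu: u64, slice_gb: u64) -> u64 {
    if slice_gb == 0 {
        return 0; // a zero-sized slice is meaningless; avoid dividing by zero
    }
    // Integer division: VRAM left over on a card after the last whole
    // slice cannot be leased at this slice size.
    let slices_per_gpu = vram_gb_per_gpu / slice_gb;
    nodes * gpus_per_node * slices_per_gpu
}
```

With the figures from the text, `max_slices(1, 1, 48, 8)` gives 6 and `max_slices(100, 4, 48, 1)` gives 19,200. A slice size that does not divide the card evenly, such as 5 GB on a 48 GB card, strands the remainder.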

Elastic Acquisition

Acquire in a loop until you hit the target or the fabric is exhausted:

let mut leases = Vec::new();
for _ in 0..1000 {
    match GpuBuilder::new().min_vram(vram_per_config).lease_secs(60).acquire() {
        Ok(lease) => leases.push(lease),
        Err(_) => break,  // fabric exhausted — run with what we have
    }
}

If you get 800 instead of 1000, the job still runs: the 200 chunks without a slice of their own wait for a free one. If capacity frees up later (for example, when other tenants’ leases expire), you can add leases mid-run.
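To see what a partial acquisition costs, the loop can be modelled without the fabric. The sketch below is an assumption-laden stand-in: `acquire_up_to` and `waves` are hypothetical helpers, and the `acquire` closure replaces the real `GpuBuilder` call.

```rust
/// Keep calling `acquire` until `target` leases are held or it first fails.
/// `acquire` stands in for a real lease request (hypothetical signature).
fn acquire_up_to<L, E>(target: usize, mut acquire: impl FnMut() -> Result<L, E>) -> Vec<L> {
    let mut leases = Vec::with_capacity(target);
    while leases.len() < target {
        match acquire() {
            Ok(lease) => leases.push(lease),
            Err(_) => break, // pool exhausted: run with what we have
        }
    }
    leases
}

/// Sequential passes needed to push `chunks` through `leases` slots,
/// assuming one chunk per lease at a time. None if no lease was acquired.
fn waves(chunks: usize, leases: usize) -> Option<usize> {
    if leases == 0 {
        return None;
    }
    Some((chunks + leases - 1) / leases) // ceiling division
}
```

Under these assumptions, 800 leases for 1000 chunks means two passes instead of one, so wall-clock time roughly doubles rather than the job failing.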

Stateless Kernels

Each kernel invocation is stateless: input bytes in, output bytes out. No shared device memory between invocations. This is the GPU_SUBMIT pattern (Recipe 29), not the session pattern (Recipe 30). The node creates a CUDA context, loads PTX, launches the kernel, reads output, destroys the context — one call.

This matters because stateless dispatch is idempotent. If a lease expires mid-execution, JobCoordinator retries the chunk on a different lease.
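The retry behaviour described here can be sketched in isolation. This is not the JobCoordinator implementation: `SubmitError`, `run_chunk`, and the `submit` closure are hypothetical names, and rotating to the next lease index on each attempt is an assumption about how a different lease is chosen.

```rust
#[derive(Debug, PartialEq)]
enum SubmitError {
    LeaseExpired, // transient: another lease may succeed
    LaunchFailed, // permanent: retrying cannot help
}

/// Run one stateless chunk, rotating to the next lease after each
/// transient failure. `submit(lease_index, input)` stands in for GPU_SUBMIT.
fn run_chunk(
    input: &[u8],
    lease_count: usize,
    max_retries: usize,
    mut submit: impl FnMut(usize, &[u8]) -> Result<Vec<u8>, SubmitError>,
) -> Result<Vec<u8>, SubmitError> {
    assert!(lease_count > 0, "need at least one lease");
    // One initial attempt plus up to `max_retries` retries.
    for attempt in 0..=max_retries {
        let lease = attempt % lease_count;
        match submit(lease, input) {
            // Safe to resubmit the same bytes: the kernel keeps no state.
            Err(SubmitError::LeaseExpired) => continue,
            // Success or a permanent error ends the loop immediately.
            other => return other,
        }
    }
    Err(SubmitError::LeaseExpired)
}
```

Because the input bytes are the only state, the retry needs no cleanup on the failed lease and no coordination with it.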

Walkthrough (Implementation Sketch)

1. Define Work Chunks

use grafos_jobs::{WorkChunk, ChunkId};

#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct HyperparamConfig {
    config_id: u64,
    learning_rate: f32,
    batch_size: u32,
    dropout: f32,
}

impl WorkChunk for HyperparamConfig {
    fn chunk_id(&self) -> ChunkId { ChunkId(self.config_id) }
    fn to_bytes(&self) -> Vec<u8> {
        postcard::to_allocvec(self).unwrap()
    }
    fn from_bytes(bytes: &[u8]) -> grafos_std::Result<Self> {
        postcard::from_bytes(bytes).map_err(|_| grafos_std::FabricError::IoError(-200))
    }
}

2. Generate the Search Space

let configs: Vec<Box<dyn WorkChunk>> = (0..1000).map(|i| {
    Box::new(HyperparamConfig {
        config_id: i,
        learning_rate: 0.0001 * (1.0 + (i % 100) as f32 * 0.01),
        batch_size: 32 * (1 + (i / 100) as u32),
        dropout: 0.1 + (i % 10) as f32 * 0.05,
    }) as Box<dyn WorkChunk>
}).collect();
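It can help to check what the index arithmetic actually covers before spending GPU time on it. The sketch below reuses the same three formulas in a standalone function and counts distinct values; `config_for`, `Coverage`, and `coverage` are hypothetical names introduced for illustration.

```rust
use std::collections::HashSet;

/// Same index-to-hyperparameter mapping as the generator above.
fn config_for(i: u64) -> (f32, u32, f32) {
    let learning_rate = 0.0001 * (1.0 + (i % 100) as f32 * 0.01);
    let batch_size = 32 * (1 + (i / 100) as u32);
    let dropout = 0.1 + (i % 10) as f32 * 0.05;
    (learning_rate, batch_size, dropout)
}

#[derive(Debug, PartialEq)]
struct Coverage {
    learning_rates: usize,
    batch_sizes: usize,
    dropouts: usize,
    lr_dropout_pairs: usize,
    configs: usize,
}

/// Count distinct values per axis and distinct full configurations.
fn coverage(n: u64) -> Coverage {
    let (mut lrs, mut batches, mut dropouts) = (HashSet::new(), HashSet::new(), HashSet::new());
    let (mut pairs, mut configs) = (HashSet::new(), HashSet::new());
    for i in 0..n {
        let (lr, bs, dropout) = config_for(i);
        // f32 is not Hash, so compare bit patterns instead.
        let (lr, dropout) = (lr.to_bits(), dropout.to_bits());
        lrs.insert(lr);
        batches.insert(bs);
        dropouts.insert(dropout);
        pairs.insert((lr, dropout));
        configs.insert((lr, bs, dropout));
    }
    Coverage {
        learning_rates: lrs.len(),
        batch_sizes: batches.len(),
        dropouts: dropouts.len(),
        lr_dropout_pairs: pairs.len(),
        configs: configs.len(),
    }
}
```

For 1000 indices this yields 100 learning rates, 10 batch sizes (32 to 320), 10 dropout values, and 1000 distinct configurations. Since `i % 10` is determined by `i % 100`, each learning rate is always paired with the same dropout, so only 100 of the 1000 possible (learning rate, dropout) pairs are explored. If the two should vary independently, derive them from different digits of the index.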

3. Acquire GPU Leases

use grafos_std::gpu::GpuBuilder;
use grafos_leasekit::{RenewalManager, RenewalPolicy};

let vram_per_config: u64 = 512 * 1024 * 1024; // 512 MiB per eval
let lease_ttl: u32 = 120;

let mut leases: Vec<_> = Vec::new();
let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy::default();

for _ in 0..1000 {
    match GpuBuilder::new().min_vram(vram_per_config).lease_secs(lease_ttl).acquire() {
        Ok(lease) => {
            renewal_mgr.register(
                lease.lease_id() as u64,
                lease.expires_at_unix_secs(),
                policy,
            );
            leases.push(lease);
        }
        Err(_) => break,
    }
}
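// Got leases.len() GPU slices: might be 1000, might be fewer.
// Step 5 indexes leases modulo leases.len(), so zero leases would panic there.
assert!(!leases.is_empty(), "no GPU leases acquired");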

4. Compile the Evaluation Kernel

// eval_config.cu — evaluate one hyperparameter configuration
extern "C" __global__ void eval_config(
    float* output,          // single float: validation loss
    float learning_rate,
    int batch_size,
    float dropout
) {
    float validation_loss = 0.0f;  // placeholder; the elided loop below sets it
    // ... mini training loop on embedded dataset ...
    // Write final validation loss to output[0]
    if (threadIdx.x == 0) {
        output[0] = validation_loss;
    }
}

Compile once: nvcc --ptx eval_config.cu -o eval_config.ptx

5. Fan Out with JobCoordinator

use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};

let ptx = include_bytes!("../vectors/gpu/eval_config.ptx");
let mut lease_idx = 0;

let mut output_store = MemoryOutputStore::new();
let mut coord = JobCoordinator::new(RetryPolicy {
    max_retries: 3,
    initial_backoff_secs: 1,
    max_backoff_secs: 16,
});

let result = coord.run(
    &configs,
    &mut output_store,
    |chunk_bytes| {
        let config: HyperparamConfig = postcard::from_bytes(chunk_bytes)
            .map_err(|_| grafos_std::FabricError::IoError(-200))?;

        // Round-robin across available leases
        let lease = &leases[lease_idx % leases.len()];
        lease_idx += 1;

        // Build kernel args
        let lr_bytes = config.learning_rate.to_ne_bytes();
        let bs_bytes = config.batch_size.to_ne_bytes();
        let do_bytes = config.dropout.to_ne_bytes();

        let result = lease.gpu()
            .submit("eval_config", ptx)
            .grid([1, 1, 1])
            .block([256, 1, 1])
            .arg(&lr_bytes)
            .arg(&bs_bytes)
            .arg(&do_bytes)
            .max_output(4)        // one f32
            .launch()?;

        Ok(result.output)
    },
    |outputs| {
        // Find the config with the lowest loss
        let mut best_id: u64 = 0;
        let mut best_loss: f32 = f32::MAX;
        for (chunk_id, output) in outputs {
            if output.len() >= 4 {
                let loss = f32::from_ne_bytes(output[..4].try_into().unwrap());
                if loss < best_loss {
                    best_loss = loss;
                    best_id = chunk_id.0;
                }
            }
        }
        postcard::to_allocvec(&(best_id, best_loss)).unwrap()
    },
)?;

let (best_config_id, best_loss): (u64, f32) =
    postcard::from_bytes(&result.aggregate).unwrap();
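The aggregation closure is easier to test when pulled out as a plain function. The following is a sketch under assumptions, not the recipe's required form: `best_config` is a hypothetical helper that takes owned `(id, bytes)` pairs instead of the coordinator's output type, and it deliberately differs from the closure above by returning `None` for an empty result set and by skipping NaN losses.

```rust
/// Pick the (chunk id, loss) pair with the lowest loss from raw kernel
/// outputs. Each output is expected to start with one native-endian f32,
/// matching the from_ne_bytes decoding used in the walkthrough.
fn best_config(outputs: &[(u64, Vec<u8>)]) -> Option<(u64, f32)> {
    outputs
        .iter()
        .filter_map(|(id, bytes)| {
            // Ignore outputs too short to hold an f32.
            let raw: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
            let loss = f32::from_ne_bytes(raw);
            // Skip NaN so a diverged run cannot be selected.
            if loss.is_nan() { None } else { Some((*id, loss)) }
        })
        // total_cmp gives a total order on f32; ties keep the first entry.
        .min_by(|a, b| a.1.total_cmp(&b.1))
}
```

Returning an `Option` makes "no usable results" distinguishable from "config 0 won with loss `f32::MAX`", which the inline version cannot express.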

6. Drop Everything

drop(leases);  // All VRAM returns to pool immediately.
coord.teardown(&mut output_store);

Your graph shows: rapid ramp to ~1000 GPU leases, flat during compute, immediate drop to zero.
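The "drop returns VRAM" behaviour is the ordinary Rust RAII pattern. The toy model below shows the shape of it with a local counter. It is only an analogy: `Pool` and `Lease` are hypothetical types, and a real fabric lease releases remote capacity rather than incrementing a local cell.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Toy VRAM pool: a shared counter of free bytes.
#[derive(Clone)]
struct Pool {
    free: Rc<Cell<u64>>,
}

/// A lease that gives its bytes back to the pool when dropped.
struct Lease {
    pool: Pool,
    bytes: u64,
}

impl Pool {
    fn new(bytes: u64) -> Self {
        Pool { free: Rc::new(Cell::new(bytes)) }
    }

    fn free_bytes(&self) -> u64 {
        self.free.get()
    }

    /// Reserve `bytes`, or return None if the pool cannot cover them.
    fn acquire(&self, bytes: u64) -> Option<Lease> {
        let free = self.free.get();
        if bytes > free {
            return None;
        }
        self.free.set(free - bytes);
        Some(Lease { pool: self.clone(), bytes })
    }
}

impl Drop for Lease {
    fn drop(&mut self) {
        // Runs for every element when a Vec<Lease> is dropped.
        self.pool.free.set(self.pool.free.get() + self.bytes);
    }
}
```

Dropping the `Vec` runs `Drop` on each element, which is why a single `drop(leases)` is enough to release every slice.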

Failure Modes

CapacityExceeded: Fabric doesn’t have 1000 free slices. Job runs with fewer — just slower.
LeaseExpired: Classified as transient by RetryPolicy. Chunk is retried on another lease.
STATUS_LOAD_FAILED: The node failed to load (JIT-compile) the PTX, usually because of an architecture mismatch with the leased GPU. Recompile without -arch.
STATUS_LAUNCH_FAILED: Too many threads or bad kernel args. Permanent error — chunk fails.
Disconnected: Node went away. Transient — retry on a different node’s lease.
Coordinator crash: All leases expire on their own. VRAM returns. Rerun the job.
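The list above splits into failures worth retrying and failures that are not. The sketch below encodes that split, plus a capped backoff using the `RetryPolicy` values from step 5. The names are hypothetical, treating STATUS_LOAD_FAILED as permanent is an inference from "recompile", and doubling the delay per attempt is an assumption about how the backoff grows between `initial_backoff_secs` and `max_backoff_secs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Failure {
    CapacityExceeded, // handled at acquisition time, not per chunk
    LeaseExpired,
    LoadFailed,   // STATUS_LOAD_FAILED
    LaunchFailed, // STATUS_LAUNCH_FAILED
    Disconnected,
}

/// True if retrying the same chunk on another lease may succeed.
fn is_transient(f: Failure) -> bool {
    matches!(f, Failure::LeaseExpired | Failure::Disconnected)
}

/// Delay before retry number `attempt` (0-based), doubling each time
/// and capped at `max`. The doubling rule is an assumption.
fn backoff_secs(attempt: u32, initial: u64, max: u64) -> u64 {
    // checked_shl returns None once the shift would overflow 64 bits.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    initial.saturating_mul(factor).min(max)
}
```

With `initial = 1` and `max = 16`, three retries would wait 1, 2, and 4 seconds under this assumption, so the cap only matters if `max_retries` is raised above 4.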

Observability

gpu_leases_active — should spike to ~1000, then drop to 0
gpu_submit_total / gpu_submit_errors — kernel execution rate
chunks_done / chunks_retry_total — job progress
gpu_vram_allocated_bytes — total fabric VRAM in use
Wall-clock time: start to finish should be seconds, not hours

Variations

Monte Carlo simulation: Each GPU runs a simulation with different random seeds. Aggregate by averaging or computing confidence intervals.
Batch inference: Each GPU runs inference on a different input batch. Aggregate results into a single output set.
Rendering: Each GPU renders a different frame or tile. Collect frames into a video.
Genetic algorithm: Each GPU evaluates fitness for a different individual. Aggregate selects the fittest for the next generation.
Right-sizing VRAM: If your kernel only needs 256 MiB, lease 256 MiB — not 8 GB. Smaller slices mean more concurrent leases from the same hardware.
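For the Monte Carlo variation, the aggregate step replaces "find the minimum" with a mean and an interval. A minimal sketch, assuming each GPU returns one independent estimate as an `f64` and that a normal approximation (z = 1.96) is acceptable; `mean_ci95` is a hypothetical helper name.

```rust
/// Mean and 95% confidence half-width of per-GPU estimates, using the
/// sample variance and a normal approximation (z = 1.96).
/// Returns None for fewer than two samples, where variance is undefined.
fn mean_ci95(samples: &[f64]) -> Option<(f64, f64)> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    // Unbiased sample variance (divide by n - 1).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
    // Standard error of the mean, scaled to a 95% interval.
    let half_width = 1.96 * (var / n as f64).sqrt();
    Some((mean, half_width))
}
```

The interval is `mean ± half_width`. With around 1000 samples the normal approximation is usually reasonable; for a handful of samples a Student's t quantile would be the safer choice.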

Testing

Run scheduler and job tests locally, then validate the GPU burst on a GPU-capable cell:

cargo test -p grafos-jobs -- coordinator     # retry and aggregation
cargo test -p fabricbios-core -- gpu_submit  # wire format roundtrips
grafos deploy run --requires gpu --replicas 1000 --tasklet gpu-burst --json