Recipe 34: Parallel GPU Sessions for Multi-Kernel Burst

Situation

You have a multi-step GPU pipeline — decode, denoise, encode — and 100 inputs to process. Each input needs all three steps on the same GPU, because intermediate buffers live in device memory between kernels.

Your options today:

Sequential on one GPU: Process each input serially. Safe, but slow — 100 × pipeline latency.
Single-shot dispatch: Stateless GPU_SUBMIT per step. But then intermediate results must be copied back to the host between steps, sent to the GPU again, copied back again. Three round-trips per input instead of one.
Session on one GPU: Open a persistent session, run all three kernels per input, read the final result. Correct, but still sequential — one GPU at a time.

What you want is the combination: persistent sessions (multi-kernel state per GPU) with burst parallelism (many GPUs simultaneously). Each GPU gets its own session with loaded modules and allocated buffers. N sessions run in parallel. Wall-clock time: one pipeline latency, not N.

What You Build

A coordinator that:

Acquires N GPU leases, one per input (or batch of inputs).
Opens a GpuSession on each lease — isolated CUDA context with persistent device memory.
Per session: loads a module, allocates buffers, uploads input, runs the kernel chain, reads results.
Runs all sessions in parallel.
Drops all leases — contexts, buffers, and modules freed automatically.

Building Blocks

grafos_std::gpu::{GpuBuilder, GpuLease, GpuSession, GpuMemHandle, GpuModule} — GPU session API — source
fabricbios_core::gpu_session wire types (ops 0x0601–0x0607) — source
grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore, WorkChunk, ChunkId} — burst orchestration — source
grafos_leasekit::{RenewalManager, RenewalPolicy} — lease renewal — source

Recipe 30: Persistent GPU Session — single session walkthrough
Recipe 32: 1000 GPUs for One Second — stateless burst
Recipe 16: Pop-Up Supercomputer — CPU burst pattern

Design

Sessions + Burst = Parallel Pipelines

Recipe 30 shows how a session keeps device memory and modules alive across kernel launches — no re-uploading weights or recompiling PTX between steps. Recipe 32 shows how to fan out across many GPUs in parallel — but with stateless single-shot dispatch.

This recipe combines both: each GPU gets a session (state persists across kernels), and many sessions run concurrently (burst parallelism). The execution pattern per session is identical to Recipe 30; the fan-out pattern is identical to Recipe 32. The combination is more than either alone.

Session Lifecycle Per GPU

Lease acquired → Session opened → Module loaded → Buffers allocated
    → Upload input
    → Launch kernel 1 (async)
    → Launch kernel 2 (async — reads kernel 1's output from same buffer)
    → Launch kernel 3 (async)
    → Sync (wait for all three)
    → Read output
    → Drop lease (context, buffers, module freed)

Intermediate buffers (kernel 1 output = kernel 2 input) stay on the device. No host round-trip between kernels. This is the key advantage over chaining GPU_SUBMIT calls.

Contrast with Recipe 32

Recipe 32 uses stateless GPU_SUBMIT — one kernel per GPU, context created and destroyed per call. That works when each evaluation is a single kernel with no shared state.

This recipe uses sessions — the CUDA context and device memory persist across multiple kernel launches within the same session. Use this when your workload is a pipeline of dependent kernels operating on shared device buffers.

Walkthrough (Implementation Sketch)

1. Write the Pipeline Kernels

// pipeline.cu — three-stage video processing pipeline
extern "C" __global__ void decode(unsigned char* raw, float* decoded, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) decoded[i] = (float)raw[i] / 255.0f;
}

extern "C" __global__ void denoise(float* data, float* output, int n, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        output[i] = (v > threshold) ? v : 0.0f;
    }
}

extern "C" __global__ void encode(float* input, unsigned char* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) output[i] = (unsigned char)(input[i] * 255.0f);
}

Compile once: nvcc --ptx pipeline.cu -o pipeline.ptx

2. Define Work Chunks

use grafos_jobs::{WorkChunk, ChunkId};

#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct VideoSegment {
    segment_id: u64,
    raw_data: Vec<u8>,
}

impl WorkChunk for VideoSegment {
    fn chunk_id(&self) -> ChunkId { ChunkId(self.segment_id) }
    fn to_bytes(&self) -> Vec<u8> {
        postcard::to_allocvec(self).unwrap()
    }
    fn from_bytes(bytes: &[u8]) -> grafos_std::Result<Self> {
        postcard::from_bytes(bytes).map_err(|_| grafos_std::FabricError::IoError(-200))
    }
}

3. Acquire GPU Leases

use grafos_std::gpu::GpuBuilder;
use grafos_leasekit::{RenewalManager, RenewalPolicy};

let segment_count = 100;
let vram_per_segment: u64 = 256 * 1024 * 1024; // 256 MiB for buffers
let lease_ttl: u32 = 120;

let mut leases = Vec::new();
let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy::default();

for _ in 0..segment_count {
    match GpuBuilder::new().min_vram(vram_per_segment).lease_secs(lease_ttl).acquire() {
        Ok(lease) => {
            renewal_mgr.register(
                lease.lease_id() as u64,
                lease.expires_at_unix_secs(),
                policy,
            );
            leases.push(lease);
        }
        Err(_) => break,
    }
}

4. Run Pipeline Per Session

use grafos_std::gpu::{GpuSession, GpuMemHandle, GpuModule};

fn run_pipeline(
    lease: &grafos_std::gpu::GpuLease,
    ptx: &[u8],
    raw_input: &[u8],
) -> grafos_std::Result<Vec<u8>> {
    let n = raw_input.len();
    let mut session = GpuSession::new(lease);

    // Load module once — persists for the session
    let module = session.module_load(ptx)?;

    // Allocate three buffers: raw (u8), decoded/denoised (f32), output (u8)
    let buf_raw = session.mem_alloc(n as u64)?;
    let buf_float_a = session.mem_alloc(n as u64 * 4)?;
    let buf_float_b = session.mem_alloc(n as u64 * 4)?;
    let buf_output = session.mem_alloc(n as u64)?;

    // Upload input
    session.mem_write(&buf_raw, 0, raw_input)?;

    let grid = [((n as u32 + 255) / 256), 1, 1];
    let block = [256u32, 1, 1];
    let n_val = n as u32;

    // Stage 1: decode (raw u8 → float)
    let mut args = Vec::new();
    let arg_sizes = [8u32, 8, 4];
    args.extend_from_slice(&buf_raw.0.to_ne_bytes());
    args.extend_from_slice(&buf_float_a.0.to_ne_bytes());
    args.extend_from_slice(&n_val.to_ne_bytes());
    session.launch(&module, "decode", grid, block, &args, &arg_sizes)?;

    // Stage 2: denoise (float → float, same-stream ordering)
    args.clear();
    let arg_sizes_2 = [8u32, 8, 4, 4];
    args.extend_from_slice(&buf_float_a.0.to_ne_bytes());
    args.extend_from_slice(&buf_float_b.0.to_ne_bytes());
    args.extend_from_slice(&n_val.to_ne_bytes());
    args.extend_from_slice(&0.1f32.to_ne_bytes()); // threshold
    session.launch(&module, "denoise", grid, block, &args, &arg_sizes_2)?;

    // Stage 3: encode (float → u8)
    args.clear();
    let arg_sizes_3 = [8u32, 8, 4];
    args.extend_from_slice(&buf_float_b.0.to_ne_bytes());
    args.extend_from_slice(&buf_output.0.to_ne_bytes());
    args.extend_from_slice(&n_val.to_ne_bytes());
    session.launch(&module, "encode", grid, block, &args, &arg_sizes_3)?;

    // Sync — wait for all three stages
    session.sync()?;

    // Read final output
    let output = session.mem_read(&buf_output, 0, n as u32)?;

    // Cleanup (optional — lease drop does this)
    session.mem_free(buf_raw)?;
    session.mem_free(buf_float_a)?;
    session.mem_free(buf_float_b)?;
    session.mem_free(buf_output)?;

    Ok(output)
}

5. Fan Out with JobCoordinator

use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};

let ptx = include_bytes!("../vectors/gpu/pipeline.ptx");
let mut lease_idx = 0;

let mut output_store = MemoryOutputStore::new();
let mut coord = JobCoordinator::new(RetryPolicy {
    max_retries: 3,
    initial_backoff_secs: 1,
    max_backoff_secs: 16,
});

let result = coord.run(
    &segments,
    &mut output_store,
    |chunk_bytes| {
        let segment: VideoSegment = postcard::from_bytes(chunk_bytes)
            .map_err(|_| grafos_std::FabricError::IoError(-200))?;

        let lease = &leases[lease_idx % leases.len()];
        lease_idx += 1;

        run_pipeline(lease, ptx, &segment.raw_data)
    },
    |outputs| {
        // Collect all processed segments
        let total_bytes: usize = outputs.iter().map(|(_, v)| v.len()).sum();
        postcard::to_allocvec(&total_bytes).unwrap()
    },
)?;

6. Drop Everything

drop(leases);  // All sessions, contexts, buffers, modules freed.
coord.teardown(&mut output_store);

Failure Modes

LeaseExpired: Session context destroyed. Transient — retry the whole pipeline for that segment on a new lease. Since each session is independent, other sessions are unaffected.
ALLOC_FAILED: Device out of memory within the lease. Reduce buffer sizes or request a larger min_vram.
LOAD_FAILED: PTX compilation failed. Architecture mismatch.
LAUNCH_FAILED: Too many threads, bad kernel name, or argument mismatch.
SYNC_FAILED: An async launch had a runtime error (out-of-bounds access, etc.). The error surfaces at sync time, not launch time.
INVALID_HANDLE: Memory handle from a different session or a freed allocation.

Observability

gpu_session_launch_total — kernel launches per session (should be 3 per segment)
gpu_session_mem_alloc_bytes — device memory per session
gpu_session_sync_total / gpu_session_sync_errors — pipeline completions
gpu_leases_active — should match segment count, then drop to 0
chunks_done / chunks_retry_total — overall job progress
Per-segment latency histogram — dominated by the longest kernel in the chain

Variations

Inference pipeline: Load model weights once per session (module_load + mem_write), then run forward-pass kernels. Weights stay resident — no re-upload per input.
Image processing: Resize → color-correct → sharpen. Three kernels, shared buffers.
Scientific simulation: Initialize → iterate → extract. The iteration kernel runs many times on the same buffers; sync only when checking convergence.
Mixed pipelines: Some stages on GPU (compute-heavy), some on CPU (I/O-heavy). Read intermediate results from the session between stages.
Heterogeneous GPUs: Different segments may land on different GPU architectures. PTX JIT-compilation handles this — the same binary runs on sm_80 and sm_90.

Testing

cargo test -p fabricbios-core -- gpu_session  # wire format roundtrips
cargo test -p grafos-jobs -- coordinator     # retry and aggregation
grafos deploy run --requires gpu --tasklet parallel-gpu-sessions --json