Recipe 34: Parallel GPU Sessions for Multi-Kernel Burst
Situation
You have a multi-step GPU pipeline — decode, denoise, encode — and 100 inputs to process. Each input needs all three steps on the same GPU, because intermediate buffers live in device memory between kernels.
Your options today:
- Sequential on one GPU: Process each input serially. Safe, but slow — 100 × pipeline latency.
- Single-shot dispatch: Stateless
GPU_SUBMITper step. But then intermediate results must be copied back to the host between steps, sent to the GPU again, copied back again. Three round-trips per input instead of one. - Session on one GPU: Open a persistent session, run all three kernels per input, read the final result. Correct, but still sequential — one GPU at a time.
What you want is the combination: persistent sessions (multi-kernel state per GPU) with burst parallelism (many GPUs simultaneously). Each GPU gets its own session with loaded modules and allocated buffers. N sessions run in parallel. Wall-clock time: one pipeline latency, not N.
What You Build
A coordinator that:
- Acquires N GPU leases, one per input (or batch of inputs).
- Opens a
GpuSessionon each lease — isolated CUDA context with persistent device memory. - Per session: loads a module, allocates buffers, uploads input, runs the kernel chain, reads results.
- Runs all sessions in parallel.
- Drops all leases — contexts, buffers, and modules freed automatically.
Building Blocks
grafos_std::gpu::{GpuBuilder, GpuLease, GpuSession, GpuMemHandle, GpuModule}— GPU session API — sourcefabricbios_core::gpu_sessionwire types (ops 0x0601–0x0607) — sourcegrafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore, WorkChunk, ChunkId}— burst orchestration — sourcegrafos_leasekit::{RenewalManager, RenewalPolicy}— lease renewal — source
Related:
- Recipe 30: Persistent GPU Session — single session walkthrough
- Recipe 32: 1000 GPUs for One Second — stateless burst
- Recipe 16: Pop-Up Supercomputer — CPU burst pattern
Design
Sessions + Burst = Parallel Pipelines
Recipe 30 shows how a session keeps device memory and modules alive across kernel launches — no re-uploading weights or recompiling PTX between steps. Recipe 32 shows how to fan out across many GPUs in parallel — but with stateless single-shot dispatch.
This recipe combines both: each GPU gets a session (state persists across kernels), and many sessions run concurrently (burst parallelism). The execution pattern per session is identical to Recipe 30; the fan-out pattern is identical to Recipe 32. The combination is more than either alone.
Session Lifecycle Per GPU
Lease acquired → Session opened → Module loaded → Buffers allocated → Upload input → Launch kernel 1 (async) → Launch kernel 2 (async — reads kernel 1's output from same buffer) → Launch kernel 3 (async) → Sync (wait for all three) → Read output → Drop lease (context, buffers, module freed)Intermediate buffers (kernel 1 output = kernel 2 input) stay on the device. No host round-trip
between kernels. This is the key advantage over chaining GPU_SUBMIT calls.
Contrast with Recipe 32
Recipe 32 uses stateless GPU_SUBMIT — one kernel per GPU, context created and destroyed per call.
That works when each evaluation is a single kernel with no shared state.
This recipe uses sessions — the CUDA context and device memory persist across multiple kernel launches within the same session. Use this when your workload is a pipeline of dependent kernels operating on shared device buffers.
Walkthrough (Implementation Sketch)
1. Write the Pipeline Kernels
// pipeline.cu — three-stage video processing pipelineextern "C" __global__ void decode(unsigned char* raw, float* decoded, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) decoded[i] = (float)raw[i] / 255.0f;}
extern "C" __global__ void denoise(float* data, float* output, int n, float threshold) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { float v = data[i]; output[i] = (v > threshold) ? v : 0.0f; }}
extern "C" __global__ void encode(float* input, unsigned char* output, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) output[i] = (unsigned char)(input[i] * 255.0f);}Compile once: nvcc --ptx pipeline.cu -o pipeline.ptx
2. Define Work Chunks
use grafos_jobs::{WorkChunk, ChunkId};
#[derive(Clone, serde::Serialize, serde::Deserialize)]struct VideoSegment { segment_id: u64, raw_data: Vec<u8>,}
impl WorkChunk for VideoSegment { fn chunk_id(&self) -> ChunkId { ChunkId(self.segment_id) } fn to_bytes(&self) -> Vec<u8> { postcard::to_allocvec(self).unwrap() } fn from_bytes(bytes: &[u8]) -> grafos_std::Result<Self> { postcard::from_bytes(bytes).map_err(|_| grafos_std::FabricError::IoError(-200)) }}3. Acquire GPU Leases
use grafos_std::gpu::GpuBuilder;use grafos_leasekit::{RenewalManager, RenewalPolicy};
let segment_count = 100;let vram_per_segment: u64 = 256 * 1024 * 1024; // 256 MiB for bufferslet lease_ttl: u32 = 120;
let mut leases = Vec::new();let mut renewal_mgr = RenewalManager::new();let policy = RenewalPolicy::default();
for _ in 0..segment_count { match GpuBuilder::new().min_vram(vram_per_segment).lease_secs(lease_ttl).acquire() { Ok(lease) => { renewal_mgr.register( lease.lease_id() as u64, lease.expires_at_unix_secs(), policy, ); leases.push(lease); } Err(_) => break, }}4. Run Pipeline Per Session
use grafos_std::gpu::{GpuSession, GpuMemHandle, GpuModule};
fn run_pipeline( lease: &grafos_std::gpu::GpuLease, ptx: &[u8], raw_input: &[u8],) -> grafos_std::Result<Vec<u8>> { let n = raw_input.len(); let mut session = GpuSession::new(lease);
// Load module once — persists for the session let module = session.module_load(ptx)?;
// Allocate three buffers: raw (u8), decoded/denoised (f32), output (u8) let buf_raw = session.mem_alloc(n as u64)?; let buf_float_a = session.mem_alloc(n as u64 * 4)?; let buf_float_b = session.mem_alloc(n as u64 * 4)?; let buf_output = session.mem_alloc(n as u64)?;
// Upload input session.mem_write(&buf_raw, 0, raw_input)?;
let grid = [((n as u32 + 255) / 256), 1, 1]; let block = [256u32, 1, 1]; let n_val = n as u32;
// Stage 1: decode (raw u8 → float) let mut args = Vec::new(); let arg_sizes = [8u32, 8, 4]; args.extend_from_slice(&buf_raw.0.to_ne_bytes()); args.extend_from_slice(&buf_float_a.0.to_ne_bytes()); args.extend_from_slice(&n_val.to_ne_bytes()); session.launch(&module, "decode", grid, block, &args, &arg_sizes)?;
// Stage 2: denoise (float → float, same-stream ordering) args.clear(); let arg_sizes_2 = [8u32, 8, 4, 4]; args.extend_from_slice(&buf_float_a.0.to_ne_bytes()); args.extend_from_slice(&buf_float_b.0.to_ne_bytes()); args.extend_from_slice(&n_val.to_ne_bytes()); args.extend_from_slice(&0.1f32.to_ne_bytes()); // threshold session.launch(&module, "denoise", grid, block, &args, &arg_sizes_2)?;
// Stage 3: encode (float → u8) args.clear(); let arg_sizes_3 = [8u32, 8, 4]; args.extend_from_slice(&buf_float_b.0.to_ne_bytes()); args.extend_from_slice(&buf_output.0.to_ne_bytes()); args.extend_from_slice(&n_val.to_ne_bytes()); session.launch(&module, "encode", grid, block, &args, &arg_sizes_3)?;
// Sync — wait for all three stages session.sync()?;
// Read final output let output = session.mem_read(&buf_output, 0, n as u32)?;
// Cleanup (optional — lease drop does this) session.mem_free(buf_raw)?; session.mem_free(buf_float_a)?; session.mem_free(buf_float_b)?; session.mem_free(buf_output)?;
Ok(output)}5. Fan Out with JobCoordinator
use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};
let ptx = include_bytes!("../vectors/gpu/pipeline.ptx");let mut lease_idx = 0;
let mut output_store = MemoryOutputStore::new();let mut coord = JobCoordinator::new(RetryPolicy { max_retries: 3, initial_backoff_secs: 1, max_backoff_secs: 16,});
let result = coord.run( &segments, &mut output_store, |chunk_bytes| { let segment: VideoSegment = postcard::from_bytes(chunk_bytes) .map_err(|_| grafos_std::FabricError::IoError(-200))?;
let lease = &leases[lease_idx % leases.len()]; lease_idx += 1;
run_pipeline(lease, ptx, &segment.raw_data) }, |outputs| { // Collect all processed segments let total_bytes: usize = outputs.iter().map(|(_, v)| v.len()).sum(); postcard::to_allocvec(&total_bytes).unwrap() },)?;6. Drop Everything
drop(leases); // All sessions, contexts, buffers, modules freed.coord.teardown(&mut output_store);Failure Modes
LeaseExpired: Session context destroyed. Transient — retry the whole pipeline for that segment on a new lease. Since each session is independent, other sessions are unaffected.ALLOC_FAILED: Device out of memory within the lease. Reduce buffer sizes or request a largermin_vram.LOAD_FAILED: PTX compilation failed. Architecture mismatch.LAUNCH_FAILED: Too many threads, bad kernel name, or argument mismatch.SYNC_FAILED: An async launch had a runtime error (out-of-bounds access, etc.). The error surfaces at sync time, not launch time.INVALID_HANDLE: Memory handle from a different session or a freed allocation.
Observability
gpu_session_launch_total— kernel launches per session (should be 3 per segment)gpu_session_mem_alloc_bytes— device memory per sessiongpu_session_sync_total/gpu_session_sync_errors— pipeline completionsgpu_leases_active— should match segment count, then drop to 0chunks_done/chunks_retry_total— overall job progress- Per-segment latency histogram — dominated by the longest kernel in the chain
Variations
- Inference pipeline: Load model weights once per session (
module_load+mem_write), then run forward-pass kernels. Weights stay resident — no re-upload per input. - Image processing: Resize → color-correct → sharpen. Three kernels, shared buffers.
- Scientific simulation: Initialize → iterate → extract. The iteration kernel runs many times on the same buffers; sync only when checking convergence.
- Mixed pipelines: Some stages on GPU (compute-heavy), some on CPU (I/O-heavy). Read intermediate results from the session between stages.
- Heterogeneous GPUs: Different segments may land on different GPU architectures. PTX JIT-compilation handles this — the same binary runs on sm_80 and sm_90.
Testing
cargo test -p fabricbios-core -- gpu_session # wire format roundtripscargo test -p grafos-jobs -- coordinator # retry and aggregationgrafos deploy run --requires gpu --tasklet parallel-gpu-sessions --json