Skip to content

Recipe 34: Parallel GPU Sessions for Multi-Kernel Burst

Situation

You have a multi-step GPU pipeline — decode, denoise, encode — and 100 inputs to process. Each input needs all three steps on the same GPU, because intermediate buffers live in device memory between kernels.

Your options today:

  • Sequential on one GPU: Process each input serially. Safe, but slow — 100 × pipeline latency.
  • Single-shot dispatch: Stateless GPU_SUBMIT per step. But then intermediate results must be copied back to the host between steps, sent to the GPU again, copied back again. Three round-trips per input instead of one.
  • Session on one GPU: Open a persistent session, run all three kernels per input, read the final result. Correct, but still sequential — one GPU at a time.

What you want is the combination: persistent sessions (multi-kernel state per GPU) with burst parallelism (many GPUs simultaneously). Each GPU gets its own session with loaded modules and allocated buffers. N sessions run in parallel. Wall-clock time: one pipeline latency, not N.

What You Build

A coordinator that:

  • Acquires N GPU leases, one per input (or batch of inputs).
  • Opens a GpuSession on each lease — isolated CUDA context with persistent device memory.
  • Per session: loads a module, allocates buffers, uploads input, runs the kernel chain, reads results.
  • Runs all sessions in parallel.
  • Drops all leases — contexts, buffers, and modules freed automatically.

Building Blocks

  • grafos_std::gpu::{GpuBuilder, GpuLease, GpuSession, GpuMemHandle, GpuModule} — GPU session API — source
  • fabricbios_core::gpu_session wire types (ops 0x0601–0x0607) — source
  • grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore, WorkChunk, ChunkId} — burst orchestration — source
  • grafos_leasekit::{RenewalManager, RenewalPolicy} — lease renewal — source

Related:

Design

Sessions + Burst = Parallel Pipelines

Recipe 30 shows how a session keeps device memory and modules alive across kernel launches — no re-uploading weights or recompiling PTX between steps. Recipe 32 shows how to fan out across many GPUs in parallel — but with stateless single-shot dispatch.

This recipe combines both: each GPU gets a session (state persists across kernels), and many sessions run concurrently (burst parallelism). The execution pattern per session is identical to Recipe 30; the fan-out pattern is identical to Recipe 32. The combination is more than either alone.

Session Lifecycle Per GPU

Lease acquired → Session opened → Module loaded → Buffers allocated
→ Upload input
→ Launch kernel 1 (async)
→ Launch kernel 2 (async — reads kernel 1's output from same buffer)
→ Launch kernel 3 (async)
→ Sync (wait for all three)
→ Read output
→ Drop lease (context, buffers, module freed)

Intermediate buffers (kernel 1 output = kernel 2 input) stay on the device. No host round-trip between kernels. This is the key advantage over chaining GPU_SUBMIT calls.

Contrast with Recipe 32

Recipe 32 uses stateless GPU_SUBMIT — one kernel per GPU, context created and destroyed per call. That works when each evaluation is a single kernel with no shared state.

This recipe uses sessions — the CUDA context and device memory persist across multiple kernel launches within the same session. Use this when your workload is a pipeline of dependent kernels operating on shared device buffers.

Walkthrough (Implementation Sketch)

1. Write the Pipeline Kernels

// pipeline.cu — three-stage video processing pipeline
extern "C" __global__ void decode(unsigned char* raw, float* decoded, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) decoded[i] = (float)raw[i] / 255.0f;
}
extern "C" __global__ void denoise(float* data, float* output, int n, float threshold) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
float v = data[i];
output[i] = (v > threshold) ? v : 0.0f;
}
}
extern "C" __global__ void encode(float* input, unsigned char* output, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) output[i] = (unsigned char)(input[i] * 255.0f);
}

Compile once: nvcc --ptx pipeline.cu -o pipeline.ptx

2. Define Work Chunks

use grafos_jobs::{WorkChunk, ChunkId};
#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct VideoSegment {
segment_id: u64,
raw_data: Vec<u8>,
}
impl WorkChunk for VideoSegment {
fn chunk_id(&self) -> ChunkId { ChunkId(self.segment_id) }
fn to_bytes(&self) -> Vec<u8> {
postcard::to_allocvec(self).unwrap()
}
fn from_bytes(bytes: &[u8]) -> grafos_std::Result<Self> {
postcard::from_bytes(bytes).map_err(|_| grafos_std::FabricError::IoError(-200))
}
}

3. Acquire GPU Leases

use grafos_std::gpu::GpuBuilder;
use grafos_leasekit::{RenewalManager, RenewalPolicy};
let segment_count = 100;
let vram_per_segment: u64 = 256 * 1024 * 1024; // 256 MiB for buffers
let lease_ttl: u32 = 120;
let mut leases = Vec::new();
let mut renewal_mgr = RenewalManager::new();
let policy = RenewalPolicy::default();
for _ in 0..segment_count {
match GpuBuilder::new().min_vram(vram_per_segment).lease_secs(lease_ttl).acquire() {
Ok(lease) => {
renewal_mgr.register(
lease.lease_id() as u64,
lease.expires_at_unix_secs(),
policy,
);
leases.push(lease);
}
Err(_) => break,
}
}

4. Run Pipeline Per Session

use grafos_std::gpu::{GpuSession, GpuMemHandle, GpuModule};
fn run_pipeline(
lease: &grafos_std::gpu::GpuLease,
ptx: &[u8],
raw_input: &[u8],
) -> grafos_std::Result<Vec<u8>> {
let n = raw_input.len();
let mut session = GpuSession::new(lease);
// Load module once — persists for the session
let module = session.module_load(ptx)?;
// Allocate three buffers: raw (u8), decoded/denoised (f32), output (u8)
let buf_raw = session.mem_alloc(n as u64)?;
let buf_float_a = session.mem_alloc(n as u64 * 4)?;
let buf_float_b = session.mem_alloc(n as u64 * 4)?;
let buf_output = session.mem_alloc(n as u64)?;
// Upload input
session.mem_write(&buf_raw, 0, raw_input)?;
let grid = [((n as u32 + 255) / 256), 1, 1];
let block = [256u32, 1, 1];
let n_val = n as u32;
// Stage 1: decode (raw u8 → float)
let mut args = Vec::new();
let arg_sizes = [8u32, 8, 4];
args.extend_from_slice(&buf_raw.0.to_ne_bytes());
args.extend_from_slice(&buf_float_a.0.to_ne_bytes());
args.extend_from_slice(&n_val.to_ne_bytes());
session.launch(&module, "decode", grid, block, &args, &arg_sizes)?;
// Stage 2: denoise (float → float, same-stream ordering)
args.clear();
let arg_sizes_2 = [8u32, 8, 4, 4];
args.extend_from_slice(&buf_float_a.0.to_ne_bytes());
args.extend_from_slice(&buf_float_b.0.to_ne_bytes());
args.extend_from_slice(&n_val.to_ne_bytes());
args.extend_from_slice(&0.1f32.to_ne_bytes()); // threshold
session.launch(&module, "denoise", grid, block, &args, &arg_sizes_2)?;
// Stage 3: encode (float → u8)
args.clear();
let arg_sizes_3 = [8u32, 8, 4];
args.extend_from_slice(&buf_float_b.0.to_ne_bytes());
args.extend_from_slice(&buf_output.0.to_ne_bytes());
args.extend_from_slice(&n_val.to_ne_bytes());
session.launch(&module, "encode", grid, block, &args, &arg_sizes_3)?;
// Sync — wait for all three stages
session.sync()?;
// Read final output
let output = session.mem_read(&buf_output, 0, n as u32)?;
// Cleanup (optional — lease drop does this)
session.mem_free(buf_raw)?;
session.mem_free(buf_float_a)?;
session.mem_free(buf_float_b)?;
session.mem_free(buf_output)?;
Ok(output)
}

5. Fan Out with JobCoordinator

use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};
let ptx = include_bytes!("../vectors/gpu/pipeline.ptx");
let mut lease_idx = 0;
let mut output_store = MemoryOutputStore::new();
let mut coord = JobCoordinator::new(RetryPolicy {
max_retries: 3,
initial_backoff_secs: 1,
max_backoff_secs: 16,
});
let result = coord.run(
&segments,
&mut output_store,
|chunk_bytes| {
let segment: VideoSegment = postcard::from_bytes(chunk_bytes)
.map_err(|_| grafos_std::FabricError::IoError(-200))?;
let lease = &leases[lease_idx % leases.len()];
lease_idx += 1;
run_pipeline(lease, ptx, &segment.raw_data)
},
|outputs| {
// Collect all processed segments
let total_bytes: usize = outputs.iter().map(|(_, v)| v.len()).sum();
postcard::to_allocvec(&total_bytes).unwrap()
},
)?;

6. Drop Everything

drop(leases); // All sessions, contexts, buffers, modules freed.
coord.teardown(&mut output_store);

Failure Modes

  • LeaseExpired: Session context destroyed. Transient — retry the whole pipeline for that segment on a new lease. Since each session is independent, other sessions are unaffected.
  • ALLOC_FAILED: Device out of memory within the lease. Reduce buffer sizes or request a larger min_vram.
  • LOAD_FAILED: PTX compilation failed. Architecture mismatch.
  • LAUNCH_FAILED: Too many threads, bad kernel name, or argument mismatch.
  • SYNC_FAILED: An async launch had a runtime error (out-of-bounds access, etc.). The error surfaces at sync time, not launch time.
  • INVALID_HANDLE: Memory handle from a different session or a freed allocation.

Observability

  • gpu_session_launch_total — kernel launches per session (should be 3 per segment)
  • gpu_session_mem_alloc_bytes — device memory per session
  • gpu_session_sync_total / gpu_session_sync_errors — pipeline completions
  • gpu_leases_active — should match segment count, then drop to 0
  • chunks_done / chunks_retry_total — overall job progress
  • Per-segment latency histogram — dominated by the longest kernel in the chain

Variations

  • Inference pipeline: Load model weights once per session (module_load + mem_write), then run forward-pass kernels. Weights stay resident — no re-upload per input.
  • Image processing: Resize → color-correct → sharpen. Three kernels, shared buffers.
  • Scientific simulation: Initialize → iterate → extract. The iteration kernel runs many times on the same buffers; sync only when checking convergence.
  • Mixed pipelines: Some stages on GPU (compute-heavy), some on CPU (I/O-heavy). Read intermediate results from the session between stages.
  • Heterogeneous GPUs: Different segments may land on different GPU architectures. PTX JIT-compilation handles this — the same binary runs on sm_80 and sm_90.

Testing

Terminal window
cargo test -p fabricbios-core -- gpu_session # wire format roundtrips
cargo test -p grafos-jobs -- coordinator # retry and aggregation
grafos deploy run --requires gpu --tasklet parallel-gpu-sessions --json