Skip to content

Recipe 30: Leasing a Slice of a GPU for a Multi-Kernel Workload

Situation

You need GPU compute for a workload that involves multiple kernels operating on shared data — inference, simulation, an image pipeline. In a conventional setup:

  • The smallest unit you can rent is an entire GPU (20 GB, 40 GB, 80 GB).
  • Your workload needs 7 GB of VRAM. You’re paying for the other 73 GB to sit idle.
  • If you want to share the card across tenants, you’re managing MIG, MPS, or containers yourself.

In the fabricBIOS model, GPU compute is a leased resource. You request exactly the VRAM you need, for exactly as long as you need it. Multiple tenants share the same physical card, each isolated in their own session. When your lease expires, your slice is returned to the pool — no cleanup, no orphaned processes.

This recipe shows the persistent session pattern: lease a VRAM slice, load your data and kernels into it, run as many kernel launches as you need, and give it back. The node handles the CUDA context, driver, and teardown. You just submit work.

What You Build

A multi-kernel pipeline where:

  • You lease a precise VRAM slice — not a whole card.
  • Device memory and loaded modules persist across kernel launches within the lease.
  • Kernel launches are asynchronous — you batch them and sync once.
  • Lease expiry or drop cleans up everything (context, memory, modules) automatically.

Building Blocks

  • grafos_std::gpu::{GpuBuilder, GpuSession, GpuMemHandle, GpuModule}
  • fabricbios-core::gpu_session wire types (ops 0x0601–0x0607)

See:

Design

What You’re Leasing

A GPU lease gives you a slice of a physical GPU’s VRAM. The node that owns the card manages the CUDA driver; you never install CUDA, manage contexts, or worry about other tenants. Your session is an isolated CUDA context on the node — your memory allocations and loaded modules are invisible to other sessions on the same card.

The key idea: GPU capacity is a pool of VRAM slices, not a pool of whole cards. A 48 GB card can serve six 7 GB leases simultaneously, each running independent workloads.

Session Lifecycle

When you acquire a GPU lease, the node creates a persistent CUDA context scoped to your lease. GpuSession is a lightweight client handle — session ID equals lease ID. You interact with it through six operations:

  • mem_alloc / mem_free — allocate and free device memory within your slice
  • mem_write / mem_read — copy data between host and device
  • module_load — JIT-compile and load PTX (stays loaded for the session)
  • launch — launch a kernel (async — returns immediately)
  • sync — wait for all outstanding launches to complete

When the lease expires or is dropped, the node destroys the context — all memory and modules are freed.

When to Use Sessions vs GPU_SUBMIT

GPU_SUBMIT (Recipe 29) is a single-shot dispatch: send a kernel, get output, done. The node creates and destroys a context per call. Good for stateless one-off kernels.

Sessions are for workloads where state accumulates across kernels: model weights that stay resident, buffers that are written by one kernel and read by the next, modules that are loaded once and launched many times. Both coexist on the same lease.

Walkthrough (Implementation Sketch)

1. Two Kernels That Share a Buffer

transform.cu
extern "C" __global__ void scale(float* data, int n, float factor) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] *= factor;
}
extern "C" __global__ void bias(float* data, int n, float b) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] += b;
}

Compile: nvcc --ptx transform.cu -o transform.ptx

2. Acquire a Lease and Open a Session

use grafos_std::gpu::{GpuBuilder, GpuSession};
let lease = GpuBuilder::new()
.min_vram(64 * 1024 * 1024)
.lease_secs(120)
.acquire()?;
let mut session = GpuSession::new(&lease);

3. Upload Data Once

let n: u32 = 1024;
let input: Vec<f32> = (0..n).map(|i| i as f32).collect();
let input_bytes: Vec<u8> = input.iter().flat_map(|f| f.to_ne_bytes()).collect();
let buf = session.mem_alloc(n as u64 * 4)?;
session.mem_write(&buf, 0, &input_bytes)?;

This buffer persists. No re-upload between kernels.

4. Load the Module Once

let ptx = include_bytes!("../vectors/gpu/transform.ptx");
let module = session.module_load(ptx)?;

5. Launch Kernels, Then Sync

let grid = [((n + 255) / 256), 1, 1];
let block = [256, 1, 1];
// Build args: (float* data, int n, float param)
let ptr_bytes = buf.0.to_ne_bytes();
let n_bytes = n.to_ne_bytes();
let arg_sizes = [8u32, 4, 4];
// scale(data, n, 2.0) — async
let mut args = Vec::new();
args.extend_from_slice(&ptr_bytes);
args.extend_from_slice(&n_bytes);
args.extend_from_slice(&2.0f32.to_ne_bytes());
session.launch(&module, "scale", grid, block, &args, &arg_sizes)?;
// bias(data, n, 10.0) — async, runs after scale (same default stream)
args.clear();
args.extend_from_slice(&ptr_bytes);
args.extend_from_slice(&n_bytes);
args.extend_from_slice(&10.0f32.to_ne_bytes());
session.launch(&module, "bias", grid, block, &args, &arg_sizes)?;
// Wait for both to complete
session.sync()?;

6. Read Results

let result_bytes = session.mem_read(&buf, 0, n * 4)?;
let results: Vec<f32> = result_bytes
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
// Each element: (i * 2.0) + 10.0
assert_eq!(results[0], 10.0);
assert_eq!(results[1], 12.0);
assert_eq!(results[100], 210.0);

7. Cleanup

Drop the lease. The node destroys the context — all device memory and modules are freed.

session.mem_free(buf)?; // Optional — lease drop does this anyway.

Failure Modes

  • LeaseExpired: the context and all its resources are gone. Call lease.renew() before expiry for long-running workloads.
  • ALLOC_FAILED: device out of memory. Free unused allocations or request a smaller lease.
  • INVALID_HANDLE: memory handle doesn’t belong to this session. Handles are scoped per-session — you can’t reference another tenant’s memory.
  • LOAD_FAILED: PTX compilation failed. Architecture mismatch (compiled for sm_90, GPU is sm_80). Recompile without -arch and let the driver pick.
  • LAUNCH_FAILED: too many threads, bad kernel name, or argument mismatch.
  • SYNC_FAILED: a prior async launch had an error. Check kernel logic and memory bounds.

Observability

Track:

  • gpu_session_mem_alloc_bytes — device memory per session
  • gpu_session_launch_total — kernel launch count
  • gpu_session_sync_total / gpu_session_sync_errors
  • gpu_lease_acquired / gpu_lease_dropped

Variations

  • Inference server: lease 7 GB for model weights, serve many requests. Each request allocates a small input buffer, runs the kernel chain, reads logits, frees the input buffer. Weights stay resident for the lease duration — clients pay for 7 GB, not 80 GB.
  • Burst compute: lease a GPU slice for 60 seconds, run a batch of simulations, drop the lease. The VRAM returns to the pool immediately for other tenants.
  • Iterative solver: allocate two buffers (current/next), ping-pong between them with a step kernel. Sync every N iterations to check convergence.
  • Multi-module pipeline: load separate PTX modules for each stage (decode, process, encode). All share the same device memory within the lease.
  • Right-sizing: start with a small lease, monitor VRAM usage, adjust min_vram on the next lease. No over-provisioning needed.

Testing

Use a GPU-capable dev cell for the session lifecycle. Wire-format tests remain useful, but cookbook validation should exercise a real leased GPU session:

Terminal window
cargo test -p fabricbios-core -- gpu_session # wire format roundtrips
cargo test -p fabricbiosd --bin fabricbiosd -- gpu_session # dispatch
grafos deploy run --requires gpu --tasklet persistent-gpu-session --json