Recipe 29: Running a CUDA Kernel on a Leased GPU

HISTORICAL — API UPDATED. This recipe was written against the v0 gpu_submit SDK surface (FabricGpu::submit(...).launch() and GpuSubmitBuilder), which was removed from grafos-std in task #57. The current tasklet-side GPU programming model is the v1 session-op path: GpuBuilder::acquire → GpuSession::new(&lease) → mem_alloc → mem_write → module_load → launch → sync → mem_read → (RAII drop). See Recipe 30: Persistent GPU Sessions for the canonical pattern and docs/grafos-std-guide.md GPU Module section for the full API walkthrough.

The v0 QUIC control-plane GPU_SUBMIT wire format (fabricbios-core::gpu_submit::GpuSubmitRequest) that this recipe describes at the daemon/wire level is still supported as a control-plane op for external clients (e.g. fabricbios-quic-interop CLI usage in the “Using the Interop Client Directly” section below still works). Only the tasklet-side SDK wrapper was removed.

The recipe text below is preserved as historical context for how the v0 API looked and how it mapped to the daemon-side CUDA operations. Do not copy-paste the Rust snippets into new code.

Situation

You have a compute workload — signal processing, simulation, inference — and you need a GPU for a few seconds or minutes. You don’t want to rent a whole card, manage CUDA drivers, or worry about cleanup if your process crashes. You want:

A lease on exactly the VRAM you need, not an entire device.
Submit a compiled kernel as bytes — no CUDA toolkit on the client.
Output read back after completion.
Automatic teardown on lease expiry — no orphaned allocations.

What You Build

A program that:

Discovers a GPU node via fabric inventory.
Acquires a GPU lease (VRAM allocation on the remote device).
Submits a compiled PTX kernel with arguments.
Reads back output from device memory.
Drops the lease — VRAM is freed, no cleanup needed.

Building Blocks

grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu, GpuSubmitBuilder}
fabricbios-core GPU_SUBMIT wire format (opcode 0x0600)
fabricbios-platform-linux CUDA Driver API FFI (cuda_ffi module, gpu-cuda feature)
fabricbios-platform-linux NVML FFI (nvml_ffi module) for VRAM discovery

See:

Design

How the Pieces Fit Together

 Client                          Node (fabricbiosd)
 ──────                          ──────────────────
 GpuBuilder::acquire()
   → LEASE_ALLOC ──QUIC──→  GpuLeaseManager::allocate()
                                   cuMemAlloc_v2() → device ptr
   ← lease_id ◄──────────

 gpu.submit("iota", ptx)
   .grid([1,1,1])
   .block([256,1,1])
   .arg(&n_bytes)
   .max_output(1024)
   .launch()
   → GPU_SUBMIT ──QUIC──→  GpuKernelRunner::load(ptx)
                                   cuModuleLoadData()
                                   cuModuleGetFunction("iota")
                                   cuLaunchKernel(...)
                                   cuCtxSynchronize()
                                   cuMemcpyDtoH(output)
   ← output bytes ◄──────

 drop(lease)
   → MEM_FREE ──QUIC──→    GpuLeaseManager::free()
                                   cuMemFree_v2()

Control Plane vs Data Plane

Control plane (QUIC stream, fabricBIOS wire protocol):

LEASE_ALLOC — acquires a GPU resource, allocates VRAM on the device.
GPU_SUBMIT — sends kernel binary + args, receives output bytes.
MEM_FREE — releases the lease and frees device memory.

Data plane (device memory on the GPU node):

cuMemAlloc_v2 allocates a CUdeviceptr (device-side buffer).
Kernel arguments can include the device pointer so kernels write results into leased memory.
After kernel completion, the daemon copies output_size bytes from output_offset in the region back to the client.

There is no host↔device streaming — the kernel binary and output travel over the QUIC control stream. This keeps the protocol simple: one request, one response, synchronous execution.

Kernel Binary Format

CUDA kernels are compiled to PTX (portable) or cubin (architecture-specific). PTX is preferred because the CUDA driver JIT-compiles it for whatever GPU is present:

nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx   # H100
nvcc --ptx iota.cu -o iota.ptx                # any GPU (driver picks arch)

The PTX text is sent as the binary field of GPU_SUBMIT. The daemon calls cuModuleLoadData() which accepts PTX directly.

Walkthrough (Implementation Sketch)

1. Write the CUDA Kernel

// iota.cu — writes threadIdx.x + 1 to each output slot
extern "C" __global__ void iota(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)(i + 1);
}

Compile: nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx

2. Acquire a GPU Lease

use grafos_std::gpu::GpuBuilder;

let lease = GpuBuilder::new()
    .min_vram(1024 * 1024)   // 1 MiB
    .lease_secs(60)
    .acquire()?;

Under the hood this sends LEASE_ALLOC to a GPU node. The node’s GpuLeaseManager calls cuMemAlloc_v2 to reserve a device memory region.

3. Submit the Kernel

let ptx = include_bytes!("../vectors/gpu/iota.ptx");
let n: u32 = 256;

let result = lease.gpu()
    .submit("iota", ptx)
    .grid([1, 1, 1])
    .block([256, 1, 1])
    .arg(&n.to_ne_bytes())       // kernel arg: int n
    .max_output(256 * 4)         // read back 256 floats
    .launch()?;

assert_eq!(result.exit_code, 0);

The daemon receives GPU_SUBMIT, loads PTX via cuModuleLoadData, gets the iota function via cuModuleGetFunction, injects the leased region’s device pointer as the first argument (float* out), launches the kernel, synchronizes, and copies output_size bytes back from device memory.

4. Read the Output

let floats: Vec<f32> = result.output
    .chunks_exact(4)
    .map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
    .collect();

// floats[0] = 1.0, floats[1] = 2.0, ..., floats[255] = 256.0
for (i, &v) in floats.iter().enumerate() {
    assert_eq!(v, (i + 1) as f32);
}

5. Drop the Lease

When lease goes out of scope (or you call lease.free()), MEM_FREE is sent to the node. The node calls cuMemFree_v2 to release the device memory region. No cleanup needed on the client.

Using the Interop Client Directly

You can also test GPU_SUBMIT without writing a grafOS program, using the QUIC interop client:

# Discover GPUs
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
  --addr [::1]:5701 --insecure --op GET_INVENTORY

# Allocate a GPU lease
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
  --addr [::1]:5701 --insecure --op LEASE_ALLOC --cap-request --daemon

# Submit a kernel (PTX loaded from file)
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
  --addr [::1]:5701 --insecure --op GPU_SUBMIT \
  --gpu-binary vectors/gpu/iota.ptx \
  --gpu-kernel iota \
  --gpu-grid 1,1,1 --gpu-block 256,1,1 \
  --gpu-output-size 1024

Failure Modes

STATUS_LOAD_FAILED (3): PTX compilation failed. Usually an architecture mismatch (PTX compiled for sm_90 but GPU is sm_80). Recompile without -arch to let the driver choose.
STATUS_LAUNCH_FAILED (4): Kernel launch or synchronization failed. Common causes: too many threads per block, out-of-bounds memory access, CUDA context error.
STATUS_UNAUTHORIZED (2): No active lease for the GPU resource. Call LEASE_ALLOC first.
STATUS_READ_FAILED (6): cuMemcpyDtoH failed reading output. Check output_offset and output_size are within the allocated region.
LeaseExpired: Lease TTL elapsed before submission completed. Use a longer lease_secs or call lease.renew().
CapacityExceeded: Not enough free VRAM for the requested allocation. Another lease is using the GPU.

Observability

Track with grafos_observe:

gpu_lease_acquired / gpu_lease_dropped — lease lifecycle
gpu_submit_total / gpu_submit_errors — kernel execution counters
gpu_submit_output_bytes — data transferred back from device
gpu_vram_allocated_bytes — current VRAM usage per node

Variations

Vector add: Two input arrays uploaded via args, output read back. Classic GPU hello-world.
Batch submissions: Multiple gpu.submit() calls on the same lease — each reuses the allocated VRAM region.
Large output: Set max_output to match your expected result size. The daemon reads exactly output_size bytes from device memory at output_offset.
AMD ROCm: Same API, different binary format. Compile with hipcc to produce HSACO instead of PTX. Use gpu-submit feature instead of gpu-cuda.
MIG partitioning: On H100/A100 with MIG enabled, NVML reports each MIG instance as a separate device. GpuLeaseManager::Fractional mode maps each MIG instance to one lease slot.
Persistent sessions: For multi-kernel workloads where device memory and loaded modules must persist across launches, see Recipe 30: Persistent GPU Sessions. Sessions keep the CUDA context alive for the lease duration — no re-uploading weights or recompiling PTX between kernels.

Testing

Unit tests (no GPU required):

cargo test -p fabricbios-platform-linux -- gpu
cargo test -p fabricbios-core -- gpu_submit
cargo test -p grafos-std -- gpu

Integration tests (requires NVIDIA GPU + CUDA):

cargo test -p fabricbios-platform-linux --features gpu-cuda -- cuda_vector_add_kernel

Live daemon test (requires GPU node running fabricbiosd):

# Start daemon with CUDA support
cargo run -p fabricbiosd --features gpu-cuda -- control-server --auto-detect --quic-bind '[::1]:5701'

# Run GPU_SUBMIT from another terminal
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
  --addr [::1]:5701 --insecure --op GPU_SUBMIT \
  --gpu-binary vectors/gpu/iota.ptx --gpu-kernel iota \
  --gpu-grid 1,1,1 --gpu-block 256,1,1 --gpu-output-size 1024