Skip to content

Recipe 29: Running a CUDA Kernel on a Leased GPU

HISTORICAL — API UPDATED. This recipe was written against the v0 gpu_submit SDK surface (FabricGpu::submit(...).launch() and GpuSubmitBuilder), which was removed from grafos-std in task #57. The current tasklet-side GPU programming model is the v1 session-op path: GpuBuilder::acquireGpuSession::new(&lease)mem_allocmem_writemodule_loadlaunchsyncmem_read → (RAII drop). See Recipe 30: Persistent GPU Sessions for the canonical pattern and docs/grafos-std-guide.md GPU Module section for the full API walkthrough.

The v0 QUIC control-plane GPU_SUBMIT wire format (fabricbios-core::gpu_submit::GpuSubmitRequest) that this recipe describes at the daemon/wire level is still supported as a control-plane op for external clients (e.g. fabricbios-quic-interop CLI usage in the “Using the Interop Client Directly” section below still works). Only the tasklet-side SDK wrapper was removed.

The recipe text below is preserved as historical context for how the v0 API looked and how it mapped to the daemon-side CUDA operations. Do not copy-paste the Rust snippets into new code.

Situation

You have a compute workload — signal processing, simulation, inference — and you need a GPU for a few seconds or minutes. You don’t want to rent a whole card, manage CUDA drivers, or worry about cleanup if your process crashes. You want:

  • A lease on exactly the VRAM you need, not an entire device.
  • Submit a compiled kernel as bytes — no CUDA toolkit on the client.
  • Output read back after completion.
  • Automatic teardown on lease expiry — no orphaned allocations.

What You Build

A program that:

  1. Discovers a GPU node via fabric inventory.
  2. Acquires a GPU lease (VRAM allocation on the remote device).
  3. Submits a compiled PTX kernel with arguments.
  4. Reads back output from device memory.
  5. Drops the lease — VRAM is freed, no cleanup needed.

Building Blocks

  • grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu, GpuSubmitBuilder}
  • fabricbios-core GPU_SUBMIT wire format (opcode 0x0600)
  • fabricbios-platform-linux CUDA Driver API FFI (cuda_ffi module, gpu-cuda feature)
  • fabricbios-platform-linux NVML FFI (nvml_ffi module) for VRAM discovery

See:

Design

How the Pieces Fit Together

Client Node (fabricbiosd)
────── ──────────────────
GpuBuilder::acquire()
→ LEASE_ALLOC ──QUIC──→ GpuLeaseManager::allocate()
cuMemAlloc_v2() → device ptr
← lease_id ◄──────────
gpu.submit("iota", ptx)
.grid([1,1,1])
.block([256,1,1])
.arg(&n_bytes)
.max_output(1024)
.launch()
→ GPU_SUBMIT ──QUIC──→ GpuKernelRunner::load(ptx)
cuModuleLoadData()
cuModuleGetFunction("iota")
cuLaunchKernel(...)
cuCtxSynchronize()
cuMemcpyDtoH(output)
← output bytes ◄──────
drop(lease)
→ MEM_FREE ──QUIC──→ GpuLeaseManager::free()
cuMemFree_v2()

Control Plane vs Data Plane

Control plane (QUIC stream, fabricBIOS wire protocol):

  • LEASE_ALLOC — acquires a GPU resource, allocates VRAM on the device.
  • GPU_SUBMIT — sends kernel binary + args, receives output bytes.
  • MEM_FREE — releases the lease and frees device memory.

Data plane (device memory on the GPU node):

  • cuMemAlloc_v2 allocates a CUdeviceptr (device-side buffer).
  • Kernel arguments can include the device pointer so kernels write results into leased memory.
  • After kernel completion, the daemon copies output_size bytes from output_offset in the region back to the client.

There is no host↔device streaming — the kernel binary and output travel over the QUIC control stream. This keeps the protocol simple: one request, one response, synchronous execution.

Kernel Binary Format

CUDA kernels are compiled to PTX (portable) or cubin (architecture-specific). PTX is preferred because the CUDA driver JIT-compiles it for whatever GPU is present:

Terminal window
nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx # H100
nvcc --ptx iota.cu -o iota.ptx # any GPU (driver picks arch)

The PTX text is sent as the binary field of GPU_SUBMIT. The daemon calls cuModuleLoadData() which accepts PTX directly.

Walkthrough (Implementation Sketch)

1. Write the CUDA Kernel

// iota.cu — writes threadIdx.x + 1 to each output slot
extern "C" __global__ void iota(float* out, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) out[i] = (float)(i + 1);
}

Compile: nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx

2. Acquire a GPU Lease

use grafos_std::gpu::GpuBuilder;
let lease = GpuBuilder::new()
.min_vram(1024 * 1024) // 1 MiB
.lease_secs(60)
.acquire()?;

Under the hood this sends LEASE_ALLOC to a GPU node. The node’s GpuLeaseManager calls cuMemAlloc_v2 to reserve a device memory region.

3. Submit the Kernel

let ptx = include_bytes!("../vectors/gpu/iota.ptx");
let n: u32 = 256;
let result = lease.gpu()
.submit("iota", ptx)
.grid([1, 1, 1])
.block([256, 1, 1])
.arg(&n.to_ne_bytes()) // kernel arg: int n
.max_output(256 * 4) // read back 256 floats
.launch()?;
assert_eq!(result.exit_code, 0);

The daemon receives GPU_SUBMIT, loads PTX via cuModuleLoadData, gets the iota function via cuModuleGetFunction, injects the leased region’s device pointer as the first argument (float* out), launches the kernel, synchronizes, and copies output_size bytes back from device memory.

4. Read the Output

let floats: Vec<f32> = result.output
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
// floats[0] = 1.0, floats[1] = 2.0, ..., floats[255] = 256.0
for (i, &v) in floats.iter().enumerate() {
assert_eq!(v, (i + 1) as f32);
}

5. Drop the Lease

When lease goes out of scope (or you call lease.free()), MEM_FREE is sent to the node. The node calls cuMemFree_v2 to release the device memory region. No cleanup needed on the client.

Using the Interop Client Directly

You can also test GPU_SUBMIT without writing a grafOS program, using the QUIC interop client:

Terminal window
# Discover GPUs
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
--addr [::1]:5701 --insecure --op GET_INVENTORY
# Allocate a GPU lease
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
--addr [::1]:5701 --insecure --op LEASE_ALLOC --cap-request --daemon
# Submit a kernel (PTX loaded from file)
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
--addr [::1]:5701 --insecure --op GPU_SUBMIT \
--gpu-binary vectors/gpu/iota.ptx \
--gpu-kernel iota \
--gpu-grid 1,1,1 --gpu-block 256,1,1 \
--gpu-output-size 1024

Failure Modes

  • STATUS_LOAD_FAILED (3): PTX compilation failed. Usually an architecture mismatch (PTX compiled for sm_90 but GPU is sm_80). Recompile without -arch to let the driver choose.
  • STATUS_LAUNCH_FAILED (4): Kernel launch or synchronization failed. Common causes: too many threads per block, out-of-bounds memory access, CUDA context error.
  • STATUS_UNAUTHORIZED (2): No active lease for the GPU resource. Call LEASE_ALLOC first.
  • STATUS_READ_FAILED (6): cuMemcpyDtoH failed reading output. Check output_offset and output_size are within the allocated region.
  • LeaseExpired: Lease TTL elapsed before submission completed. Use a longer lease_secs or call lease.renew().
  • CapacityExceeded: Not enough free VRAM for the requested allocation. Another lease is using the GPU.

Observability

Track with grafos_observe:

  • gpu_lease_acquired / gpu_lease_dropped — lease lifecycle
  • gpu_submit_total / gpu_submit_errors — kernel execution counters
  • gpu_submit_output_bytes — data transferred back from device
  • gpu_vram_allocated_bytes — current VRAM usage per node

Variations

  • Vector add: Two input arrays uploaded via args, output read back. Classic GPU hello-world.
  • Batch submissions: Multiple gpu.submit() calls on the same lease — each reuses the allocated VRAM region.
  • Large output: Set max_output to match your expected result size. The daemon reads exactly output_size bytes from device memory at output_offset.
  • AMD ROCm: Same API, different binary format. Compile with hipcc to produce HSACO instead of PTX. Use gpu-submit feature instead of gpu-cuda.
  • MIG partitioning: On H100/A100 with MIG enabled, NVML reports each MIG instance as a separate device. GpuLeaseManager::Fractional mode maps each MIG instance to one lease slot.
  • Persistent sessions: For multi-kernel workloads where device memory and loaded modules must persist across launches, see Recipe 30: Persistent GPU Sessions. Sessions keep the CUDA context alive for the lease duration — no re-uploading weights or recompiling PTX between kernels.

Testing

Unit tests (no GPU required):

Terminal window
cargo test -p fabricbios-platform-linux -- gpu
cargo test -p fabricbios-core -- gpu_submit
cargo test -p grafos-std -- gpu

Integration tests (requires NVIDIA GPU + CUDA):

Terminal window
cargo test -p fabricbios-platform-linux --features gpu-cuda -- cuda_vector_add_kernel

Live daemon test (requires GPU node running fabricbiosd):

Terminal window
# Start daemon with CUDA support
cargo run -p fabricbiosd --features gpu-cuda -- control-server --auto-detect --quic-bind '[::1]:5701'
# Run GPU_SUBMIT from another terminal
cargo run -p fabricbios-quic-interop --features quic-interop -- client \
--addr [::1]:5701 --insecure --op GPU_SUBMIT \
--gpu-binary vectors/gpu/iota.ptx --gpu-kernel iota \
--gpu-grid 1,1,1 --gpu-block 256,1,1 --gpu-output-size 1024