Recipe 29: Running a CUDA Kernel on a Leased GPU
HISTORICAL — API UPDATED. This recipe was written against the v0
gpu_submitSDK surface (FabricGpu::submit(...).launch()andGpuSubmitBuilder), which was removed fromgrafos-stdin task #57. The current tasklet-side GPU programming model is the v1 session-op path:GpuBuilder::acquire→GpuSession::new(&lease)→mem_alloc→mem_write→module_load→launch→sync→mem_read→ (RAII drop). See Recipe 30: Persistent GPU Sessions for the canonical pattern anddocs/grafos-std-guide.mdGPU Module section for the full API walkthrough.The v0 QUIC control-plane
GPU_SUBMITwire format (fabricbios-core::gpu_submit::GpuSubmitRequest) that this recipe describes at the daemon/wire level is still supported as a control-plane op for external clients (e.g.fabricbios-quic-interopCLI usage in the “Using the Interop Client Directly” section below still works). Only the tasklet-side SDK wrapper was removed.The recipe text below is preserved as historical context for how the v0 API looked and how it mapped to the daemon-side CUDA operations. Do not copy-paste the Rust snippets into new code.
Situation
You have a compute workload — signal processing, simulation, inference — and you need a GPU for a few seconds or minutes. You don’t want to rent a whole card, manage CUDA drivers, or worry about cleanup if your process crashes. You want:
- A lease on exactly the VRAM you need, not an entire device.
- Submit a compiled kernel as bytes — no CUDA toolkit on the client.
- Output read back after completion.
- Automatic teardown on lease expiry — no orphaned allocations.
What You Build
A program that:
- Discovers a GPU node via fabric inventory.
- Acquires a GPU lease (VRAM allocation on the remote device).
- Submits a compiled PTX kernel with arguments.
- Reads back output from device memory.
- Drops the lease — VRAM is freed, no cleanup needed.
Building Blocks
grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu, GpuSubmitBuilder}fabricbios-coreGPU_SUBMIT wire format (opcode0x0600)fabricbios-platform-linuxCUDA Driver API FFI (cuda_ffimodule,gpu-cudafeature)fabricbios-platform-linuxNVML FFI (nvml_ffimodule) for VRAM discovery
See:
- GPU leasing and submission API (source)
- GPU_SUBMIT wire format (source)
- CUDA/NVML FFI and GpuKernelRunner (source)
- Daemon GPU dispatch (source)
Design
How the Pieces Fit Together
Client Node (fabricbiosd) ────── ────────────────── GpuBuilder::acquire() → LEASE_ALLOC ──QUIC──→ GpuLeaseManager::allocate() cuMemAlloc_v2() → device ptr ← lease_id ◄──────────
gpu.submit("iota", ptx) .grid([1,1,1]) .block([256,1,1]) .arg(&n_bytes) .max_output(1024) .launch() → GPU_SUBMIT ──QUIC──→ GpuKernelRunner::load(ptx) cuModuleLoadData() cuModuleGetFunction("iota") cuLaunchKernel(...) cuCtxSynchronize() cuMemcpyDtoH(output) ← output bytes ◄──────
drop(lease) → MEM_FREE ──QUIC──→ GpuLeaseManager::free() cuMemFree_v2()Control Plane vs Data Plane
Control plane (QUIC stream, fabricBIOS wire protocol):
LEASE_ALLOC— acquires a GPU resource, allocates VRAM on the device.GPU_SUBMIT— sends kernel binary + args, receives output bytes.MEM_FREE— releases the lease and frees device memory.
Data plane (device memory on the GPU node):
cuMemAlloc_v2allocates aCUdeviceptr(device-side buffer).- Kernel arguments can include the device pointer so kernels write results into leased memory.
- After kernel completion, the daemon copies
output_sizebytes fromoutput_offsetin the region back to the client.
There is no host↔device streaming — the kernel binary and output travel over the QUIC control stream. This keeps the protocol simple: one request, one response, synchronous execution.
Kernel Binary Format
CUDA kernels are compiled to PTX (portable) or cubin (architecture-specific). PTX is preferred because the CUDA driver JIT-compiles it for whatever GPU is present:
nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx # H100nvcc --ptx iota.cu -o iota.ptx # any GPU (driver picks arch)The PTX text is sent as the binary field of GPU_SUBMIT. The daemon calls cuModuleLoadData() which accepts PTX directly.
Walkthrough (Implementation Sketch)
1. Write the CUDA Kernel
// iota.cu — writes threadIdx.x + 1 to each output slotextern "C" __global__ void iota(float* out, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = (float)(i + 1);}Compile: nvcc --ptx -arch=sm_90 iota.cu -o iota.ptx
2. Acquire a GPU Lease
use grafos_std::gpu::GpuBuilder;
let lease = GpuBuilder::new() .min_vram(1024 * 1024) // 1 MiB .lease_secs(60) .acquire()?;Under the hood this sends LEASE_ALLOC to a GPU node. The node’s GpuLeaseManager calls cuMemAlloc_v2 to reserve a device memory region.
3. Submit the Kernel
let ptx = include_bytes!("../vectors/gpu/iota.ptx");let n: u32 = 256;
let result = lease.gpu() .submit("iota", ptx) .grid([1, 1, 1]) .block([256, 1, 1]) .arg(&n.to_ne_bytes()) // kernel arg: int n .max_output(256 * 4) // read back 256 floats .launch()?;
assert_eq!(result.exit_code, 0);The daemon receives GPU_SUBMIT, loads PTX via cuModuleLoadData, gets the iota function via cuModuleGetFunction, injects the leased region’s device pointer as the first argument (float* out), launches the kernel, synchronizes, and copies output_size bytes back from device memory.
4. Read the Output
let floats: Vec<f32> = result.output .chunks_exact(4) .map(|c| f32::from_ne_bytes(c.try_into().unwrap())) .collect();
// floats[0] = 1.0, floats[1] = 2.0, ..., floats[255] = 256.0for (i, &v) in floats.iter().enumerate() { assert_eq!(v, (i + 1) as f32);}5. Drop the Lease
When lease goes out of scope (or you call lease.free()), MEM_FREE is sent to the node. The node calls cuMemFree_v2 to release the device memory region. No cleanup needed on the client.
Using the Interop Client Directly
You can also test GPU_SUBMIT without writing a grafOS program, using the QUIC interop client:
# Discover GPUscargo run -p fabricbios-quic-interop --features quic-interop -- client \ --addr [::1]:5701 --insecure --op GET_INVENTORY
# Allocate a GPU leasecargo run -p fabricbios-quic-interop --features quic-interop -- client \ --addr [::1]:5701 --insecure --op LEASE_ALLOC --cap-request --daemon
# Submit a kernel (PTX loaded from file)cargo run -p fabricbios-quic-interop --features quic-interop -- client \ --addr [::1]:5701 --insecure --op GPU_SUBMIT \ --gpu-binary vectors/gpu/iota.ptx \ --gpu-kernel iota \ --gpu-grid 1,1,1 --gpu-block 256,1,1 \ --gpu-output-size 1024Failure Modes
STATUS_LOAD_FAILED(3): PTX compilation failed. Usually an architecture mismatch (PTX compiled for sm_90 but GPU is sm_80). Recompile without-archto let the driver choose.STATUS_LAUNCH_FAILED(4): Kernel launch or synchronization failed. Common causes: too many threads per block, out-of-bounds memory access, CUDA context error.STATUS_UNAUTHORIZED(2): No active lease for the GPU resource. CallLEASE_ALLOCfirst.STATUS_READ_FAILED(6):cuMemcpyDtoHfailed reading output. Checkoutput_offsetandoutput_sizeare within the allocated region.LeaseExpired: Lease TTL elapsed before submission completed. Use a longerlease_secsor calllease.renew().CapacityExceeded: Not enough free VRAM for the requested allocation. Another lease is using the GPU.
Observability
Track with grafos_observe:
gpu_lease_acquired/gpu_lease_dropped— lease lifecyclegpu_submit_total/gpu_submit_errors— kernel execution countersgpu_submit_output_bytes— data transferred back from devicegpu_vram_allocated_bytes— current VRAM usage per node
Variations
- Vector add: Two input arrays uploaded via args, output read back. Classic GPU hello-world.
- Batch submissions: Multiple
gpu.submit()calls on the same lease — each reuses the allocated VRAM region. - Large output: Set
max_outputto match your expected result size. The daemon reads exactlyoutput_sizebytes from device memory atoutput_offset. - AMD ROCm: Same API, different binary format. Compile with
hipccto produce HSACO instead of PTX. Usegpu-submitfeature instead ofgpu-cuda. - MIG partitioning: On H100/A100 with MIG enabled, NVML reports each MIG instance as a separate device.
GpuLeaseManager::Fractionalmode maps each MIG instance to one lease slot. - Persistent sessions: For multi-kernel workloads where device memory and loaded modules must persist across launches, see Recipe 30: Persistent GPU Sessions. Sessions keep the CUDA context alive for the lease duration — no re-uploading weights or recompiling PTX between kernels.
Testing
Unit tests (no GPU required):
cargo test -p fabricbios-platform-linux -- gpucargo test -p fabricbios-core -- gpu_submitcargo test -p grafos-std -- gpuIntegration tests (requires NVIDIA GPU + CUDA):
cargo test -p fabricbios-platform-linux --features gpu-cuda -- cuda_vector_add_kernelLive daemon test (requires GPU node running fabricbiosd):
# Start daemon with CUDA supportcargo run -p fabricbiosd --features gpu-cuda -- control-server --auto-detect --quic-bind '[::1]:5701'
# Run GPU_SUBMIT from another terminalcargo run -p fabricbios-quic-interop --features quic-interop -- client \ --addr [::1]:5701 --insecure --op GPU_SUBMIT \ --gpu-binary vectors/gpu/iota.ptx --gpu-kernel iota \ --gpu-grid 1,1,1 --gpu-block 256,1,1 --gpu-output-size 1024