Skip to content

Recipe 33: GPU as a Pure Function Call

HISTORICAL — API UPDATED. This recipe was written against the v0 gpu_submit SDK surface (FabricGpu::submit(...).launch() and GpuSubmitBuilder), which was removed from grafos-std in task #57. The “single function call” framing (gpu_call(input)) was a direct reflection of the v0 one-shot builder.

In the v1 session-op model, the equivalent pattern is:

let lease = GpuBuilder::new().min_vram(bytes).acquire()?;
let mut sess = GpuSession::new(&lease);
let buf = sess.mem_alloc(bytes)?;
sess.mem_write(&buf, 0, &input)?;
let module = sess.module_load(&ptx)?;
sess.launch_with_args(&module, "kernel", grid, block,
KernelArgs::new().push_buffer(&buf))?;
sess.sync()?;
let output = sess.mem_read(&buf, 0, output_len)?;
// buf, module, sess, lease all drop — RAII teardown.

This is still “a single logical operation from the programmer’s perspective,” but explicit about the device-memory and module lifecycle. The v1 decomposition is intentional: it makes the authority model (cap-rights, lease scoping) and the failure modes (per-op SESSION_STATUS_*) visible at the call site rather than hiding them inside a one-shot builder.

See Recipe 30: Persistent GPU Sessions for the canonical v1 pattern and docs/grafos-std-guide.md GPU Module section for the full walkthrough. The text below is preserved as historical context for how the v0 “pure function” programming model looked; do not copy the Rust snippets into new code.

Situation

You have a CPU-bound pipeline with one bottleneck step that would be fast on a GPU: a matrix multiply, an FFT, a convolution, image scaling. The step is a pure function — data in, data out, no state carried between calls.

In a conventional setup, using a GPU for this means:

  • Install CUDA (or ROCm, or OpenCL).
  • Manage device contexts and memory allocations in your code.
  • Handle driver version mismatches across machines.
  • If it’s a cloud GPU: provision an instance, SSH in, install drivers, run your job, tear down.

You don’t want any of that. You want result = gpu_call(input) — like calling a function that happens to run on someone else’s hardware.

What You Build

A pattern where GPU acceleration is a single function call:

  • Write a CUDA kernel once, compile to PTX offline.
  • Lease a VRAM slice (just enough for your I/O buffers).
  • gpu.submit(kernel, ptx).arg(&input).launch() → output bytes.
  • Drop the lease.

No CUDA import in your code. No device memory management. No context lifecycle. The node handles all of it.

Building Blocks

  • grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu, GpuSubmitBuilder}source
  • fabricbios-core::gpu_submit — wire format (opcode 0x0600) — source

See also:

Design

GPU_SUBMIT Is an RPC

GPU_SUBMIT sends a compiled kernel binary and arguments over the QUIC control stream to a GPU node. The node loads the PTX, launches the kernel, synchronizes, copies output from device memory, and returns the bytes. From your perspective, it’s a remote procedure call with bytes-in, bytes-out semantics.

Your code GPU node
──────── ────────
gpu.submit("fft", ptx)
.arg(&signal_bytes)
.max_output(signal_len)
.launch()
→ GPU_SUBMIT ──QUIC──→ cuModuleLoadData(ptx)
cuModuleGetFunction("fft")
cuLaunchKernel(...)
cuCtxSynchronize()
cuMemcpyDtoH(output)
← output bytes ◄──────

One network round-trip. No device context in your process. No CUDA headers. No driver dependency.

When This Is the Right Pattern

Use stateless GPU dispatch when:

  • The function is pure: same input always produces same output. No accumulated state.
  • Each call is independent: no need to keep data on the GPU between calls.
  • The compute dominates the transfer: if you’re sending 1 KB and getting back 1 KB, the QUIC round-trip is negligible. If you’re sending 1 GB, consider a session (Recipe 30) to avoid re-uploading.
  • You want simplicity: one call, one result, no cleanup.

Don’t use this pattern when:

  • You need multiple kernel launches on shared device memory → use sessions (Recipe 30).
  • You’re running the same kernel thousands of times with different args → use burst (Recipe 32).
  • The data is already on the GPU from a prior step → sessions keep it there.

Contrast with Recipe 29

Recipe 29 is an API tutorial: “here’s how GPU_SUBMIT works, field by field.” This recipe is a pattern: when and why to treat GPU compute as a pure function, and how to integrate it into a larger pipeline without restructuring your code around GPU lifecycle.

Walkthrough (Implementation Sketch)

1. Write the Kernel (One-Time, Offline)

// scale.cu — multiply every element by a factor
extern "C" __global__ void scale(float* data, int n, float factor) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] *= factor;
}

Compile: nvcc --ptx scale.cu -o scale.ptx

This is the only step that touches CUDA tooling. Your application code never imports CUDA.

2. Wrap GPU Dispatch as a Function

use grafos_std::gpu::GpuBuilder;
fn gpu_scale(input: &[f32], factor: f32) -> grafos_std::Result<Vec<f32>> {
let ptx = include_bytes!("../vectors/gpu/scale.ptx");
let n = input.len() as u32;
let input_bytes: Vec<u8> = input.iter().flat_map(|f| f.to_ne_bytes()).collect();
let output_size = input_bytes.len();
// Lease just enough VRAM for input + output
let lease = GpuBuilder::new()
.min_vram((output_size * 2) as u64)
.lease_secs(30)
.acquire()?;
let result = lease.gpu()
.submit("scale", ptx)
.grid([((n + 255) / 256), 1, 1])
.block([256, 1, 1])
.arg(&input_bytes)
.arg(&n.to_ne_bytes())
.arg(&factor.to_ne_bytes())
.max_output(output_size)
.launch()?;
// Lease drops here — VRAM freed.
let floats: Vec<f32> = result.output
.chunks_exact(4)
.map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
.collect();
Ok(floats)
}

3. Call It Like Any Other Function

fn pipeline(raw_signal: &[f32]) -> grafos_std::Result<Vec<f32>> {
// CPU step 1: normalize
let normalized = normalize(raw_signal);
// GPU step: scale by calibration factor (the bottleneck)
let scaled = gpu_scale(&normalized, 2.5)?;
// CPU step 2: filter
let filtered = low_pass_filter(&scaled);
Ok(filtered)
}

The caller doesn’t know or care that gpu_scale runs on remote hardware. It takes a slice, returns a Vec. The lease is acquired and dropped inside the function — invisible to the caller.

4. Reuse the Lease for Multiple Calls (Optional)

If you’re calling the function in a loop, acquire the lease once outside the loop:

let lease = GpuBuilder::new()
.min_vram(8 * 1024 * 1024)
.lease_secs(120)
.acquire()?;
for batch in batches {
let ptx = include_bytes!("../vectors/gpu/scale.ptx");
let result = lease.gpu()
.submit("scale", ptx)
.grid([((batch.len() as u32 + 255) / 256), 1, 1])
.block([256, 1, 1])
.arg(&batch_to_bytes(batch))
.arg(&(batch.len() as u32).to_ne_bytes())
.arg(&factor.to_ne_bytes())
.max_output(batch.len() * 4)
.launch()?;
process_result(&result.output);
}
drop(lease);

Each submit still creates and destroys a context on the node, but the lease (VRAM reservation) persists. This avoids the lease acquisition overhead per call while keeping the stateless dispatch semantics.

Failure Modes

  • LeaseExpired: The lease TTL elapsed before the kernel completed. Use a longer lease_secs or renew with lease.renew().
  • STATUS_LOAD_FAILED: PTX compilation failed on the node. Architecture mismatch — recompile without -arch so the driver JIT-compiles for whatever GPU is present.
  • STATUS_LAUNCH_FAILED: Too many threads per block, invalid kernel name, or argument count mismatch.
  • CapacityExceeded: No free VRAM on any node. Wait and retry, or reduce min_vram.
  • Disconnected: Node went away mid-execution. Retry the call — it’s stateless, so safe to re-invoke.

Observability

  • gpu_submit_total / gpu_submit_errors — call success rate
  • gpu_submit_output_bytes — data volume returned
  • gpu_lease_acquired / gpu_lease_dropped — lease churn (should be symmetric)
  • gpu_vram_allocated_bytes — ensure you’re not over-requesting
  • Latency histogram: time from launch() call to return — dominated by kernel execution, not framework overhead

Variations

  • FFT: Compile a cuFFT-based kernel to PTX. Send time-domain signal, get frequency-domain back.
  • Image resize: Send raw pixels, get resized pixels. Useful in ingest pipelines where one step needs GPU scaling.
  • Cryptographic hashing: GPU-accelerated SHA-256 over a large dataset. Send chunks, get hashes.
  • Sorting: GPU radix sort for large arrays. Faster than CPU for 10M+ elements.
  • Decompression: GPU-accelerated zstd or LZ4 decompression of large blocks.

Testing

Terminal window
cargo test -p fabricbios-core -- gpu_submit # wire format roundtrips
grafos deploy run --requires gpu --tasklet gpu-pure-function --json