Recipe 33: GPU as a Pure Function Call

HISTORICAL — API UPDATED. This recipe was written against the v0 gpu_submit SDK surface (FabricGpu::submit(...).launch() and GpuSubmitBuilder), which was removed from grafos-std in task #57. The “single function call” framing (gpu_call(input)) was a direct reflection of the v0 one-shot builder.

In the v1 session-op model, the equivalent pattern is:
let lease = GpuBuilder::new().min_vram(bytes).acquire()?;
let mut sess = GpuSession::new(&lease);
let buf = sess.mem_alloc(bytes)?;
sess.mem_write(&buf, 0, &input)?;
let module = sess.module_load(&ptx)?;
sess.launch_with_args(&module, "kernel", grid, block,
    KernelArgs::new().push_buffer(&buf))?;
sess.sync()?;
let output = sess.mem_read(&buf, 0, output_len)?;
// buf, module, sess, lease all drop — RAII teardown.
This is still “a single logical operation from the programmer’s perspective,” but explicit about the device-memory and module lifecycle. The v1 decomposition is intentional: it makes the authority model (cap-rights, lease scoping) and the failure modes (per-op SESSION_STATUS_*) visible at the call site rather than hiding them inside a one-shot builder.

See Recipe 30: Persistent GPU Sessions for the canonical v1 pattern and docs/grafos-std-guide.md GPU Module section for the full walkthrough. The text below is preserved as historical context for how the v0 “pure function” programming model looked; do not copy the Rust snippets into new code.

Situation

You have a CPU-bound pipeline with one bottleneck step that would be fast on a GPU: a matrix multiply, an FFT, a convolution, image scaling. The step is a pure function — data in, data out, no state carried between calls.

In a conventional setup, using a GPU for this means:

Install CUDA (or ROCm, or OpenCL).
Manage device contexts and memory allocations in your code.
Handle driver version mismatches across machines.
If it’s a cloud GPU: provision an instance, SSH in, install drivers, run your job, tear down.

You don’t want any of that. You want result = gpu_call(input) — like calling a function that happens to run on someone else’s hardware.

What You Build

A pattern where GPU acceleration is a single function call:

Write a CUDA kernel once, compile to PTX offline.
Lease a VRAM slice (just enough for your I/O buffers).
gpu.submit(kernel, ptx).arg(&input).launch() → output bytes.
Drop the lease.

No CUDA import in your code. No device memory management. No context lifecycle. The node handles all of it.

Building Blocks

grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu, GpuSubmitBuilder} — source
fabricbios-core::gpu_submit — wire format (opcode 0x0600) — source

Design

GPU_SUBMIT Is an RPC

GPU_SUBMIT sends a compiled kernel binary and arguments over the QUIC control stream to a GPU node. The node loads the PTX, launches the kernel, synchronizes, copies output from device memory, and returns the bytes. From your perspective, it’s a remote procedure call with bytes-in, bytes-out semantics.

Your code                         GPU node
────────                          ────────
gpu.submit("fft", ptx)
  .arg(&signal_bytes)
  .max_output(signal_len)
  .launch()
    → GPU_SUBMIT ──QUIC──→  cuModuleLoadData(ptx)
                                  cuModuleGetFunction("fft")
                                  cuLaunchKernel(...)
                                  cuCtxSynchronize()
                                  cuMemcpyDtoH(output)
    ← output bytes ◄──────

One network round-trip. No device context in your process. No CUDA headers. No driver dependency.

When This Is the Right Pattern

Use stateless GPU dispatch when:

The function is pure: same input always produces same output. No accumulated state.
Each call is independent: no need to keep data on the GPU between calls.
The compute dominates the transfer: if you’re sending 1 KB and getting back 1 KB, the QUIC round-trip is negligible. If you’re sending 1 GB, consider a session (Recipe 30) to avoid re-uploading.
You want simplicity: one call, one result, no cleanup.

Don’t use this pattern when:

You need multiple kernel launches on shared device memory → use sessions (Recipe 30).
You’re running the same kernel thousands of times with different args → use burst (Recipe 32).
The data is already on the GPU from a prior step → sessions keep it there.

Contrast with Recipe 29

Recipe 29 is an API tutorial: “here’s how GPU_SUBMIT works, field by field.” This recipe is a pattern: when and why to treat GPU compute as a pure function, and how to integrate it into a larger pipeline without restructuring your code around GPU lifecycle.

Walkthrough (Implementation Sketch)

1. Write the Kernel (One-Time, Offline)

// scale.cu — multiply every element by a factor
extern "C" __global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

Compile: nvcc --ptx scale.cu -o scale.ptx

This is the only step that touches CUDA tooling. Your application code never imports CUDA.

2. Wrap GPU Dispatch as a Function

use grafos_std::gpu::GpuBuilder;

fn gpu_scale(input: &[f32], factor: f32) -> grafos_std::Result<Vec<f32>> {
    let ptx = include_bytes!("../vectors/gpu/scale.ptx");
    let n = input.len() as u32;
    let input_bytes: Vec<u8> = input.iter().flat_map(|f| f.to_ne_bytes()).collect();
    let output_size = input_bytes.len();

    // Lease just enough VRAM for input + output
    let lease = GpuBuilder::new()
        .min_vram((output_size * 2) as u64)
        .lease_secs(30)
        .acquire()?;

    let result = lease.gpu()
        .submit("scale", ptx)
        .grid([((n + 255) / 256), 1, 1])
        .block([256, 1, 1])
        .arg(&input_bytes)
        .arg(&n.to_ne_bytes())
        .arg(&factor.to_ne_bytes())
        .max_output(output_size)
        .launch()?;

    // Lease drops here — VRAM freed.

    let floats: Vec<f32> = result.output
        .chunks_exact(4)
        .map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
        .collect();
    Ok(floats)
}

3. Call It Like Any Other Function

fn pipeline(raw_signal: &[f32]) -> grafos_std::Result<Vec<f32>> {
    // CPU step 1: normalize
    let normalized = normalize(raw_signal);

    // GPU step: scale by calibration factor (the bottleneck)
    let scaled = gpu_scale(&normalized, 2.5)?;

    // CPU step 2: filter
    let filtered = low_pass_filter(&scaled);

    Ok(filtered)
}

The caller doesn’t know or care that gpu_scale runs on remote hardware. It takes a slice, returns a Vec. The lease is acquired and dropped inside the function — invisible to the caller.

4. Reuse the Lease for Multiple Calls (Optional)

If you’re calling the function in a loop, acquire the lease once outside the loop:

let lease = GpuBuilder::new()
    .min_vram(8 * 1024 * 1024)
    .lease_secs(120)
    .acquire()?;

for batch in batches {
    let ptx = include_bytes!("../vectors/gpu/scale.ptx");
    let result = lease.gpu()
        .submit("scale", ptx)
        .grid([((batch.len() as u32 + 255) / 256), 1, 1])
        .block([256, 1, 1])
        .arg(&batch_to_bytes(batch))
        .arg(&(batch.len() as u32).to_ne_bytes())
        .arg(&factor.to_ne_bytes())
        .max_output(batch.len() * 4)
        .launch()?;
    process_result(&result.output);
}

drop(lease);

Each submit still creates and destroys a context on the node, but the lease (VRAM reservation) persists. This avoids the lease acquisition overhead per call while keeping the stateless dispatch semantics.

Failure Modes

LeaseExpired: The lease TTL elapsed before the kernel completed. Use a longer lease_secs or renew with lease.renew().
STATUS_LOAD_FAILED: PTX compilation failed on the node. Architecture mismatch — recompile without -arch so the driver JIT-compiles for whatever GPU is present.
STATUS_LAUNCH_FAILED: Too many threads per block, invalid kernel name, or argument count mismatch.
CapacityExceeded: No free VRAM on any node. Wait and retry, or reduce min_vram.
Disconnected: Node went away mid-execution. Retry the call — it’s stateless, so safe to re-invoke.

Observability

gpu_submit_total / gpu_submit_errors — call success rate
gpu_submit_output_bytes — data volume returned
gpu_lease_acquired / gpu_lease_dropped — lease churn (should be symmetric)
gpu_vram_allocated_bytes — ensure you’re not over-requesting
Latency histogram: time from launch() call to return — dominated by kernel execution, not framework overhead

Variations

FFT: Compile a cuFFT-based kernel to PTX. Send time-domain signal, get frequency-domain back.
Image resize: Send raw pixels, get resized pixels. Useful in ingest pipelines where one step needs GPU scaling.
Cryptographic hashing: GPU-accelerated SHA-256 over a large dataset. Send chunks, get hashes.
Sorting: GPU radix sort for large arrays. Faster than CPU for 10M+ elements.
Decompression: GPU-accelerated zstd or LZ4 decompression of large blocks.

Testing

cargo test -p fabricbios-core -- gpu_submit  # wire format roundtrips
grafos deploy run --requires gpu --tasklet gpu-pure-function --json