Recipe 5: An Inference Server That Shares GPUs Without Containers

Situation

You have multiple models or services that need GPU access. Traditional approaches to sharing a GPU run into these problems:

Whole-device allocation wastes VRAM and compute.
Container-per-model isolates processes but does not automatically solve VRAM fragmentation and scheduling.
GPU scheduling stacks are complex and operationally heavy.

In a lease-based model, you aim for:

VRAM scoped to a lease.
Session lifetime scoped to a lease.
Fast teardown on drop or expiry.

The goal of this recipe is to show the pattern: GPU access becomes a leased resource, and inference becomes “acquire -> run -> drop”, with observability.

What You Build

A simple inference service with:

Per-model GPU lease acquisition.
A request/response interface (RPC) that uses leased memory as its hot path.
A teardown model that returns VRAM capacity immediately when the lease is dropped.
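The teardown property maps naturally onto Rust's Drop trait. The following is a minimal sketch of that idea only: VramPool and VramLease are hypothetical stand-ins written for this illustration, not grafos types, and the real runtime presumably tracks capacity on the host side rather than in a process-local counter.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the runtime's VRAM accounting (not a grafos type).
#[derive(Clone)]
pub struct VramPool {
    free: Arc<Mutex<u64>>,
}

/// Hypothetical lease handle: owns a VRAM reservation until it is dropped.
pub struct VramLease {
    pool: VramPool,
    bytes: u64,
}

impl VramPool {
    pub fn new(capacity: u64) -> Self {
        VramPool { free: Arc::new(Mutex::new(capacity)) }
    }

    /// Bytes currently available for new leases.
    pub fn free_bytes(&self) -> u64 {
        *self.free.lock().unwrap()
    }

    /// Reserve `bytes`, or return None if the pool cannot satisfy the request.
    pub fn acquire(&self, bytes: u64) -> Option<VramLease> {
        let mut free = self.free.lock().unwrap();
        if *free < bytes {
            return None;
        }
        *free -= bytes;
        Some(VramLease { pool: self.clone(), bytes })
    }
}

impl Drop for VramLease {
    fn drop(&mut self) {
        // Capacity goes back to the pool as soon as the handle leaves scope.
        *self.pool.free.lock().unwrap() += self.bytes;
    }
}
```

Because the release happens in drop, an early return or a panic that unwinds through the owner of the lease still gives the capacity back without any explicit cleanup call.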

Building Blocks

grafos_std::gpu::GpuBuilder and GpuLease
grafos_rpc::{RpcServer, RpcClient} for lease-backed RPC
grafos_tensor (if you model tensors explicitly; optional for the conceptual pattern)
grafos_observe for metering

Design

Resource Model

Per model instance:

Acquire a GPU lease with min_vram(bytes) and an appropriate TTL.
Keep the lease handle for the lifetime of that model instance.

On request:

Deserialize input.
Upload input to GPU (or treat input as bytes passed to a kernel).
Launch kernel(s).
Return result bytes.

Isolation and Safety

This recipe assumes the host/runtime enforces:

Lease scoping of GPU submission APIs.
Fuel/time limits on kernels (or a watchdog).
Output size limits.

The Rust side should still implement:

Input size bounds.
Output size bounds.
Explicit error paths for expired/disconnected leases.
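The Rust-side checks above can be sketched as a thin wrapper around whatever actually runs the kernel. Everything named here is an assumption made for illustration: the limits, the InferError enum, the Backend trait, and EchoBackend are hypothetical and do not come from the grafos crates.

```rust
use std::fmt;

/// Hypothetical limits; real values depend on the model and the runtime.
pub const MAX_INPUT_BYTES: usize = 1 << 20;
pub const MAX_OUTPUT_BYTES: usize = 4 << 20;

/// Hypothetical error type with one explicit variant per failure path.
#[derive(Debug, PartialEq)]
pub enum InferError {
    InputTooLarge { len: usize, max: usize },
    OutputTooLarge { len: usize, max: usize },
    LeaseExpired,
    Disconnected,
}

impl fmt::Display for InferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferError::InputTooLarge { len, max } => {
                write!(f, "input is {} bytes, limit is {}", len, max)
            }
            InferError::OutputTooLarge { len, max } => {
                write!(f, "output is {} bytes, limit is {}", len, max)
            }
            InferError::LeaseExpired => write!(f, "GPU lease expired"),
            InferError::Disconnected => write!(f, "fabric/runtime unreachable"),
        }
    }
}

impl std::error::Error for InferError {}

/// Stand-in for the component that uploads input and launches kernels.
pub trait Backend {
    fn run(&mut self, input: &[u8]) -> Result<Vec<u8>, InferError>;
}

/// Trivial backend used only to exercise the wrapper.
pub struct EchoBackend;

impl Backend for EchoBackend {
    fn run(&mut self, input: &[u8]) -> Result<Vec<u8>, InferError> {
        Ok(input.to_vec())
    }
}

/// Check the input bound, run the backend, then check the output bound.
pub fn handle_request<B: Backend>(backend: &mut B, input: &[u8]) -> Result<Vec<u8>, InferError> {
    if input.len() > MAX_INPUT_BYTES {
        return Err(InferError::InputTooLarge { len: input.len(), max: MAX_INPUT_BYTES });
    }
    let out = backend.run(input)?;
    if out.len() > MAX_OUTPUT_BYTES {
        return Err(InferError::OutputTooLarge { len: out.len(), max: MAX_OUTPUT_BYTES });
    }
    Ok(out)
}
```

Checking the bounds on the Rust side even when the host enforces its own limits lets oversized requests fail before any GPU work is submitted.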

Walkthrough (Implementation Sketch)

1. Acquire GPU Lease

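use grafos_std::gpu::GpuBuilder;

// Request at least 2 GiB of VRAM, held under a 300-second lease.
// The `?` assumes the enclosing function returns a compatible Result.
let gpu = GpuBuilder::new()
    .min_vram(2 * 1024 * 1024 * 1024) // bytes
    .lease_secs(300)
    .acquire()?;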

2. Submit a Kernel

FabricGpu provides submit(kernel_name, binary); you configure grid/block and pass argument bytes:

// Assuming CUDA-style semantics: 256 blocks of 64 threads, 16,384 threads total.
let res = gpu.gpu()
    .submit("infer", kernel_binary)
    .grid([256, 1, 1])
    .block([64, 1, 1])
    .arg(input_bytes)
    .launch()?;
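The grid size in the call above is hard-coded. If the kernel processes one element per thread (an assumption; the recipe does not say how the "infer" kernel indexes its data), the grid can be derived from the element count with a ceiling division. The helper grid_for below is hypothetical and not part of any grafos API.

```rust
/// Hypothetical helper: one-dimensional grid covering `n_elems` elements
/// with `block_x` threads per block, assuming one element per thread.
pub fn grid_for(n_elems: u32, block_x: u32) -> [u32; 3] {
    assert!(block_x > 0, "block size must be non-zero");
    // Ceiling division written without `n + b - 1`, which could overflow.
    let blocks = n_elems / block_x + u32::from(n_elems % block_x != 0);
    // Always launch at least one block so the launch dimensions stay valid.
    [blocks.max(1), 1, 1]
}
```

With 16,384 elements and 64 threads per block this yields the [256, 1, 1] used in the call above. When the element count is not a multiple of the block size, the last block has spare threads, so the kernel needs its own bounds check.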

3. Serve Requests via RPC

The RPC hot path is shared memory. A colocated client/server can exchange requests without TCP.

You can structure:

a server loop that watches for REQUEST_READY
an RPC handler that calls gpu.gpu().submit(...).launch()
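To make the two pieces above concrete, here is a minimal single-slot sketch of the handshake. It is not the grafos_rpc implementation: Slot, IDLE, and RESPONSE_READY are hypothetical names (only REQUEST_READY appears in this recipe), a Mutex-protected Vec stands in for the leased shared-memory region, and the sketch assumes one client and one server.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

pub const IDLE: u8 = 0;
pub const REQUEST_READY: u8 = 1;
pub const RESPONSE_READY: u8 = 2;

/// Hypothetical single-slot mailbox standing in for a shared-memory region.
pub struct Slot {
    state: AtomicU8,
    buf: Mutex<Vec<u8>>,
}

impl Slot {
    pub fn new() -> Self {
        Slot { state: AtomicU8::new(IDLE), buf: Mutex::new(Vec::new()) }
    }

    /// Client side: publish a request. Returns false if the slot is busy.
    pub fn post_request(&self, req: &[u8]) -> bool {
        if self.state.load(Ordering::Acquire) != IDLE {
            return false;
        }
        *self.buf.lock().unwrap() = req.to_vec();
        self.state.store(REQUEST_READY, Ordering::Release);
        true
    }

    /// Server side: one poll. Runs `handler` if a request is waiting.
    pub fn serve_once<F: FnMut(&[u8]) -> Vec<u8>>(&self, mut handler: F) -> bool {
        if self.state.load(Ordering::Acquire) != REQUEST_READY {
            return false;
        }
        let mut buf = self.buf.lock().unwrap();
        // In the real service this is where the kernel launch would happen.
        let resp = handler(buf.as_slice());
        *buf = resp;
        drop(buf);
        self.state.store(RESPONSE_READY, Ordering::Release);
        true
    }

    /// Client side: collect the response and free the slot.
    pub fn take_response(&self) -> Option<Vec<u8>> {
        if self.state.load(Ordering::Acquire) != RESPONSE_READY {
            return None;
        }
        let resp = std::mem::take(&mut *self.buf.lock().unwrap());
        self.state.store(IDLE, Ordering::Release);
        Some(resp)
    }
}
```

A server loop would call serve_once repeatedly, ideally with a yield or park between empty polls, and pass a handler that performs the GPU submission.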

The cookbook-level point: “inference service” is mostly plumbing; the novel part is the lifecycle and isolation.

Failure Modes

LeaseExpired: model instance loses GPU; return a clear error to client and optionally reacquire.
Disconnected: fabric/runtime unreachable; treat as transient and retry/backoff.
Resource pressure: acquiring VRAM fails; degrade (serve smaller models) or queue.
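One way to keep these three cases distinct in code is to map each failure to an explicit recovery action. The sketch below is illustrative only: LeaseError, Action, plan, and backoff are hypothetical names, and the backoff constants (50 ms base, 5 s cap) are arbitrary choices rather than recommended values.

```rust
use std::time::Duration;

/// Hypothetical error classes mirroring the failure modes listed above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum LeaseError {
    LeaseExpired,
    Disconnected,
    ResourcePressure,
}

/// Hypothetical recovery actions a model instance can take.
#[derive(Debug, PartialEq)]
pub enum Action {
    Reacquire,
    RetryAfter(Duration),
    QueueOrDegrade,
}

/// Exponential backoff: 50 ms doubled per attempt, capped at 5 s.
pub fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 50;
    let cap_ms: u64 = 5_000;
    // Clamp the shift so the multiplication cannot overflow.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(cap_ms))
}

/// Decide what to do after a failure on the given (zero-based) attempt.
pub fn plan(err: LeaseError, attempt: u32) -> Action {
    match err {
        LeaseError::LeaseExpired => Action::Reacquire,
        LeaseError::Disconnected => Action::RetryAfter(backoff(attempt)),
        LeaseError::ResourcePressure => Action::QueueOrDegrade,
    }
}
```

A production version would likely add jitter to the delay so that many instances recovering from the same outage do not retry in lockstep.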

Observability

Track:

gpu_lease_seconds per model
kernel_launch_total, kernel_fail_total
inference_latency_ms histogram
bytes_in / bytes_out
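The counters and histogram above can be sketched with plain atomics. This does not use grafos_observe, whose API is not shown in this recipe: ModelMetrics, BUCKETS_MS, and record are hypothetical, the bucket boundaries are arbitrary, and gpu_lease_seconds is left out because it depends on how the lease reports its lifetime.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Upper bounds (in ms) of the latency buckets; one extra bucket catches overflow.
pub const BUCKETS_MS: [u64; 5] = [1, 5, 25, 100, 500];

/// Hypothetical per-model metrics, safe to update from several threads.
#[derive(Default)]
pub struct ModelMetrics {
    pub kernel_launch_total: AtomicU64,
    pub kernel_fail_total: AtomicU64,
    pub bytes_in: AtomicU64,
    pub bytes_out: AtomicU64,
    /// Index i counts latencies <= BUCKETS_MS[i]; index 5 counts the rest.
    pub latency_buckets: [AtomicU64; 6],
}

impl ModelMetrics {
    /// Record one inference attempt.
    pub fn record(&self, ok: bool, latency_ms: u64, bytes_in: u64, bytes_out: u64) {
        self.kernel_launch_total.fetch_add(1, Ordering::Relaxed);
        if !ok {
            self.kernel_fail_total.fetch_add(1, Ordering::Relaxed);
        }
        self.bytes_in.fetch_add(bytes_in, Ordering::Relaxed);
        self.bytes_out.fetch_add(bytes_out, Ordering::Relaxed);
        // First bucket whose upper bound covers the latency, else the overflow bucket.
        let idx = BUCKETS_MS
            .iter()
            .position(|&bound| latency_ms <= bound)
            .unwrap_or(BUCKETS_MS.len());
        self.latency_buckets[idx].fetch_add(1, Ordering::Relaxed);
    }
}
```

Relaxed ordering is enough here because each counter is independent and readers only need eventually consistent totals.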

Variations

Multi-tenant server: multiple model leases in one process, routing requests by model id.
Burst models: models that lease VRAM only while warm, drop when idle.
Batching: combine requests into a single kernel launch.
Admission control: reject if TTL remaining is too small to finish inference safely.
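The admission-control variation reduces to a comparison between the remaining TTL and an estimate of the work. The sketch below assumes the service tracks the lease expiry itself; LeaseClock is a hypothetical helper, and the recipe does not say whether GpuLease exposes its remaining TTL directly.

```rust
use std::time::{Duration, Instant};

/// Hypothetical local view of when a lease expires.
pub struct LeaseClock {
    expires_at: Instant,
}

impl LeaseClock {
    /// Start tracking a lease acquired at `now` with the given TTL.
    pub fn new(now: Instant, ttl: Duration) -> Self {
        LeaseClock { expires_at: now + ttl }
    }

    /// Time left on the lease; zero once it has expired.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// Admit a request only if the estimated run time plus a safety margin
    /// fits inside the remaining TTL.
    pub fn admit(&self, now: Instant, estimated: Duration, margin: Duration) -> bool {
        self.remaining(now) >= estimated + margin
    }
}
```

Passing `now` in explicitly keeps the check deterministic under test. The estimate could come from the inference_latency_ms histogram, for example a high percentile for the model in question.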