Recipe 5: An Inference Server That Shares GPUs Without Containers
Situation
You have multiple models or services that need GPU access. Traditional approaches to sharing a GPU have familiar problems:
- Whole-device allocation wastes VRAM and compute.
- Container-per-model isolates processes but does not automatically solve VRAM fragmentation and scheduling.
- GPU scheduling stacks are complex and operationally heavy.
In a lease-based model, you aim for:
- VRAM scoped to a lease.
- Session lifetime scoped to a lease.
- Fast teardown on drop or expiry.
The goal of this recipe is to show the pattern: GPU access becomes a leased resource, and inference becomes “acquire -> run -> drop”, with observability.
What You Build
A simple inference service with:
- Per-model GPU lease acquisition.
- A request/response interface (RPC) that uses leased memory as its hot path.
- A teardown model that returns VRAM capacity immediately when the lease is dropped.
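The "capacity returns on drop" idea maps naturally onto Rust's `Drop` trait. The following is a minimal sketch, not the grafos implementation: `VramPool` and `VramLease` are hypothetical names invented for illustration, and the pool only does byte accounting rather than touching a real device.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical VRAM accounting pool; stands in for the runtime's allocator.
#[derive(Debug)]
pub struct VramPool {
    free_bytes: Mutex<u64>,
}

/// Hypothetical lease handle: holds `bytes` of VRAM until it is dropped.
#[derive(Debug)]
pub struct VramLease {
    pool: Arc<VramPool>,
    bytes: u64,
}

impl VramPool {
    pub fn new(total_bytes: u64) -> Arc<Self> {
        Arc::new(VramPool { free_bytes: Mutex::new(total_bytes) })
    }

    /// Bytes currently available for new leases.
    pub fn free(&self) -> u64 {
        *self.free_bytes.lock().unwrap()
    }

    /// Reserve `bytes`, or return `None` if the pool cannot satisfy the request.
    pub fn acquire(pool: &Arc<Self>, bytes: u64) -> Option<VramLease> {
        let mut free = pool.free_bytes.lock().unwrap();
        if *free < bytes {
            return None;
        }
        *free -= bytes;
        Some(VramLease { pool: Arc::clone(pool), bytes })
    }
}

impl Drop for VramLease {
    fn drop(&mut self) {
        // Capacity is returned as soon as the lease goes out of scope.
        *self.pool.free_bytes.lock().unwrap() += self.bytes;
    }
}
```

Because the release happens in `drop`, a model instance that panics or returns early still gives its capacity back without an explicit cleanup call.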
Building Blocks
- grafos_std::gpu::GpuBuilder and GpuLease
- grafos_rpc::{RpcServer, RpcClient} for lease-backed RPC
- grafos_tensor (if you model tensors explicitly; optional for the conceptual pattern)
- grafos_observe for metering
See:
- GPU leasing and submission API (source)
- lease-backed RPC implementation (source)
- grafos-rpc guide
- grafos-std README
- grafos-rpc README
Design
Resource Model
Per model instance:
- Acquire a GPU lease with min_vram(bytes) and an appropriate TTL.
- Keep the lease handle for the lifetime of that model instance.
On request:
- Deserialize input.
- Upload input to GPU (or treat input as bytes passed to a kernel).
- Launch kernel(s).
- Return result bytes.
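The per-request steps can be sketched as a small pipeline. Everything here is an assumption made for illustration: the `KernelRunner` trait, the `InferError` type, the length-prefixed wire format, and the `FakeGpu` CPU stand-in are hypothetical and are not part of grafos. In a real service, the trait would be implemented on top of the leased GPU handle.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum InferError {
    Malformed,
    Gpu(String),
}

/// Hypothetical abstraction over "upload input, launch kernel, read result".
pub trait KernelRunner {
    fn run(&self, kernel: &str, input: &[u8]) -> Result<Vec<u8>, InferError>;
}

/// Step 1: deserialize. Assumed wire format: u32 little-endian length, then payload.
pub fn decode_request(frame: &[u8]) -> Result<&[u8], InferError> {
    if frame.len() < 4 {
        return Err(InferError::Malformed);
    }
    let (head, body) = frame.split_at(4);
    let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    if body.len() != len {
        return Err(InferError::Malformed);
    }
    Ok(body)
}

/// Steps 2-4: hand the payload to the GPU and return the result bytes.
pub fn handle_request<R: KernelRunner>(gpu: &R, frame: &[u8]) -> Result<Vec<u8>, InferError> {
    let input = decode_request(frame)?;
    gpu.run("infer", input)
}

/// CPU stand-in used only to exercise the pipeline: doubles every byte.
pub struct FakeGpu;

impl KernelRunner for FakeGpu {
    fn run(&self, _kernel: &str, input: &[u8]) -> Result<Vec<u8>, InferError> {
        Ok(input.iter().map(|b| b.wrapping_mul(2)).collect())
    }
}
```

Keeping the GPU behind a trait lets the request path be tested on a machine without a device.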
Isolation and Safety
This recipe assumes the host/runtime enforces:
- Lease scoping of GPU submission APIs.
- Fuel/time limits on kernels (or a watchdog).
- Output size limits.
The Rust side should still implement:
- Input size bounds.
- Output size bounds.
- Explicit error paths for expired/disconnected leases.
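One way to keep these checks in a single place is a guard function that wraps the kernel launch. This is a sketch under stated assumptions: `ServeError`, `Limits`, and `guarded_infer` are hypothetical names, and the `launch` closure stands in for whatever call actually submits work to the leased GPU. The `LeaseExpired` and `Disconnected` variants mirror the failure names used in this recipe.

```rust
/// Hypothetical error type covering the bounds and lease failures.
#[derive(Debug, PartialEq)]
pub enum ServeError {
    InputTooLarge { len: usize, max: usize },
    OutputTooLarge { len: usize, max: usize },
    LeaseExpired,
    Disconnected,
}

/// Hypothetical per-model size limits, in bytes.
pub struct Limits {
    pub max_input: usize,
    pub max_output: usize,
}

/// Check the input bound, run `launch`, then check the output bound.
/// Errors from `launch` (for example an expired lease) pass through unchanged.
pub fn guarded_infer<F>(limits: &Limits, input: &[u8], launch: F) -> Result<Vec<u8>, ServeError>
where
    F: FnOnce(&[u8]) -> Result<Vec<u8>, ServeError>,
{
    if input.len() > limits.max_input {
        return Err(ServeError::InputTooLarge { len: input.len(), max: limits.max_input });
    }
    let out = launch(input)?;
    if out.len() > limits.max_output {
        return Err(ServeError::OutputTooLarge { len: out.len(), max: limits.max_output });
    }
    Ok(out)
}
```

The output check is deliberately repeated on the Rust side even if the host enforces its own limit, so a misbehaving kernel cannot push an oversized response to the client.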
Walkthrough (Implementation Sketch)
1. Acquire GPU Lease
    use grafos_std::gpu::GpuBuilder;

    // Request at least 2 GiB of VRAM with a 300-second lease.
    let gpu = GpuBuilder::new()
        .min_vram(2 * 1024 * 1024 * 1024)
        .lease_secs(300)
        .acquire()?;

2. Submit a Kernel
FabricGpu provides submit(kernel_name, binary); you configure grid/block and pass argument bytes:
    let res = gpu.gpu()
        .submit("infer", kernel_binary)
        .grid([256, 1, 1])
        .block([64, 1, 1])
        .arg(input_bytes)
        .launch()?;

3. Serve Requests via RPC
The RPC hot path is shared memory. A colocated client/server can exchange requests without TCP.
You can structure:
- a server loop that watches for REQUEST_READY
- an RPC handler that does gpu.submit(...).launch()
The cookbook-level point: “inference service” is mostly plumbing; the novel part is the lifecycle and isolation.
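To make the server-loop shape concrete, here is an in-process sketch of a mailbox that moves through REQUEST_READY and back. It is not the grafos-rpc implementation: `Slot`, `poll_once`, the `IDLE` and `RESPONSE_READY` states, and the use of a `Mutex<Vec<u8>>` in place of real shared memory are all assumptions made for illustration. The handler closure is where a kernel launch would go.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

pub const IDLE: u32 = 0;
pub const REQUEST_READY: u32 = 1;
pub const RESPONSE_READY: u32 = 2;

/// In-process stand-in for a shared-memory mailbox (hypothetical).
pub struct Slot {
    pub state: AtomicU32,
    pub buf: Mutex<Vec<u8>>,
}

impl Slot {
    pub fn new() -> Self {
        Slot { state: AtomicU32::new(IDLE), buf: Mutex::new(Vec::new()) }
    }

    /// Client side: write the request, then publish it.
    pub fn post_request(&self, bytes: &[u8]) {
        *self.buf.lock().unwrap() = bytes.to_vec();
        self.state.store(REQUEST_READY, Ordering::Release);
    }

    /// Client side: take the response if one is ready, resetting the slot to IDLE.
    pub fn take_response(&self) -> Option<Vec<u8>> {
        if self
            .state
            .compare_exchange(RESPONSE_READY, IDLE, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return None;
        }
        Some(std::mem::take(&mut *self.buf.lock().unwrap()))
    }
}

/// Server side: serve at most one pending request; returns true if one was served.
pub fn poll_once<H>(slot: &Slot, handler: H) -> bool
where
    H: FnOnce(&[u8]) -> Vec<u8>,
{
    if slot.state.load(Ordering::Acquire) != REQUEST_READY {
        return false;
    }
    let mut buf = slot.buf.lock().unwrap();
    let response = handler(buf.as_slice());
    *buf = response;
    drop(buf);
    slot.state.store(RESPONSE_READY, Ordering::Release);
    true
}
```

A server loop would call `poll_once` repeatedly, passing a handler that submits the request bytes to the leased GPU and returns the result bytes.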
Failure Modes
- LeaseExpired: model instance loses GPU; return a clear error to client and optionally reacquire.
- Disconnected: fabric/runtime unreachable; treat as transient and retry/backoff.
- Resource pressure: acquiring VRAM fails; degrade (serve smaller models) or queue.
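These failure modes can be turned into an explicit policy function, which keeps retry behavior testable and out of the request handler. The sketch below is illustrative only: `GpuFailure`, `Action`, `next_action`, and the backoff constants (50 ms base, 5 s cap) are assumptions, not grafos types or recommended values.

```rust
use std::time::Duration;

/// Hypothetical classification of the failures listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GpuFailure {
    LeaseExpired,
    Disconnected,
    OutOfVram,
}

/// What the service should do next.
#[derive(Debug, PartialEq)]
pub enum Action {
    ReacquireLease,
    RetryAfter(Duration),
    /// Queue the request or fall back to a smaller model.
    Degrade,
    GiveUp,
}

/// Map a failure and the zero-based attempt count to the next action.
pub fn next_action(failure: GpuFailure, attempt: u32, max_attempts: u32) -> Action {
    match failure {
        GpuFailure::LeaseExpired => Action::ReacquireLease,
        GpuFailure::OutOfVram => Action::Degrade,
        GpuFailure::Disconnected if attempt >= max_attempts => Action::GiveUp,
        GpuFailure::Disconnected => {
            // Exponential backoff: 50 ms, 100 ms, 200 ms, ... capped at 5 s.
            let shift = attempt.min(16);
            let ms = (50u64 << shift).min(5_000);
            Action::RetryAfter(Duration::from_millis(ms))
        }
    }
}
```

Whatever the policy returns, the client should still receive a clear error for the request that failed; reacquiring a lease only helps subsequent requests.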
Observability
Track:
- gpu_lease_seconds per model
- kernel_launch_total, kernel_fail_total
- inference_latency_ms histogram
- bytes_in / bytes_out
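As a rough illustration of how these metrics might be held per model, the sketch below uses plain atomics from the standard library instead of grafos_observe, whose API is not shown here. `ModelMetrics`, `record`, and the bucket boundaries are hypothetical; `gpu_lease_seconds` is left as a plain counter for the lease-owning code to update.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Upper bounds (ms) of the latency buckets; one extra bucket catches overflow.
const BUCKETS_MS: [u64; 5] = [1, 5, 25, 100, 500];

/// Hypothetical per-model metrics holder.
#[derive(Default)]
pub struct ModelMetrics {
    pub gpu_lease_seconds: AtomicU64,
    pub kernel_launch_total: AtomicU64,
    pub kernel_fail_total: AtomicU64,
    pub bytes_in: AtomicU64,
    pub bytes_out: AtomicU64,
    latency_buckets: [AtomicU64; 6],
}

impl ModelMetrics {
    /// Record one kernel launch and its outcome.
    pub fn record(&self, latency_ms: u64, bytes_in: u64, bytes_out: u64, ok: bool) {
        self.kernel_launch_total.fetch_add(1, Ordering::Relaxed);
        if !ok {
            self.kernel_fail_total.fetch_add(1, Ordering::Relaxed);
        }
        self.bytes_in.fetch_add(bytes_in, Ordering::Relaxed);
        self.bytes_out.fetch_add(bytes_out, Ordering::Relaxed);
        // First bucket whose upper bound covers the latency, else the overflow bucket.
        let idx = BUCKETS_MS
            .iter()
            .position(|&bound| latency_ms <= bound)
            .unwrap_or(BUCKETS_MS.len());
        self.latency_buckets[idx].fetch_add(1, Ordering::Relaxed);
    }

    /// Snapshot of the inference_latency_ms histogram counts.
    pub fn bucket_counts(&self) -> Vec<u64> {
        self.latency_buckets
            .iter()
            .map(|b| b.load(Ordering::Relaxed))
            .collect()
    }
}
```

Relaxed ordering is enough here because each counter is independent and is only read for reporting.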
Variations
- Multi-tenant server: multiple model leases in one process, routing requests by model id.
- Burst models: models that lease VRAM only while warm, drop when idle.
- Batching: combine requests into a single kernel launch.
- Admission control: reject if TTL remaining is too small to finish inference safely.
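The admission-control variation reduces to a comparison between the lease's remaining TTL and the expected inference time plus a safety margin. A minimal sketch, assuming the service tracks the lease expiry itself: `LeaseClock` and its methods are hypothetical, and a real lease handle may expose its remaining TTL directly.

```rust
use std::time::{Duration, Instant};

/// Hypothetical local record of when a lease expires.
pub struct LeaseClock {
    expires_at: Instant,
}

impl LeaseClock {
    pub fn new(acquired_at: Instant, ttl: Duration) -> Self {
        LeaseClock { expires_at: acquired_at + ttl }
    }

    /// Time left on the lease; zero once it has expired.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// Admit a request only if it is expected to finish, with margin, before expiry.
    pub fn admit(&self, now: Instant, expected: Duration, margin: Duration) -> bool {
        self.remaining(now) >= expected + margin
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, keeps the check deterministic in tests. The `expected` duration could come from the latency histogram, for example a high percentile for that model.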