Recipe 18: Borrowed GPU Studio

Situation

Many users need GPUs only intermittently. Dedicated GPU instances sit idle between bursts of work and waste capacity.

You want:

fast session start
strong teardown
fair sharing

This recipe complements Recipe 11 (single-user, cell-level preemption) by focusing on multi-tenant session management: admission control, renewal policy, and metering.

What You Build

A multi-user GPU session service:

each session acquires a GPU lease sized to its needs
session lifetime is the lease lifetime (renew on activity)
sessions checkpoint state for resume
usage metering per session

Building Blocks

GpuBuilder leases
Durable checkpoints (optional)
grafos_observe metering

Design

Fairness

Use a short TTL by default; renew only for active sessions.

Admission Control

If capacity is full, either:

queue the request
or offer a smaller VRAM lease
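
The two options above can be sketched as a single decision function. This is a minimal illustration only: `Admission`, `admit`, and the `min_useful` threshold are hypothetical names invented for this sketch and are not part of the grafOS API.

```rust
/// Outcome of an admission decision (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Enough VRAM is free: grant the full request.
    Admit { vram_bytes: u64 },
    /// The full request does not fit, but a smaller lease is worth offering.
    OfferSmaller { vram_bytes: u64 },
    /// Too little is free to be useful: the request waits in the queue.
    Queue,
}

/// Decide how to handle a request for `requested` bytes of VRAM when `free`
/// bytes are currently unleased. `min_useful` is an assumed policy knob: the
/// smallest lease the service is willing to offer as a downgrade.
pub fn admit(requested: u64, free: u64, min_useful: u64) -> Admission {
    if requested <= free {
        Admission::Admit { vram_bytes: requested }
    } else if free >= min_useful {
        Admission::OfferSmaller { vram_bytes: free }
    } else {
        Admission::Queue
    }
}
```

Whether a downgraded offer is accepted is left to the user; a service might queue the request if the smaller lease is declined.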

Session State Model

Each session should have a minimal durable record:

session_id
user_id (or tenant id)
requested VRAM and TTL policy
checkpoint locator (optional)
last activity timestamp

If the session service restarts, it can rebuild in-memory tracking from this record.
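
As a rough sketch, the record and the restart path might look like the following. The struct layout, the field types, and the `rebuild_tracking` helper are assumptions made for illustration; in particular, dropping records that have been idle longer than `idle_limit_secs` is one possible policy, not something grafOS prescribes.

```rust
use std::collections::HashMap;

/// Minimal durable record for one session (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionRecord {
    pub session_id: String,
    pub tenant_id: String,
    pub requested_vram_bytes: u64,
    pub lease_ttl_secs: u64,
    /// Where the latest checkpoint lives, if the session has one.
    pub checkpoint_locator: Option<String>,
    pub last_activity_unix_secs: u64,
}

/// Rebuild the in-memory index after a service restart.
/// Records idle for longer than `idle_limit_secs` are skipped, on the
/// assumption that their leases have already expired.
pub fn rebuild_tracking(
    records: Vec<SessionRecord>,
    now_unix_secs: u64,
    idle_limit_secs: u64,
) -> HashMap<String, SessionRecord> {
    records
        .into_iter()
        .filter(|r| now_unix_secs.saturating_sub(r.last_activity_unix_secs) <= idle_limit_secs)
        .map(|r| (r.session_id.clone(), r))
        .collect()
}
```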

Renewal Policy

Renewal is the “heartbeat” of activity:

renew only when the session performs work
renew before TTL expiry, once 60-80% of the TTL has elapsed
cap total continuous renewal time if you need fairness (e.g. max 30 minutes without releasing)

If renewal fails, treat it as an explicit preemption and surface that to the user.
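
The three rules can be combined into one decision function. The sketch below is independent of `grafos_leasekit`: `LeaseClock`, `RenewalDecision`, and `decide_renewal` are hypothetical names, the renewal window is fixed at the 60% mark, and "performs work" is approximated as "any activity since the current lease term began".

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RenewalDecision {
    /// Too early to renew, or the session has been idle this term.
    Wait,
    /// Inside the renewal window and the session did work: renew now.
    Renew,
    /// The fairness cap was reached: release the GPU instead of renewing.
    Release,
}

/// Timestamps (Unix seconds) the service tracks per lease (hypothetical).
pub struct LeaseClock {
    pub ttl_secs: u64,
    /// Start of the current lease term (acquire or last renewal).
    pub granted_at: u64,
    /// When the session first acquired the GPU without releasing it.
    pub held_since: u64,
    pub last_activity: u64,
}

pub fn decide_renewal(c: &LeaseClock, now: u64, max_hold_secs: u64) -> RenewalDecision {
    // Fairness cap on total continuous hold time.
    if now.saturating_sub(c.held_since) >= max_hold_secs {
        return RenewalDecision::Release;
    }
    let elapsed = now.saturating_sub(c.granted_at);
    // Integer form of `elapsed >= 0.6 * ttl`.
    let in_window = elapsed * 10 >= c.ttl_secs * 6;
    // Renew only if the session did something during this term.
    let active = c.last_activity >= c.granted_at;
    if in_window && active {
        RenewalDecision::Renew
    } else {
        RenewalDecision::Wait
    }
}
```

An idle session never reaches `Renew`, so its lease simply runs out at TTL expiry.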

Walkthrough

Core grafOS API Path

The session is a GPU lease plus optional checkpoint storage and renewal tracking:

use grafos_collections::durable::Durable;
use grafos_leasekit::{RenewalManager, RenewalPolicy};
use grafos_std::block::BlockBuilder;
use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass, GpuSession};

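// Acquire a GPU lease: at least 16 GiB of VRAM, 5-minute TTL,
// exclusive to this session.
let lease = GpuBuilder::new()
    .min_vram(16 * 1024 * 1024 * 1024)
    .lease_secs(300)
    .exclusivity(GpuExclusivityClass::SessionExclusive)
    .acquire()?;

// Register the lease so its renewal deadline is tracked.
let mut renewals = RenewalManager::new();
renewals.register(
    lease.lease_id(),
    lease.expires_at_unix_secs(),
    RenewalPolicy::default(),
);

// Optional: lease block storage (1-hour TTL) and checkpoint the session state.
let ckpt_lease = BlockBuilder::new().min_blocks(512).lease_secs(3600).acquire()?;
let model_state = vec![0u8; 1024]; // placeholder state
let checkpoint = Durable::new(model_state, ckpt_lease);
checkpoint.checkpoint()?;

// Open a session on the leased GPU and read the current lease status.
let session = GpuSession::new(&lease);
let status = lease.status();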
# let _ = (renewals, checkpoint, session, status);
# Ok::<(), grafos_std::FabricError>(())

The service renews only while the user is active. If renewal stops or the lease is revoked, the session ends and the user restores from the checkpoint after acquiring a new lease.

user requests session
acquire GPU lease
serve interactive infer/render
renew on activity
drop on idle/expiry

Failure Modes

LeaseExpired: session ended; user must reacquire and restore from checkpoint.
Disconnected: transient fabric failure; retry or fail with a clear message.
Overcommit: admission control bug; fix with hard caps and backpressure.
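
One way to keep these responses consistent is to map each failure to a recovery action in a single place. The sketch below uses a local, hypothetical `SessionFailure` enum that only mirrors the three names above; it is not the `grafos_std::FabricError` type, and the retry budget is an assumed parameter.

```rust
/// Failure classes from the list above (local stand-in, not a grafOS type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionFailure {
    LeaseExpired,
    Disconnected,
    Overcommit,
}

/// What the session service does next (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    /// End the session; the user reacquires a lease and restores from checkpoint.
    ReacquireAndRestore,
    /// Try the fabric operation again with a reduced retry budget.
    Retry { attempts_left: u32 },
    /// Give up and show the user a clear message.
    FailWithMessage(&'static str),
    /// Refuse new work until capacity accounting is back under the hard cap.
    RejectAndBackpressure,
}

pub fn recover(failure: SessionFailure, attempts_left: u32) -> Recovery {
    match failure {
        SessionFailure::LeaseExpired => Recovery::ReacquireAndRestore,
        SessionFailure::Disconnected if attempts_left > 0 => Recovery::Retry {
            attempts_left: attempts_left - 1,
        },
        SessionFailure::Disconnected => {
            Recovery::FailWithMessage("GPU fabric unreachable; please try again later")
        }
        SessionFailure::Overcommit => Recovery::RejectAndBackpressure,
    }
}
```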

Observability

Meter per session:

VRAM leased
GPU-seconds consumed
kernel launches and errors
checkpoint bytes written

Expose per-tenant aggregates to detect noisy neighbors.
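
A per-tenant rollup could be as simple as the sketch below. It does not use `grafos_observe`; the `Usage` struct, `aggregate_by_tenant`, and `noisy_tenants` are hypothetical names, and "noisy" is assumed to mean "consumes more than a given percentage of all GPU-seconds".

```rust
use std::collections::BTreeMap;

/// Per-session meter readings (hypothetical layout).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Usage {
    pub vram_bytes_leased: u64,
    pub gpu_seconds: u64,
    pub kernel_launches: u64,
    pub kernel_errors: u64,
    pub checkpoint_bytes_written: u64,
}

/// Sum per-session usage into per-tenant totals.
pub fn aggregate_by_tenant(samples: &[(&str, Usage)]) -> BTreeMap<String, Usage> {
    let mut totals: BTreeMap<String, Usage> = BTreeMap::new();
    for (tenant, u) in samples {
        let t = totals.entry(tenant.to_string()).or_default();
        t.vram_bytes_leased += u.vram_bytes_leased;
        t.gpu_seconds += u.gpu_seconds;
        t.kernel_launches += u.kernel_launches;
        t.kernel_errors += u.kernel_errors;
        t.checkpoint_bytes_written += u.checkpoint_bytes_written;
    }
    totals
}

/// Tenants whose share of all GPU-seconds exceeds `max_share_percent`.
pub fn noisy_tenants(totals: &BTreeMap<String, Usage>, max_share_percent: u64) -> Vec<String> {
    let all: u64 = totals.values().map(|u| u.gpu_seconds).sum();
    totals
        .iter()
        .filter(|(_, u)| all > 0 && u.gpu_seconds * 100 > all * max_share_percent)
        .map(|(tenant, _)| tenant.clone())
        .collect()
}
```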

Variations

priority tiers (interactive > batch)
“spot sessions” that may be preempted aggressively
multi-GPU sessions for large jobs
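
As an illustration of the first variation, priority tiers can be layered on the admission queue with a heap ordered by tier and then by arrival order. `Tier` and `AdmissionQueue` are hypothetical names for this sketch, which uses only the Rust standard library.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Variant order matters: the derived `Ord` makes Interactive > Batch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Batch,
    Interactive,
}

/// Waiting sessions, served highest tier first and FIFO within a tier.
pub struct AdmissionQueue {
    // Max-heap on (tier, Reverse(sequence)): a lower sequence number wins ties.
    heap: BinaryHeap<(Tier, Reverse<u64>, String)>,
    next_seq: u64,
}

impl AdmissionQueue {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    pub fn push(&mut self, tier: Tier, session_id: &str) {
        self.heap.push((tier, Reverse(self.next_seq), session_id.to_string()));
        self.next_seq += 1;
    }

    /// Next session to admit when capacity frees up, if any.
    pub fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, id)| id)
    }
}
```

Strict tiering like this can starve batch work under sustained interactive load; an aging rule or a per-tier quota is a common way to soften it.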