Recipe 18: Borrowed GPU Studio
Situation
Many users need GPUs intermittently. Dedicated GPU instances waste capacity.
You want:
- fast session start
- strong teardown
- fair sharing
This recipe complements Recipe 11 (single-user, cell-level preemption) by focusing on multi-tenant session management: admission control, renewal policy, and metering.
What You Build
A multi-user GPU session service:
- each session acquires a GPU lease sized to its needs
- session lifetime is the lease lifetime (renew on activity)
- sessions checkpoint state for resume
- usage metering per session
Building Blocks
- GpuBuilder leases
- Durable checkpoints (optional)
- grafos_observe metering
Design
Fairness
Use a short TTL by default and renew only for active sessions, so an idle session returns its GPU to the pool when its lease expires.
Admission Control
If capacity is full:
- queue
- or offer a smaller VRAM lease
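The two options above can be sketched as a small admission controller. This is a minimal illustration using only the Rust standard library; `AdmissionController`, `Admission`, and the `min_useful_vram` threshold are hypothetical names for this sketch, not part of the grafOS API.

```rust
use std::collections::VecDeque;

/// Outcome of an admission decision (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// Enough free VRAM: the full request is reserved.
    Granted { vram_bytes: u64 },
    /// The full request does not fit, but a smaller lease would.
    OfferSmaller { vram_bytes: u64 },
    /// Nothing useful fits: the session waits in the queue.
    Queued { position: usize },
}

/// Tracks free VRAM against a hard cap and queues requests that do not fit.
pub struct AdmissionController {
    free_vram: u64,
    min_useful_vram: u64,    // assumed floor below which an offer is pointless
    queue: VecDeque<String>, // waiting session ids, FIFO
}

impl AdmissionController {
    pub fn new(total_vram: u64, min_useful_vram: u64) -> Self {
        Self { free_vram: total_vram, min_useful_vram, queue: VecDeque::new() }
    }

    pub fn admit(&mut self, session_id: &str, requested_vram: u64) -> Admission {
        if requested_vram <= self.free_vram {
            self.free_vram -= requested_vram;
            Admission::Granted { vram_bytes: requested_vram }
        } else if self.free_vram >= self.min_useful_vram {
            // Offer only: nothing is reserved until the caller re-requests.
            Admission::OfferSmaller { vram_bytes: self.free_vram }
        } else {
            self.queue.push_back(session_id.to_string());
            Admission::Queued { position: self.queue.len() }
        }
    }

    /// Return VRAM to the pool and hand back the next waiting session, if any.
    pub fn release(&mut self, vram_bytes: u64) -> Option<String> {
        self.free_vram += vram_bytes;
        self.queue.pop_front()
    }
}
```

Because `free_vram` never goes below zero, the hard cap holds by construction. The sketch does not stop a new arrival from overtaking a queued session; a stricter policy would queue new requests whenever the queue is non-empty.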
Session State Model
Each session should have a minimal durable record:
- session_id
- user_id (or tenant id)
- requested VRAM and TTL policy
- checkpoint locator (optional)
- last activity timestamp
If the session service restarts, it can rebuild in-memory tracking from this record.
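One possible shape for this record, and for the rebuild step after a restart, is sketched below. The field names and types are illustrative, and dropping sessions that have been idle longer than their TTL during the rebuild is an assumption of this sketch rather than something the recipe prescribes.

```rust
use std::collections::HashMap;

/// Minimal durable record for one session (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct SessionRecord {
    pub session_id: String,
    pub user_id: String, // or a tenant id
    pub requested_vram_bytes: u64,
    pub ttl_secs: u64,
    pub checkpoint_locator: Option<String>,
    pub last_activity_unix_secs: u64,
}

/// Rebuild the in-memory index from durable records after a restart.
/// Assumption: a session idle for longer than its TTL has already lost
/// its lease, so it is not tracked again.
pub fn rebuild_index(
    records: Vec<SessionRecord>,
    now_unix_secs: u64,
) -> HashMap<String, SessionRecord> {
    records
        .into_iter()
        .filter(|r| now_unix_secs.saturating_sub(r.last_activity_unix_secs) <= r.ttl_secs)
        .map(|r| (r.session_id.clone(), r))
        .collect()
}
```

Keeping the record this small means it can be rewritten on every activity update without much cost; anything large belongs behind the checkpoint locator.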
Renewal Policy
Renewal is the “heartbeat” of activity:
- renew only when the session performs work
- renew before the TTL expires, once 60-80% of the TTL has elapsed
- cap total continuous renewal time if you need fairness (e.g. max 30 minutes without releasing)
If renewal fails, treat it as an explicit preemption and surface that to the user.
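The three rules can be combined into one decision function, sketched here with the standard library only. `RenewalInputs`, `RenewalDecision`, and both constants are hypothetical; the 70% threshold is simply one point inside the 60-80% window, and this is not how `RenewalPolicy` in grafos_leasekit is implemented.

```rust
/// What the service should do with a lease right now (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RenewalDecision {
    Wait,  // too early in the TTL to renew
    Renew, // inside the renewal window and the session did work
    Stop,  // idle, or the fairness cap was reached: stop renewing
}

pub struct RenewalInputs {
    pub ttl_secs: u64,
    pub lease_age_secs: u64,       // seconds since the last acquire or renew
    pub idle_secs: u64,            // seconds since the last user activity
    pub continuous_hold_secs: u64, // seconds since the first acquire
}

/// Assumed values for this sketch.
pub const RENEW_AT_PERCENT: u64 = 70;
pub const MAX_CONTINUOUS_HOLD_SECS: u64 = 30 * 60;

pub fn decide(i: &RenewalInputs) -> RenewalDecision {
    // Fairness cap: never hold a GPU longer than the maximum without releasing.
    if i.continuous_hold_secs >= MAX_CONTINUOUS_HOLD_SECS {
        return RenewalDecision::Stop;
    }
    // Before the renewal window opens there is nothing to do.
    if i.lease_age_secs * 100 < i.ttl_secs * RENEW_AT_PERCENT {
        return RenewalDecision::Wait;
    }
    // Renew only if there was activity since the last acquire or renew.
    if i.idle_secs < i.lease_age_secs {
        RenewalDecision::Renew
    } else {
        RenewalDecision::Stop
    }
}
```

With a 300-second TTL, the window opens at 210 seconds: an active session renews there, while an idle one gets `Stop` and its lease is left to expire.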
Walkthrough
Core grafOS API Path
The session is a GPU lease plus optional checkpoint storage and renewal tracking:
```rust
use grafos_collections::durable::Durable;
use grafos_leasekit::{RenewalManager, RenewalPolicy};
use grafos_std::block::BlockBuilder;
use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass, GpuSession};

// Acquire a session-exclusive GPU lease: at least 16 GiB of VRAM for 300 seconds.
let lease = GpuBuilder::new()
    .min_vram(16 * 1024 * 1024 * 1024)
    .lease_secs(300)
    .exclusivity(GpuExclusivityClass::SessionExclusive)
    .acquire()?;

// Register the lease for renewal tracking.
let mut renewals = RenewalManager::new();
renewals.register(
    lease.lease_id(),
    lease.expires_at_unix_secs(),
    RenewalPolicy::default(),
);

// Optional: checkpoint storage backed by a separate block lease.
let ckpt_lease = BlockBuilder::new().min_blocks(512).lease_secs(3600).acquire()?;
let model_state = vec![0u8; 1024]; // placeholder state
let checkpoint = Durable::new(model_state, ckpt_lease);
checkpoint.checkpoint()?;

let session = GpuSession::new(&lease);
let status = lease.status();
# let _ = (renewals, checkpoint, session, status);
# Ok::<(), grafos_std::FabricError>(())
```

The service renews only while the user is active. If renewal stops or the lease is revoked, the session ends and the user restores from the checkpoint after acquiring a new lease.

End to end, the session flow is:
- user requests session
- acquire GPU lease
- serve interactive infer/render
- renew on activity
- drop on idle/expiry
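The flow above can be read as a small state machine. The sketch below is one way to write it down with plain enums; `State`, `Event`, `Action`, and `step` are hypothetical names for illustration and do not correspond to grafOS types.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Requested, // user asked for a session, no lease yet
    Serving,   // lease held, interactive infer/render in progress
    Ended,     // lease dropped or expired
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    LeaseAcquired,
    Activity,
    Idle,
    LeaseExpired,
}

/// Side effect the service should perform after a transition.
#[derive(Debug, PartialEq)]
pub enum Action {
    Nothing,
    RenewLease,
    DropLease,
}

pub fn step(state: State, event: Event) -> (State, Action) {
    match (state, event) {
        (State::Requested, Event::LeaseAcquired) => (State::Serving, Action::Nothing),
        (State::Serving, Event::Activity) => (State::Serving, Action::RenewLease),
        (State::Serving, Event::Idle) => (State::Ended, Action::DropLease),
        (State::Serving, Event::LeaseExpired) => (State::Ended, Action::DropLease),
        // Any other combination leaves the session unchanged.
        (s, _) => (s, Action::Nothing),
    }
}
```

Keeping the transition function pure makes it easy to test without a GPU: the caller applies the returned `Action` against the real lease.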
Failure Modes
- LeaseExpired: session ended; user must reacquire and restore from checkpoint.
- Disconnected: transient fabric failure; retry or fail with clear message.
- Overcommit: admission control bug; fix with hard caps and backpressure.
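A service might map these failure modes to recovery actions along the following lines. `SessionFailure` is a stand-in enum that only mirrors the names above; it is not `grafos_std::FabricError`, and `MAX_RETRIES` and the user-facing message are assumptions of this sketch.

```rust
/// Stand-in for the failure modes listed above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SessionFailure {
    LeaseExpired,
    Disconnected,
    Overcommit,
}

#[derive(Debug, PartialEq)]
pub enum Recovery {
    /// Acquire a new lease, then restore from the checkpoint.
    ReacquireAndRestore,
    /// Transient failure: try again.
    Retry { attempt: u32 },
    /// Give up and tell the user why.
    FailWithMessage(&'static str),
    /// Admission control bug: refuse the session and alert operators.
    RejectAndAlert,
}

/// Assumed retry budget for transient fabric failures.
pub const MAX_RETRIES: u32 = 3;

pub fn recover(failure: SessionFailure, attempts_so_far: u32) -> Recovery {
    match failure {
        SessionFailure::LeaseExpired => Recovery::ReacquireAndRestore,
        SessionFailure::Disconnected if attempts_so_far < MAX_RETRIES => {
            Recovery::Retry { attempt: attempts_so_far + 1 }
        }
        SessionFailure::Disconnected => {
            Recovery::FailWithMessage("GPU fabric unreachable; please try again shortly")
        }
        SessionFailure::Overcommit => Recovery::RejectAndAlert,
    }
}
```

Overcommit is deliberately not retried: it signals a bug in admission control, so retrying would only hide it.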
Observability
Meter per session:
- VRAM leased
- GPU-seconds consumed
- kernel launches and errors
- checkpoint bytes written
Expose per-tenant aggregates to detect noisy neighbors.
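As a rough sketch of the aggregation step, the per-session counters can be summed per tenant and compared against a share threshold. This uses only the standard library and does not show the grafos_observe API; `Usage`, `per_tenant`, `noisy_tenants`, and the share-of-GPU-seconds heuristic are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Per-session counters matching the list above (hypothetical type).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Usage {
    pub vram_bytes_leased: u64,
    pub gpu_seconds: u64,
    pub kernel_launches: u64,
    pub kernel_errors: u64,
    pub checkpoint_bytes: u64,
}

impl Usage {
    pub fn add(&mut self, other: &Usage) {
        self.vram_bytes_leased += other.vram_bytes_leased;
        self.gpu_seconds += other.gpu_seconds;
        self.kernel_launches += other.kernel_launches;
        self.kernel_errors += other.kernel_errors;
        self.checkpoint_bytes += other.checkpoint_bytes;
    }
}

/// Sum per-session usage into per-tenant totals.
pub fn per_tenant(sessions: &[(&str, Usage)]) -> HashMap<String, Usage> {
    let mut totals: HashMap<String, Usage> = HashMap::new();
    for (tenant, usage) in sessions {
        totals.entry(tenant.to_string()).or_default().add(usage);
    }
    totals
}

/// Tenants whose share of all GPU-seconds exceeds `max_share_percent`,
/// returned in sorted order.
pub fn noisy_tenants(totals: &HashMap<String, Usage>, max_share_percent: u64) -> Vec<String> {
    let all: u64 = totals.values().map(|u| u.gpu_seconds).sum();
    let mut noisy: Vec<String> = totals
        .iter()
        .filter(|(_, u)| u.gpu_seconds * 100 > all * max_share_percent)
        .map(|(tenant, _)| tenant.clone())
        .collect();
    noisy.sort();
    noisy
}
```

GPU-seconds is only one possible signal; the same pattern works for kernel errors or checkpoint bytes if those are what degrade neighbors in practice.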
Variations
- priority tiers (interactive > batch)
- “spot sessions” that may be preempted aggressively
- multi-GPU sessions for large jobs
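For the priority-tier variation, one simple approach is a queue that always serves interactive requests before batch ones and stays first-in, first-out within a tier. The sketch below uses `std::collections::BinaryHeap`; `Tier` and `TieredQueue` are hypothetical names, and strict priority like this can starve batch work unless it is combined with a cap or aging.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Later variants compare greater, so Interactive outranks Batch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Batch,
    Interactive,
}

/// Waiting sessions ordered by tier, then by arrival order.
pub struct TieredQueue {
    // BinaryHeap is a max-heap: the highest tier pops first, and
    // Reverse(seq) makes the earliest arrival win within a tier.
    heap: BinaryHeap<(Tier, Reverse<u64>, String)>,
    next_seq: u64,
}

impl TieredQueue {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    pub fn push(&mut self, tier: Tier, session_id: &str) {
        self.heap.push((tier, Reverse(self.next_seq), session_id.to_string()));
        self.next_seq += 1;
    }

    /// Next session to admit, or None if nobody is waiting.
    pub fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, id)| id)
    }
}
```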