
Recipe 18: Borrowed GPU Studio

Situation

Many users need GPUs only intermittently, so dedicating a GPU instance to each user leaves capacity idle.

You want:

  • fast session start
  • strong teardown
  • fair sharing

This recipe complements Recipe 11 (single-user, cell-level preemption) by focusing on multi-tenant session management: admission control, renewal policy, and metering.

What You Build

A multi-user GPU session service:

  • each session acquires a GPU lease sized to its needs
  • session lifetime is the lease lifetime (renew on activity)
  • sessions checkpoint state for resume
  • usage is metered per session

Building Blocks

  • GpuBuilder leases
  • Durable checkpoints (optional)
  • grafos_observe metering


Design

Fairness

Use a short TTL by default and renew only for active sessions, so that idle sessions return their GPU to the pool quickly.

Admission Control

If capacity is full, either:

  • queue the request
  • or offer a smaller VRAM lease
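The two options above can be combined into a single decision function. The following is a minimal sketch, not part of the grafOS API: the `Admission` enum, the `admit` function, and the `min_useful_bytes` threshold are hypothetical names introduced for illustration, and the pool is assumed to be tracked as a single count of free VRAM bytes.

```rust
/// Outcome of an admission decision (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// Enough VRAM is free: grant the requested size.
    Grant { vram_bytes: u64 },
    /// The request does not fit, but a smaller lease does.
    OfferSmaller { vram_bytes: u64 },
    /// Nothing useful is free: the request waits in the queue.
    Queue,
}

/// Decide how to handle a request given the free VRAM in the pool.
/// `min_useful_bytes` is the smallest lease worth offering (an assumed policy knob).
pub fn admit(requested_bytes: u64, free_bytes: u64, min_useful_bytes: u64) -> Admission {
    if requested_bytes <= free_bytes {
        Admission::Grant { vram_bytes: requested_bytes }
    } else if free_bytes >= min_useful_bytes {
        // Offer everything that is left rather than making the user wait.
        Admission::OfferSmaller { vram_bytes: free_bytes }
    } else {
        Admission::Queue
    }
}
```

Returning a value instead of acquiring the lease directly keeps the policy testable without a GPU; the caller would translate `Grant` or an accepted `OfferSmaller` into the actual lease request.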

Session State Model

Each session should have a minimal durable record:

  • session_id
  • user_id (or tenant id)
  • requested VRAM and TTL policy
  • checkpoint locator (optional)
  • last activity timestamp

If the session service restarts, it can rebuild in-memory tracking from this record.
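As a rough illustration, the record and the restart step might look like the sketch below. The field names and types, the `is_idle` rule (idle for longer than one TTL), and the `rebuild` helper are assumptions made for this example; the recipe itself only lists which pieces of information the record should carry.

```rust
/// Minimal durable record for one session (illustrative field names and types).
#[derive(Debug, Clone, PartialEq)]
pub struct SessionRecord {
    pub session_id: String,
    pub user_id: String,
    pub requested_vram_bytes: u64,
    pub ttl_secs: u64,
    /// Where the checkpoint lives, if the session has one.
    pub checkpoint_locator: Option<String>,
    pub last_activity_unix_secs: u64,
}

impl SessionRecord {
    /// Assumed rule: a session is idle once it has been inactive for more than one TTL.
    pub fn is_idle(&self, now_unix_secs: u64) -> bool {
        now_unix_secs.saturating_sub(self.last_activity_unix_secs) > self.ttl_secs
    }
}

/// After a restart, split the stored records into (still active, idle).
/// Active sessions go back into in-memory tracking; idle ones can be closed.
pub fn rebuild(
    records: Vec<SessionRecord>,
    now_unix_secs: u64,
) -> (Vec<SessionRecord>, Vec<SessionRecord>) {
    records.into_iter().partition(|r| !r.is_idle(now_unix_secs))
}
```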

Renewal Policy

Renewal is the “heartbeat” of activity:

  • renew only when the session performs work
  • renew before TTL expiry, once 60-80% of the TTL has elapsed
  • cap total continuous renewal time if you need fairness (e.g. max 30 minutes without releasing)

If renewal fails, treat it as an explicit preemption and surface that to the user.
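The three rules above can be expressed as one pure function that the service evaluates on each tick. This is a sketch under stated assumptions and does not use `RenewalManager` or `RenewalPolicy`: the `LeaseView` struct, the `RenewalDecision` enum, and the choice to trigger at exactly 60% of the TTL are hypothetical.

```rust
/// What the service should do with a lease right now (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RenewalDecision {
    /// Too early: less than 60% of the TTL has elapsed.
    Wait,
    /// The session did work during this lease period: renew.
    Renew,
    /// No work since the lease was issued: let it expire.
    LetExpire,
    /// Fairness cap reached: release even though the session is active.
    Release,
}

/// The timestamps the decision needs, all in Unix seconds (illustrative struct).
pub struct LeaseView {
    pub issued_at: u64,         // start of the current lease period
    pub ttl_secs: u64,          // length of one lease period
    pub first_acquired_at: u64, // when the session first obtained a GPU
    pub last_activity_at: u64,  // last time the session performed work
}

pub fn decide(lease: &LeaseView, now: u64, max_continuous_secs: u64) -> RenewalDecision {
    let elapsed = now.saturating_sub(lease.issued_at);
    // Assumed trigger: the start of the 60-80% renewal window.
    let renew_at = lease.ttl_secs * 60 / 100;
    if elapsed < renew_at {
        return RenewalDecision::Wait;
    }
    // Cap total continuous holding time for fairness.
    if now.saturating_sub(lease.first_acquired_at) >= max_continuous_secs {
        return RenewalDecision::Release;
    }
    // Renew only if the session performed work during this lease period.
    if lease.last_activity_at >= lease.issued_at {
        RenewalDecision::Renew
    } else {
        RenewalDecision::LetExpire
    }
}
```

With a 300-second TTL, this sketch starts considering renewal 180 seconds into the lease period, which leaves 120 seconds of slack for retries before expiry.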

Walkthrough

Core grafOS API Path

The session is a GPU lease plus optional checkpoint storage and renewal tracking:

use grafos_collections::durable::Durable;
use grafos_leasekit::{RenewalManager, RenewalPolicy};
use grafos_std::block::BlockBuilder;
use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass, GpuSession};

// Acquire a GPU lease with at least 16 GiB of VRAM and a short (5-minute) TTL.
let lease = GpuBuilder::new()
    .min_vram(16 * 1024 * 1024 * 1024)
    .lease_secs(300)
    .exclusivity(GpuExclusivityClass::SessionExclusive)
    .acquire()?;

// Register the lease so it can be renewed while the session is active.
let mut renewals = RenewalManager::new();
renewals.register(
    lease.lease_id(),
    lease.expires_at_unix_secs(),
    RenewalPolicy::default(),
);

// Optional: a longer-lived (1-hour) block lease backs the durable checkpoint.
let ckpt_lease = BlockBuilder::new().min_blocks(512).lease_secs(3600).acquire()?;
let model_state = vec![0u8; 1024]; // placeholder for real session state
let checkpoint = Durable::new(model_state, ckpt_lease);
checkpoint.checkpoint()?;

// Bind the session to the GPU lease and read the current lease status.
let session = GpuSession::new(&lease);
let status = lease.status();
# let _ = (renewals, checkpoint, session, status);
# Ok::<(), grafos_std::FabricError>(())

The service renews only while the user is active. If renewal stops or the lease is revoked, the session ends and the user restores from the checkpoint after acquiring a new lease.

  1. user requests a session
  2. service acquires a GPU lease
  3. session serves interactive inference or rendering
  4. service renews the lease on activity
  5. service drops the lease on idle or expiry

Failure Modes

  • LeaseExpired: session ended; user must reacquire and restore from checkpoint.
  • Disconnected: transient fabric failure; retry or fail with a clear message.
  • Overcommit: admission control bug; fix with hard caps and backpressure.
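One way to keep these responses consistent is to map each failure to a recovery action in a single place. The sketch below uses a hypothetical `SessionFailure` enum that mirrors the three cases listed above; it is not the real `grafos_std::FabricError` type, whose variants are not shown in this recipe. The `Recovery` enum, the retry budget, and the message text are likewise assumptions.

```rust
/// The three failure modes from this recipe (hypothetical enum, not FabricError).
#[derive(Debug, PartialEq)]
pub enum SessionFailure {
    LeaseExpired,
    Disconnected,
    Overcommit,
}

/// What the session service does in response (illustrative).
#[derive(Debug, PartialEq)]
pub enum Recovery {
    /// End the session; the user reacquires a lease and restores from checkpoint.
    ReacquireAndRestore,
    /// Transient failure: try again, recording which attempt this is.
    Retry { attempt: u32 },
    /// Retries exhausted: fail with a message the user can act on.
    FailWithMessage(&'static str),
    /// Admission control bug: reject new work and alert operators.
    RejectAndAlert,
}

pub fn recover(failure: &SessionFailure, attempts_so_far: u32, max_retries: u32) -> Recovery {
    match failure {
        SessionFailure::LeaseExpired => Recovery::ReacquireAndRestore,
        SessionFailure::Disconnected if attempts_so_far < max_retries => {
            Recovery::Retry { attempt: attempts_so_far + 1 }
        }
        SessionFailure::Disconnected => {
            Recovery::FailWithMessage("GPU fabric unreachable; please try again")
        }
        SessionFailure::Overcommit => Recovery::RejectAndAlert,
    }
}
```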

Observability

Meter per session:

  • VRAM leased
  • GPU-seconds consumed
  • kernel launches and errors
  • checkpoint bytes written

Expose per-tenant aggregates to detect noisy neighbors.
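A minimal sketch of the per-tenant roll-up follows. It does not use the grafos_observe API; the `Usage` struct, the `aggregate_by_tenant` and `noisy_tenants` functions, and the "share of total GPU-seconds" heuristic for a noisy neighbor are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Per-session counters matching the list above (illustrative struct).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Usage {
    pub vram_bytes: u64,
    pub gpu_seconds: u64,
    pub kernel_launches: u64,
    pub kernel_errors: u64,
    pub checkpoint_bytes: u64,
}

/// Sum per-session usage into one total per tenant.
/// Each input item is (tenant id, usage of one session).
pub fn aggregate_by_tenant(sessions: &[(String, Usage)]) -> HashMap<String, Usage> {
    let mut totals: HashMap<String, Usage> = HashMap::new();
    for (tenant, u) in sessions {
        let t = totals.entry(tenant.clone()).or_default();
        t.vram_bytes += u.vram_bytes;
        t.gpu_seconds += u.gpu_seconds;
        t.kernel_launches += u.kernel_launches;
        t.kernel_errors += u.kernel_errors;
        t.checkpoint_bytes += u.checkpoint_bytes;
    }
    totals
}

/// Assumed heuristic: a tenant is noisy if it consumed more than `max_share`
/// (a fraction between 0 and 1) of all GPU-seconds. Results are sorted by tenant id.
pub fn noisy_tenants(totals: &HashMap<String, Usage>, max_share: f64) -> Vec<String> {
    let all: u64 = totals.values().map(|u| u.gpu_seconds).sum();
    if all == 0 {
        return Vec::new();
    }
    let mut noisy: Vec<String> = totals
        .iter()
        .filter(|(_, u)| u.gpu_seconds as f64 / all as f64 > max_share)
        .map(|(t, _)| t.clone())
        .collect();
    noisy.sort();
    noisy
}
```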

Variations

  • priority tiers (interactive > batch)
  • “spot sessions” that may be preempted aggressively
  • multi-GPU sessions for large jobs