Recipe 52: Clean Preemptible GPU Training Job
Situation
You want to use cheap interruptible GPU capacity for training, but normal cloud preemption is operationally messy: a job can lose the instance, leak GPU or RDMA state, miss a checkpoint, or replay an optimizer step twice.
In grafOS, the job is structured around real leases and replicated resources. The GPU is scoped to a short lease, device memory and loaded modules are tied to that lease, progress is recorded in a replicated checkpoint, and each training step has an idempotency key.
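The recipe passes an effect key and a fingerprint to the idempotency store without showing how they are built. The sketch below is one plausible way to derive them. The key format, the `step_effect_key` and `step_fingerprint` names, and the choice of FNV-1a are assumptions for illustration, not helpers from the recipe or from grafOS.

```rust
/// Hypothetical key format: one logical effect per (job, step) pair.
fn step_effect_key(job_id: &str, step: u64) -> String {
    format!("job:{}:step:{}", job_id, step)
}

/// 64-bit FNV-1a over the step number (little-endian) and the input bytes.
/// Picked here only because it is tiny and yields the same value in every
/// process, whereas the algorithm behind std's `DefaultHasher` is unspecified.
fn step_fingerprint(step: u64, input: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for byte in step.to_le_bytes().iter().chain(input.iter()) {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}
```

A retry of the same step with the same input produces the same key and fingerprint, which is what lets a store recognize it as a duplicate rather than a new request.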
What You Build
A training step runner that:
- acquires a short GPU lease with session exclusivity;
- launches a kernel through GpuSession;
- explicitly unloads the module and frees device memory;
- saves the model state with ReplicatedCheckpoint;
- records the step in ReplicatedIdempotencyStore;
- fails closed if the lease has already been revoked.
The compiled recipe lives in
cookbook/recipe-52-clean-preemptible-gpu-training.
Core grafOS API Path
The recipe helper for lease acquisition is just the public GPU builder:
    use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass};

    let lease = GpuBuilder::new()
        .min_vram(16 * 1024 * 1024 * 1024)
        .lease_secs(120)
        .exclusivity(GpuExclusivityClass::SessionExclusive)
        .acquire()?;
    # Ok::<(), grafos_std::error::GrafosError>(())

The training state helper constructs a replicated checkpoint and idempotency store:
    use fabricbios_core::lease::FenceEpoch;
    use grafos_replicated::{
        FailureDomain, LogicalResourceName, PlacementPolicy, PolicyHash, ReplicaHealth, ReplicaId,
        ReplicaLocator, ReplicaPolicy, ReplicaRole, ReplicaSetLocator, ReplicatedCheckpoint,
        ReplicatedIdempotencyStore, ResourceGeneration, SchemaId,
    };

    let domain = FailureDomain::cell("local", "cookbook");
    let replicas = ReplicaPolicy::new(PlacementPolicy::new().allow(domain.clone()))
        .min_replicas(1)
        .write_quorum(1)
        .read_quorum(1);
    let generation = ResourceGeneration(1);
    let locator = ReplicaSetLocator::new(
        generation,
        vec![ReplicaLocator {
            replica_id: ReplicaId::new("training-local-a"),
            domain,
            role: ReplicaRole::Voter,
            health: ReplicaHealth::Healthy,
            epoch: FenceEpoch(1),
            content_generation: generation.0,
        }],
    );
    let checkpoints = ReplicatedCheckpoint::new(
        LogicalResourceName::new("preemptible-training-checkpoints"),
        SchemaId::new("training-checkpoint.v1"),
        FenceEpoch(1),
        replicas.clone(),
        locator.clone(),
        PolicyHash([52; 32]),
    )?;
    let effects = ReplicatedIdempotencyStore::new(
        LogicalResourceName::new("preemptible-training-effects"),
        SchemaId::new("training-effect.v1"),
        FenceEpoch(1),
        replicas,
        locator,
    )?;
    # let _ = (checkpoints, effects);
    # Ok::<(), grafos_replicated::ReplicatedError>(())

Inside the step runner, the useful grafOS calls are the lease status check, idempotency reservation, GPU session lifecycle, checkpoint CAS, and effect completion:
    use fabricbios_core::lease::FenceEpoch;
    use fabricbios_core::state::LeaseStatus;
    use grafos_replicated::{CheckpointName, IdempotencyOutcome};
    use grafos_std::gpu::{GpuSession, KernelArgs};

    if lease.status() != LeaseStatus::Active {
        return Err(TrainingError::LeaseNotActive(lease.status()));
    }

    let reservation = effects.reserve(effect_key.clone(), fingerprint, None, FenceEpoch(1))?;
    if matches!(reservation.value.outcome, IdempotencyOutcome::Completed { .. }) {
        return Ok(TrainingStepOutcome::Duplicate { job_id, step });
    }

    let mut session = GpuSession::new(&lease);
    let input = session.mem_alloc(batch.input.len() as u64)?;
    session.mem_write(&input, 0, &batch.input)?;
    let module = session.module_load(module_bytes)?;
    let args = KernelArgs::new()
        .push_u64(batch.step)
        .push_u32(batch.output_len)
        .push_buffer(&input);
    session.launch_with_args(&module, kernel, [1, 1, 1], [1, 1, 1], args)?;
    session.sync()?;
    let output = session.mem_read(&input, 0, batch.output_len)?;
    session.module_unload(module)?;
    session.mem_free(input)?;

    let saved = checkpoints.save_bytes(
        CheckpointName::new(format!("job:{}:latest", batch.job_id)),
        expected_checkpoint_version,
        &checkpoint_bytes,
        FenceEpoch(1),
    )?;
    effects.complete(
        effect_key,
        reservation.version,
        IdempotencyOutcome::Completed { effect: None },
        FenceEpoch(1),
    )?;
    # let _ = (output, saved);
    # Ok::<(), cookbook_recipe_52_clean_preemptible_gpu_training::TrainingError>(())

Program
    use cookbook_recipe_52_clean_preemptible_gpu_training::{
        acquire_preemptible_training_lease, replicated_training_state, run_training_step,
        TrainingBatch, TrainingStepOutcome,
    };
    use grafos_replicated::Version;

    let (mut checkpoints, mut effects) = replicated_training_state()?;
    let lease = acquire_preemptible_training_lease(16 * 1024 * 1024 * 1024, 120)?;

    let outcome = run_training_step(
        &mut checkpoints,
        &mut effects,
        &lease,
        include_bytes!("train_step.ptx"),
        "train_step",
        TrainingBatch {
            job_id: "spot-resnet".into(),
            step: 42,
            input: vec![1, 2, 3, 4],
            output_len: 4,
        },
        Version(0),
    )?;

    assert!(matches!(outcome, TrainingStepOutcome::Completed { .. }));
    # Ok::<(), cookbook_recipe_52_clean_preemptible_gpu_training::TrainingError>(())

Why This Is Different
Traditional preemptible training usually combines VM lifecycle hooks, a cloud queue, a database row, object storage, cleanup scripts, and retry glue. This recipe keeps the failure boundary in the program:
- the lease owns GPU residency;
- the session owns device allocations and module loads;
- the checkpoint owns durable progress;
- the idempotency store owns duplicate suppression.
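The ownership idea in the list above, where a handle's scope decides when a resource is released, can be illustrated with plain Rust drop semantics. This is a sketch under stated assumptions: `Scoped`, `ReleaseLog`, and `run_step` are hypothetical stand-ins, and nothing here claims that `GpuSession` releases resources this way.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log standing in for a device driver; it records release calls.
type ReleaseLog = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical scoped resource that records its release when dropped.
struct Scoped {
    name: &'static str,
    log: ReleaseLog,
}

impl Drop for Scoped {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Simulated step body. `fail_launch` models a launch error that returns
/// early, before any explicit cleanup call could run.
fn run_step(log: &ReleaseLog, fail_launch: bool) -> Result<(), &'static str> {
    let _buffer = Scoped { name: "mem_free", log: Rc::clone(log) };
    let _module = Scoped { name: "module_unload", log: Rc::clone(log) };
    if fail_launch {
        return Err("launch failed");
    }
    Ok(())
}
```

Locals are dropped in reverse declaration order, so the module is released before the buffer on both the success path and the early-return path.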
If a lease is revoked before work starts, run_training_step returns
LeaseNotActive and does not submit GPU work. If a duplicate step arrives with
the same fingerprint, the effect store returns the existing record instead of
performing a second logical step.
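The duplicate-suppression behavior described above can be sketched with an in-memory map. This is a minimal illustration, not the `ReplicatedIdempotencyStore` implementation: `EffectStore`, `Reservation`, and the rule that a mismatched fingerprint is reported as a conflict are assumptions made for the example.

```rust
use std::collections::HashMap;

/// What a caller learns when it tries to reserve a step.
#[derive(Debug, Clone, PartialEq)]
enum Reservation {
    /// First time this key is seen; the caller should do the work.
    Fresh,
    /// Same key and fingerprint, reserved but not yet completed.
    InFlight,
    /// Same key and fingerprint, already completed; skip the work.
    AlreadyCompleted,
    /// Same key with a different fingerprint (treated as a conflict here).
    FingerprintMismatch,
}

#[derive(Default)]
struct EffectStore {
    /// key -> (fingerprint, completed flag)
    records: HashMap<String, (u64, bool)>,
}

impl EffectStore {
    /// Records a new reservation or reports the state of an existing one.
    fn reserve(&mut self, key: &str, fingerprint: u64) -> Reservation {
        match self.records.get(key).copied() {
            None => {
                self.records.insert(key.to_owned(), (fingerprint, false));
                Reservation::Fresh
            }
            Some((seen, _)) if seen != fingerprint => Reservation::FingerprintMismatch,
            Some((_, true)) => Reservation::AlreadyCompleted,
            Some((_, false)) => Reservation::InFlight,
        }
    }

    /// Marks a reserved key as completed; unknown keys are ignored.
    fn complete(&mut self, key: &str) {
        if let Some(record) = self.records.get_mut(key) {
            record.1 = true;
        }
    }
}
```

A scheduler retry that calls `reserve` after `complete` sees `AlreadyCompleted` and can return without touching the GPU, which mirrors the `Duplicate` outcome in the step runner.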
Failure Modes
- Lease revoked before launch: fail closed before opening a GPU session.
- Process exits mid-step: GPU resources are scoped to the lease and session handles; committed progress is only the latest replicated checkpoint.
- Duplicate scheduler retry: the idempotency key suppresses the duplicate logical effect.
- Checkpoint mismatch: the checkpoint CAS version rejects stale writers.
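The checkpoint-mismatch case relies on a compare-and-swap on the checkpoint version. A minimal single-process sketch of that check follows. `CheckpointSlot` and `StaleVersion` are hypothetical names, and the real `ReplicatedCheckpoint` additionally deals with replication and fencing, which are not modeled here.

```rust
/// Error returned when the caller's expected version is stale.
#[derive(Debug, PartialEq)]
struct StaleVersion {
    expected: u64,
    actual: u64,
}

/// Single-slot checkpoint with a monotonically increasing version.
#[derive(Default)]
struct CheckpointSlot {
    version: u64,
    bytes: Vec<u8>,
}

impl CheckpointSlot {
    /// Stores `bytes` only if `expected` matches the current version,
    /// then returns the new version. A mismatch leaves the slot untouched.
    fn save(&mut self, expected: u64, bytes: &[u8]) -> Result<u64, StaleVersion> {
        if expected != self.version {
            return Err(StaleVersion { expected, actual: self.version });
        }
        self.bytes = bytes.to_vec();
        self.version += 1;
        Ok(self.version)
    }
}
```

A writer that resumes from an old checkpoint still holds the old version number, so its save is rejected and the newer state survives.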
Tests
Run it with:
    cargo test -p cookbook-recipe-52-clean-preemptible-gpu-training

The tests cover successful checkpointing and fail-closed handling of revoked leases, using the official local host backend from grafos-std.
See also:
- crates/grafos-std/src/gpu.rs
- crates/grafos-replicated