Recipe 52: Clean Preemptible GPU Training Job

Situation

You want to use cheap interruptible GPU capacity for training, but normal cloud preemption is operationally messy: a job can lose the instance, leak GPU or RDMA state, miss a checkpoint, or replay an optimizer step twice.

In grafOS, the job is structured around real leases and replicated resources. The GPU is scoped to a short lease, device memory and loaded modules are tied to that lease, progress is recorded in a replicated checkpoint, and each training step has an idempotency key.

What You Build

A training step runner that:

  • acquires a short GPU lease with session exclusivity;
  • launches a kernel through GpuSession;
  • explicitly unloads the module and frees device memory;
  • saves the model state with ReplicatedCheckpoint;
  • records the step in ReplicatedIdempotencyStore;
  • fails closed if the lease has already been revoked.

The compiled recipe lives in cookbook/recipe-52-clean-preemptible-gpu-training.

Core grafOS API Path

The recipe helper for lease acquisition is just the public GPU builder:

use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass};

let lease = GpuBuilder::new()
    .min_vram(16 * 1024 * 1024 * 1024)
    .lease_secs(120)
    .exclusivity(GpuExclusivityClass::SessionExclusive)
    .acquire()?;
# Ok::<(), grafos_std::error::GrafosError>(())

The training state helper constructs a replicated checkpoint and idempotency store:

use fabricbios_core::lease::FenceEpoch;
use grafos_replicated::{
    FailureDomain, LogicalResourceName, PlacementPolicy, PolicyHash,
    ReplicaHealth, ReplicaId, ReplicaLocator, ReplicaPolicy, ReplicaRole,
    ReplicaSetLocator, ReplicatedCheckpoint, ReplicatedIdempotencyStore,
    ResourceGeneration, SchemaId,
};

let domain = FailureDomain::cell("local", "cookbook");
let replicas = ReplicaPolicy::new(PlacementPolicy::new().allow(domain.clone()))
    .min_replicas(1)
    .write_quorum(1)
    .read_quorum(1);
let generation = ResourceGeneration(1);
let locator = ReplicaSetLocator::new(
    generation,
    vec![ReplicaLocator {
        replica_id: ReplicaId::new("training-local-a"),
        domain,
        role: ReplicaRole::Voter,
        health: ReplicaHealth::Healthy,
        epoch: FenceEpoch(1),
        content_generation: generation.0,
    }],
);
let checkpoints = ReplicatedCheckpoint::new(
    LogicalResourceName::new("preemptible-training-checkpoints"),
    SchemaId::new("training-checkpoint.v1"),
    FenceEpoch(1),
    replicas.clone(),
    locator.clone(),
    PolicyHash([52; 32]),
)?;
let effects = ReplicatedIdempotencyStore::new(
    LogicalResourceName::new("preemptible-training-effects"),
    SchemaId::new("training-effect.v1"),
    FenceEpoch(1),
    replicas,
    locator,
)?;
# let _ = (checkpoints, effects);
# Ok::<(), grafos_replicated::ReplicatedError>(())
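
The policy above uses a single replica with both quorums set to 1, which suits the local cookbook cell. If the same policy is scaled to more replicas, a common rule from quorum systems is that every read quorum overlaps every write quorum when R + W > N. The sketch below checks that rule with plain integers. `quorums_overlap` is a hypothetical helper, not part of grafos_replicated, and this page does not say whether grafOS validates quorum sizes itself.

```rust
/// Hypothetical helper (not a grafos_replicated API): checks the classic
/// quorum-intersection rule for a replica set of size `n`.
pub fn quorums_overlap(n: u32, write_quorum: u32, read_quorum: u32) -> bool {
    // Empty quorums, or quorums larger than the replica set, are rejected.
    if write_quorum == 0 || read_quorum == 0 || write_quorum > n || read_quorum > n {
        return false;
    }
    // Every read quorum shares a replica with every write quorum iff R + W > N.
    // Widen to u64 so the sum cannot overflow.
    u64::from(read_quorum) + u64::from(write_quorum) > u64::from(n)
}
```

With the recipe's values, `quorums_overlap(1, 1, 1)` holds. With three replicas, quorums of 1 and 1 would not overlap, while 2 and 2 would.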

Inside the step runner, the useful grafOS calls are the lease status check, idempotency reservation, GPU session lifecycle, checkpoint CAS, and effect completion:

use fabricbios_core::lease::FenceEpoch;
use fabricbios_core::state::LeaseStatus;
use grafos_replicated::{CheckpointName, IdempotencyOutcome};
use grafos_std::gpu::{GpuSession, KernelArgs};

// Fail closed: submit no GPU work unless the lease is still active.
if lease.status() != LeaseStatus::Active {
    return Err(TrainingError::LeaseNotActive(lease.status()));
}

// Reserve the step's idempotency key; a completed record means a duplicate.
let reservation = effects.reserve(effect_key.clone(), fingerprint, None, FenceEpoch(1))?;
if matches!(reservation.value.outcome, IdempotencyOutcome::Completed { .. }) {
    return Ok(TrainingStepOutcome::Duplicate { job_id, step });
}

// GPU session lifecycle: allocate, upload, load the module, launch, sync.
let mut session = GpuSession::new(&lease);
let input = session.mem_alloc(batch.input.len() as u64)?;
session.mem_write(&input, 0, &batch.input)?;
let module = session.module_load(module_bytes)?;
let args = KernelArgs::new()
    .push_u64(batch.step)
    .push_u32(batch.output_len)
    .push_buffer(&input);
session.launch_with_args(&module, kernel, [1, 1, 1], [1, 1, 1], args)?;
session.sync()?;

// Read the result back, then explicitly release the module and device memory.
let output = session.mem_read(&input, 0, batch.output_len)?;
session.module_unload(module)?;
session.mem_free(input)?;

// Commit progress with a versioned (CAS) checkpoint write.
let saved = checkpoints.save_bytes(
    CheckpointName::new(format!("job:{}:latest", batch.job_id)),
    expected_checkpoint_version,
    &checkpoint_bytes,
    FenceEpoch(1),
)?;

// Mark the effect completed so retries of this step are suppressed.
effects.complete(
    effect_key,
    reservation.version,
    IdempotencyOutcome::Completed { effect: None },
    FenceEpoch(1),
)?;
# let _ = (output, saved);
# Ok::<(), cookbook_recipe_52_clean_preemptible_gpu_training::TrainingError>(())
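
In the fragment above, a `?` that returns early between `mem_alloc` and `mem_free` skips the explicit cleanup calls, and the recipe relies on lease and session scoping for that case. A general Rust pattern for running cleanup on every exit path is a `Drop` guard. The sketch below shows the pattern against a stand-in device. `FakeDevice`, `Allocation`, and `run_step` are hypothetical names for illustration only, and this page does not say whether `GpuSession` already implements `Drop` this way.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a device: it only counts live allocations.
#[derive(Default)]
pub struct FakeDevice {
    live: Cell<usize>,
}

impl FakeDevice {
    /// Number of allocations that have not been released yet.
    pub fn live(&self) -> usize {
        self.live.get()
    }
}

/// Guard that releases its allocation when dropped, including on early return.
pub struct Allocation {
    device: Rc<FakeDevice>,
}

impl Allocation {
    pub fn new(device: &Rc<FakeDevice>) -> Self {
        device.live.set(device.live.get() + 1);
        Allocation { device: Rc::clone(device) }
    }
}

impl Drop for Allocation {
    fn drop(&mut self) {
        // Runs on normal scope exit and on early `return` or `?`.
        self.device.live.set(self.device.live.get() - 1);
    }
}

/// Simulated step: allocates, then optionally fails before any explicit cleanup.
/// On success it returns how many allocations were live during the step.
pub fn run_step(device: &Rc<FakeDevice>, fail: bool) -> Result<usize, String> {
    let _input = Allocation::new(device);
    if fail {
        return Err("simulated launch failure".to_string());
    }
    Ok(device.live())
}
```

In this model the live count returns to zero after both the successful and the failing call to `run_step`, because the guard is dropped on either path.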

Program

use cookbook_recipe_52_clean_preemptible_gpu_training::{
    acquire_preemptible_training_lease, replicated_training_state, run_training_step,
    TrainingBatch, TrainingStepOutcome,
};
use grafos_replicated::Version;

let (mut checkpoints, mut effects) = replicated_training_state()?;
let lease = acquire_preemptible_training_lease(16 * 1024 * 1024 * 1024, 120)?;
let outcome = run_training_step(
    &mut checkpoints,
    &mut effects,
    &lease,
    include_bytes!("train_step.ptx"),
    "train_step",
    TrainingBatch {
        job_id: "spot-resnet".into(),
        step: 42,
        input: vec![1, 2, 3, 4],
        output_len: 4,
    },
    Version(0),
)?;
assert!(matches!(outcome, TrainingStepOutcome::Completed { .. }));
# Ok::<(), cookbook_recipe_52_clean_preemptible_gpu_training::TrainingError>(())

Why This Is Different

Traditional preemptible training usually combines VM lifecycle hooks, a cloud queue, a database row, object storage, cleanup scripts, and retry glue. This recipe keeps the failure boundary in the program:

  • the lease owns GPU residency;
  • the session owns device allocations and module loads;
  • the checkpoint owns durable progress;
  • the idempotency store owns duplicate suppression.

If a lease is revoked before work starts, run_training_step returns LeaseNotActive and does not submit GPU work. If a duplicate step arrives with the same fingerprint, the effect store returns the existing record instead of performing a second logical step.
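
The duplicate-suppression behavior can be modeled without any grafOS crates. The following is a minimal in-memory sketch, not the ReplicatedIdempotencyStore API: `EffectStore`, `Reserve`, and the `u64` fingerprint are hypothetical. Two behaviors are assumptions, since this page does not specify them: a key reused with a different fingerprint is refused, and a reservation that never completed may be retried.

```rust
use std::collections::HashMap;

/// Answer to "may this step run?" in the hypothetical model.
#[derive(Debug, PartialEq, Eq)]
pub enum Reserve {
    /// New key, or an earlier attempt that never completed: run the step.
    Proceed,
    /// Same key and fingerprint already completed: skip the step.
    AlreadyCompleted,
    /// Same key with a different fingerprint: refuse (assumed behavior).
    FingerprintMismatch,
}

/// Hypothetical in-memory effect store keyed by step identity.
#[derive(Default)]
pub struct EffectStore {
    // key -> (fingerprint, completed)
    records: HashMap<String, (u64, bool)>,
}

impl EffectStore {
    /// Reserves `key` for a step whose payload hashes to `fingerprint`.
    pub fn reserve(&mut self, key: &str, fingerprint: u64) -> Reserve {
        match self.records.get(key).copied() {
            Some((stored, _)) if stored != fingerprint => Reserve::FingerprintMismatch,
            Some((_, true)) => Reserve::AlreadyCompleted,
            Some((_, false)) => Reserve::Proceed,
            None => {
                self.records.insert(key.to_string(), (fingerprint, false));
                Reserve::Proceed
            }
        }
    }

    /// Marks a reserved key as completed; unknown keys are ignored.
    pub fn complete(&mut self, key: &str) {
        if let Some(record) = self.records.get_mut(key) {
            record.1 = true;
        }
    }
}
```

A scheduler retry that repeats a completed key with the same fingerprint gets `Reserve::AlreadyCompleted`, so the caller can return early instead of launching the kernel again.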

Failure Modes

  • Lease revoked before launch: fail closed before opening a GPU session.
  • Process exits mid-step: GPU resources are scoped to the lease and session handles, so they do not outlive them; the only committed progress is the latest replicated checkpoint.
  • Duplicate scheduler retry: the idempotency key suppresses the duplicate logical effect.
  • Checkpoint mismatch: the checkpoint CAS version rejects stale writers.
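
The checkpoint mismatch case is a compare-and-swap on a version number. Here is a minimal in-memory sketch of that idea, assuming a `u64` version that increments on each successful save. `CheckpointSlot` and `StaleWriter` are hypothetical and do not reflect the ReplicatedCheckpoint API or its error type.

```rust
/// Hypothetical single-slot checkpoint with a version counter.
#[derive(Default)]
pub struct CheckpointSlot {
    version: u64,
    bytes: Vec<u8>,
}

/// Returned when the caller's expected version is out of date.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleWriter {
    pub expected: u64,
    pub actual: u64,
}

impl CheckpointSlot {
    /// Stores `bytes` only if `expected` matches the current version,
    /// then returns the new version.
    pub fn save(&mut self, expected: u64, bytes: &[u8]) -> Result<u64, StaleWriter> {
        if expected != self.version {
            return Err(StaleWriter { expected, actual: self.version });
        }
        self.bytes = bytes.to_vec();
        self.version += 1;
        Ok(self.version)
    }

    /// Current version and contents.
    pub fn latest(&self) -> (u64, &[u8]) {
        (self.version, &self.bytes)
    }
}
```

In this model a first save with expected version 0 succeeds and moves the slot to version 1. A second writer that still expects version 0 is rejected and the stored bytes are unchanged.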

Tests

Run it with:

cargo test -p cookbook-recipe-52-clean-preemptible-gpu-training

The tests cover successful checkpointing and fail-closed revoked leases using the official local host backend from grafos-std.

See also:

  • crates/grafos-std/src/gpu.rs
  • crates/grafos-replicated