Recipe 58: Workload That Migrates on Node Drain
Situation
A stateful workload — a queue worker, a stream consumer, a long-
running batch step — has been placed on a specific node. An operator
decides to drain that node for hardware swap. The scheduler walks
the node through Accepting → Draining → Drained → Maintenance → Returning → Accepting. The workload needs to:
- Stop admitting new work the moment the node leaves
Accepting; - Finish whatever’s in-flight, then checkpoint cleanly;
- Persist the final checkpoint atomically when the node confirms
Drained; - Stay quiet through
MaintenanceandReturning; - Resume from the checkpoint when the node (or a successor) goes
back to
Accepting; - Halt fail-closed if the fence ack is lost mid-drain
(
FenceLost) — operator intervention required before resuming.
The grafOS scheduler emits typed NodeModeTransition events on every
state change. This recipe is the in-workload state machine that
consumes those events and produces the right operational action.
What You Build
A pure decision function + a lifecycle state struct that:
- Takes a typed
NodeModeTransition, validates it against the scheduler’s state machine (NodeMode::legal_transition_to), and refuses garbage events fail-closed; - Emits a typed
WorkloadAction(Accept, CheckpointAndDrain, PersistFinalCheckpoint, Quiesce, ResumeFromCheckpoint, Halt); - Carries minimal lifecycle state (
current_mode,accepts_new_work,has_persisted_checkpoint) the runtime adapter can persist alongside its checkpoint payload.
The compiled recipe lives in
cookbook/recipe-58-workload-on-node-drain.
Core grafOS API Path
use grafos_scheduler_service::{NodeMode, NodeModeTransition, NodeModeTransitionError};
// The state machine is authoritative here:assert!(NodeMode::Accepting.legal_transition_to(NodeMode::Draining));assert!(NodeMode::Draining.legal_transition_to(NodeMode::Drained));assert!(NodeMode::Drained.legal_transition_to(NodeMode::Maintenance));assert!(NodeMode::Maintenance.legal_transition_to(NodeMode::Returning));assert!(NodeMode::Returning.legal_transition_to(NodeMode::Accepting));
// The fault path:assert!(NodeMode::Drained.legal_transition_to(NodeMode::FenceLost));assert!(NodeMode::FenceLost.legal_transition_to(NodeMode::Draining));assert!(NodeMode::FenceLost.legal_transition_to(NodeMode::Drained));
// Illegal transitions return the typed error:let illegal = NodeModeTransition { node_id: 0x42, from: NodeMode::Accepting, to: NodeMode::Drained,};if !illegal.from.legal_transition_to(illegal.to) { let _err = NodeModeTransitionError::IllegalTransition { from: illegal.from, to: illegal.to, };}Program
use cookbook_recipe_58_workload_on_node_drain::{ action_for_transition, validate_transition, WorkloadAction, WorkloadLifecycle,};use grafos_scheduler_service::{NodeMode, NodeModeTransition};
let mut state = WorkloadLifecycle::fresh_on_accepting_node();
// Runtime adapter delivers a transition. Apply, get the action,// route the action to the workload's executor.let event = NodeModeTransition { node_id: 0x42, from: NodeMode::Accepting, to: NodeMode::Draining,};match state.apply(event)? { WorkloadAction::Accept => {/* keep admitting new work */} WorkloadAction::CheckpointAndDrain => { // Refuse new work; finish in-flight; checkpoint when // in-flight drops to zero. } WorkloadAction::PersistFinalCheckpoint => { // Drain reached fence ack; atomically write the final // checkpoint somewhere durable. } WorkloadAction::Quiesce => {/* Maintenance / Returning */} WorkloadAction::ResumeFromCheckpoint => { // Node is back; resume from the most recent checkpoint // (here or on another node, depending on the runtime). } WorkloadAction::Halt => { // Fault — fence ack lost or unexpected transition. Refuse // until an operator forensically clears. }}# Ok::<(), grafos_scheduler_service::NodeModeTransitionError>(())Design
The recipe splits the policy into three orthogonal pieces:
validate_transition— checks the transition against the scheduler’s state machine (NodeMode::legal_transition_to). This is the same predicate the scheduler enforces on the producer side, so a workload that validates here rejects exactly the same garbage events the scheduler would.action_for_transition— pure mapping from(from, to)to aWorkloadAction. No state, no side effects. Easy to unit-test in isolation and easy to evolve when new transitions are added.WorkloadLifecycle::apply— composes validate + action, updates the in-process state (current_mode,accepts_new_work,has_persisted_checkpoint), and surfaces the action for the runtime to execute.
The WorkloadAction enum is closed-set: an unexpected transition
maps to Halt, not Ignore. This is the fail-closed default —
a workload running against an unknown state machine refuses to
guess what to do.
The FenceLost fault path is explicit. The scheduler only emits
Drained → FenceLost when the fence ack went missing. Two
transitions out of FenceLost are legal — back to Draining
(re-attempt) and to Drained (operator declared the prior fence
valid forensically) — but the workload halts in either case
until the operator’s intervention clears the state. The recipe
encodes this: any transition where from == FenceLost returns
WorkloadAction::Halt.
Failure Modes
- Illegal transition delivered (event-bus glitch, stale
message):
NodeModeTransitionError::IllegalTransition { from, to }returned fromapply. The workload’s lifecycle state does NOT mutate. - Fence ack lost (
Drained → FenceLost): workload halts. The runtime should surface this on the operator alert channel. - Unexpected legal transition (a pair the state machine
accepts but the recipe doesn’t model — e.g. a future
addition): maps to
WorkloadAction::Haltso the workload doesn’t silently misinterpret. - Idempotent same-state delivery: a duplicate
Accepting → Accepting(or any other same-state pair) is accepted by the state machine, returnsWorkloadAction::Accept. Polling loops re-applying observed state don’t synthesize failures.
Tests
Run it with:
cargo test -p cookbook-recipe-58-workload-on-node-drainSix tests cover the full drain-and-rejoin lifecycle (5 transitions,
each producing the right WorkloadAction), FenceLost halting,
FenceLost recovery still halting (workload refuses to act until
operator clears), illegal-transition rejection with typed error,
idempotent same-state update accepted, and a
validate_transition parity check against the state machine.
Adaptation Notes
- Checkpoint payload:
WorkloadLifecyclecarries only the workload-lifecycle bits. Production callers wrap it with the domain-specific checkpoint state (in-flight queue positions, partial results, retry counters). The recipe’shas_persisted_checkpointboolean signals that the workload CALLED its persist hook; the runtime is responsible for what’s actually durable. - Cross-node migration:
WorkloadAction::ResumeFromCheckpointdoesn’t pin “where” — that’s the runtime adapter’s call. A typical adapter has its own placement input (the same node post-Returning → Accepting, or a different node when the drained one stays inMaintenance). - Operator-facing telemetry: emit the transition + chosen
WorkloadActionas a typed event (snake_caseas_str()— seedocs/operations/siem-vocabulary-cookbook.md) so the dashboard surface shows drain progress per workload. - Multi-node workloads: the recipe scopes one workload to
one node. Workloads that span multiple nodes (replicated
services, distributed batches) run one
WorkloadLifecycleper node and aggregate the action stream — e.g. when a 3-replica service sees one node hitCheckpointAndDrain, the orchestrator can pre-warm a fourth replica before the drain completes.
See also:
crates/grafos-scheduler-service/src/lib.rs—NodeMode,NodeModeTransition,NodeModeTransitionError,legal_transition_to.docs/operations/scheduler-features.md§ “Node mode lifecycle” — operator-facing reference + state-machine diagram.docs/operations/siem-vocabulary-cookbook.md— log filters keyed onkind == "node_mode_transition".grafos admin manage-node --node <id> --mode <name>— the CLI surface operators use to drive the state machine.