Recipe 28: Stale-Write Rejection in Distributed Leader Election
Situation
In a distributed leader election, the most dangerous failure mode is a stale leader continuing to write after a new leader has been elected. Network partitions, GC pauses, and slow I/O can all cause a leader to believe it is still in charge when it is not.
Traditional approaches use heartbeats and timeouts, but timeouts are a guess about failure. If the timeout is too short, you get false failovers. If it is too long, you have a window where two leaders coexist. You need a mechanism that mechanically rejects writes from stale leaders — not by policy or timeout, but by epoch comparison.
The grafOS fencing primitives give you this. A FenceGuard tracks the current epoch. Every write carries
its epoch. Stale writes are rejected at the point of application, regardless of when they arrive.
What You Build
A leader election system where:
FabricMutexprovides the distributed lock for leader electionFenceGuardtracks the leadership epochFenced<WriteOp>tags every write with the epoch it was produced under- When a leader’s mutex lease expires and a new leader calls
guard.advance(), all delayed writes from the old leader are rejected withStaleEpochError FabricHashMapstores the shared state that leaders write to
Building Blocks
grafos_fence::{FenceGuard, FenceEpoch, Fenced, StaleEpochError}— sourcegrafos_sync::FabricMutex— sourcegrafos_collections::map::FabricHashMap— source
Design
Epoch-Based Fencing
Every leadership term has a monotonically increasing epoch. When leader A fails and leader B takes over, B advances the epoch. Any write tagged with A’s epoch is now stale.
The key insight: the check is at the write target, not at the writer. The writer may believe it is still the leader. The guard does not care what the writer believes. It checks the epoch number. If the number is behind the current epoch, the write is rejected.
Leader A (epoch 3) writes ──────────────────▶ Guard (epoch 4): REJECTEDLeader B (epoch 4) writes ──────────────────▶ Guard (epoch 4): ACCEPTEDMutex as Failure Detector
FabricMutex is backed by a memory lease. If the leader crashes, the lease expires and the mutex
auto-releases. The new leader acquires the mutex, reads the current epoch from the protected value,
bumps it, and writes the new epoch back. This combines failure detection (lease expiry) with epoch
advancement (guard.advance()) in a single atomic operation.
Write Path
Every write to shared state goes through:
- Tag the write with the current epoch:
Fenced::new(my_epoch, write_op) - At the shared state, check:
guard.check(fenced.epoch())? - If the check passes, apply the write
- If the check fails, drop the write silently (or log it)
There is no retry. A stale write is permanently stale. The writer must re-acquire leadership (a new epoch) before it can write again.
Walkthrough (Implementation Sketch)
1. Set Up the Shared State
use grafos_fence::{FenceGuard, FenceEpoch, Fenced};use grafos_sync::FabricMutex;use grafos_collections::map::FabricHashMap;use grafos_std::mem::MemBuilder;
// Shared state: a map of assignmentslet state_lease = MemBuilder::new().min_bytes(8192).acquire()?;let mut assignments: FabricHashMap<u64, u64> = FabricHashMap::new(state_lease, 64)?;
// The fencing guard starts at epoch 0let mut guard = FenceGuard::new(FenceEpoch::zero());2. Acquire Leadership
// Leader election via distributed mutexlet lock_lease = MemBuilder::new().min_bytes(256).acquire()?;let mtx = FabricMutex::new(lock_lease, 0, 0u64)?;
// Acquire the lock — blocks until available (or lease expires on previous holder)let mut lock_guard = mtx.lock(my_node_id, 30_000)?; // 30s timeout
// Advance the epoch on leadership acquisitionlet my_epoch = guard.advance();
// Store the epoch in the mutex so the next leader knows the current value*lock_guard = my_epoch.value();3. Write Under the Current Epoch
// Tag every write with the leadership epochlet write = Fenced::new(my_epoch, (worker_id, task_id));
// At the shared state, verify the epoch before applyingguard.check(write.epoch())?; // Returns StaleEpochError if behind
let (worker, task) = write.value();assignments.put(worker, task)?;4. Leadership Failover
// Leader A's lease expires. Leader B acquires the mutex.let mut lock_guard = mtx.lock(leader_b_id, 30_000)?;
// Read the previous epoch and advancelet prev_epoch = FenceEpoch::new(*lock_guard);guard = FenceGuard::new(prev_epoch);let new_epoch = guard.advance();*lock_guard = new_epoch.value();
// From this point, all writes from leader A are rejected5. Stale Write Arrives — Rejected
// Leader A's delayed write arrives (it was in flight during failover)let stale_write = Fenced::new(old_epoch, (worker_99, task_42));
// The guard rejects itmatch guard.check(stale_write.epoch()) { Ok(()) => unreachable!(), // old epoch < current epoch Err(e) => { // StaleEpochError { expected: 5, got: 4 } // Write is silently dropped. No corruption. }}
// Leader B's write succeedslet fresh_write = Fenced::new(new_epoch, (worker_99, task_99));guard.check(fresh_write.epoch())?; // Okassignments.put(worker_99, task_99)?;Failure Modes
- Split-brain (two leaders briefly): both believe they are leader. Fencing makes this safe: the one with the higher epoch wins. The one with the lower epoch has all its writes rejected at the guard. This is the entire point of the pattern.
- Epoch overflow:
FenceEpochisu64. At one leadership change per second, overflow takes 584 billion years. - Guard and mutex on different nodes: the guard must be co-located with (or applied at) the shared state, not at the writer. If the guard is at the writer, a crashed writer cannot advance it.
- Lease renewal race: if the leader renews its mutex lease at the last moment and a new leader has already advanced the epoch, the old leader’s next write is rejected. This is correct — the epoch is the authority, not the lease.
Observability
- Leadership changes (epoch advancements) with timestamps
- Stale write rejections per epoch (a burst after failover is normal; sustained rejections indicate a confused old leader)
- Mutex acquisition latency (time to failover)
- Epoch gap between rejected writes and current epoch (large gaps indicate very delayed writes)
Variations
- Multi-resource fencing: separate
FenceGuardper resource partition. Leader failover on one partition does not affect writes to other partitions. - Read fencing: tag reads with epochs too. A stale leader reading under an old epoch gets stale data — which might be acceptable, or might not, depending on your consistency requirements.
- Epoch bridging:
FenceEpoch::from_generation()bridges grafOS rewrite-plan generations to fence epochs, so topology changes automatically fence stale writes. - Audit trail: log every
StaleEpochErrorwith the writer identity and epoch for post-incident analysis of split-brain events.