Recipe 28: Stale-Write Rejection in Distributed Leader Election

Situation

In a distributed leader election, the most dangerous failure mode is a stale leader continuing to write after a new leader has been elected. Network partitions, GC pauses, and slow I/O can all cause a leader to believe it is still in charge when it is not.

Traditional approaches use heartbeats and timeouts, but timeouts are a guess about failure. If the timeout is too short, you get false failovers. If it is too long, you have a window where two leaders coexist. You need a mechanism that mechanically rejects writes from stale leaders — not by policy or timeout, but by epoch comparison.

The grafOS fencing primitives give you this. A FenceGuard tracks the current epoch. Every write carries its epoch. Stale writes are rejected at the point of application, regardless of when they arrive.

What You Build

A leader election system where:

FabricMutex provides the distributed lock for leader election
FenceGuard tracks the leadership epoch
Fenced<WriteOp> tags every write with the epoch it was produced under
When a leader’s mutex lease expires and a new leader calls guard.advance(), all delayed writes from the old leader are rejected with StaleEpochError
FabricHashMap stores the shared state that leaders write to

Building Blocks

grafos_fence::{FenceGuard, FenceEpoch, Fenced, StaleEpochError} — source
grafos_sync::FabricMutex — source
grafos_collections::map::FabricHashMap — source

Design

Epoch-Based Fencing

Every leadership term has a monotonically increasing epoch. When leader A fails and leader B takes over, B advances the epoch. Any write tagged with A’s epoch is now stale.

The key insight: the check is at the write target, not at the writer. The writer may believe it is still the leader. The guard does not care what the writer believes. It checks the epoch number. If the number is behind the current epoch, the write is rejected.

Leader A (epoch 3) writes    ──────────────────▶  Guard (epoch 4): REJECTED
Leader B (epoch 4) writes    ──────────────────▶  Guard (epoch 4): ACCEPTED

Mutex as Failure Detector

FabricMutex is backed by a memory lease. If the leader crashes, the lease expires and the mutex auto-releases. The new leader acquires the mutex, reads the current epoch from the protected value, bumps it, and writes the new epoch back. This combines failure detection (lease expiry) with epoch advancement (guard.advance()) in a single atomic operation.

Write Path

Every write to shared state goes through:

Tag the write with the current epoch: Fenced::new(my_epoch, write_op)
At the shared state, check: guard.check(fenced.epoch())?
If the check passes, apply the write
If the check fails, drop the write silently (or log it)

There is no retry. A stale write is permanently stale. The writer must re-acquire leadership (a new epoch) before it can write again.

Walkthrough (Implementation Sketch)

1. Set Up the Shared State

use grafos_fence::{FenceGuard, FenceEpoch, Fenced};
use grafos_sync::FabricMutex;
use grafos_collections::map::FabricHashMap;
use grafos_std::mem::MemBuilder;

// Shared state: a map of assignments
let state_lease = MemBuilder::new().min_bytes(8192).acquire()?;
let mut assignments: FabricHashMap<u64, u64> = FabricHashMap::new(state_lease, 64)?;

// The fencing guard starts at epoch 0
let mut guard = FenceGuard::new(FenceEpoch::zero());

2. Acquire Leadership

// Leader election via distributed mutex
let lock_lease = MemBuilder::new().min_bytes(256).acquire()?;
let mtx = FabricMutex::new(lock_lease, 0, 0u64)?;

// Acquire the lock — blocks until available (or lease expires on previous holder)
let mut lock_guard = mtx.lock(my_node_id, 30_000)?; // 30s timeout

// Advance the epoch on leadership acquisition
let my_epoch = guard.advance();

// Store the epoch in the mutex so the next leader knows the current value
*lock_guard = my_epoch.value();

3. Write Under the Current Epoch

// Tag every write with the leadership epoch
let write = Fenced::new(my_epoch, (worker_id, task_id));

// At the shared state, verify the epoch before applying
guard.check(write.epoch())?; // Returns StaleEpochError if behind

let (worker, task) = write.value();
assignments.put(worker, task)?;

4. Leadership Failover

// Leader A's lease expires. Leader B acquires the mutex.
let mut lock_guard = mtx.lock(leader_b_id, 30_000)?;

// Read the previous epoch and advance
let prev_epoch = FenceEpoch::new(*lock_guard);
guard = FenceGuard::new(prev_epoch);
let new_epoch = guard.advance();
*lock_guard = new_epoch.value();

// From this point, all writes from leader A are rejected

5. Stale Write Arrives — Rejected

// Leader A's delayed write arrives (it was in flight during failover)
let stale_write = Fenced::new(old_epoch, (worker_99, task_42));

// The guard rejects it
match guard.check(stale_write.epoch()) {
    Ok(()) => unreachable!(), // old epoch < current epoch
    Err(e) => {
        // StaleEpochError { expected: 5, got: 4 }
        // Write is silently dropped. No corruption.
    }
}

// Leader B's write succeeds
let fresh_write = Fenced::new(new_epoch, (worker_99, task_99));
guard.check(fresh_write.epoch())?; // Ok
assignments.put(worker_99, task_99)?;

Failure Modes

Split-brain (two leaders briefly): both believe they are leader. Fencing makes this safe: the one with the higher epoch wins. The one with the lower epoch has all its writes rejected at the guard. This is the entire point of the pattern.
Epoch overflow: FenceEpoch is u64. At one leadership change per second, overflow takes 584 billion years.
Guard and mutex on different nodes: the guard must be co-located with (or applied at) the shared state, not at the writer. If the guard is at the writer, a crashed writer cannot advance it.
Lease renewal race: if the leader renews its mutex lease at the last moment and a new leader has already advanced the epoch, the old leader’s next write is rejected. This is correct — the epoch is the authority, not the lease.

Observability

Leadership changes (epoch advancements) with timestamps
Stale write rejections per epoch (a burst after failover is normal; sustained rejections indicate a confused old leader)
Mutex acquisition latency (time to failover)
Epoch gap between rejected writes and current epoch (large gaps indicate very delayed writes)

Variations

Multi-resource fencing: separate FenceGuard per resource partition. Leader failover on one partition does not affect writes to other partitions.
Read fencing: tag reads with epochs too. A stale leader reading under an old epoch gets stale data — which might be acceptable, or might not, depending on your consistency requirements.
Epoch bridging: FenceEpoch::from_generation() bridges grafOS rewrite-plan generations to fence epochs, so topology changes automatically fence stale writes.
Audit trail: log every StaleEpochError with the writer identity and epoch for post-incident analysis of split-brain events.