Recipe 27: Resilient Session Store With Managed Lease Renewal

Situation

Web session stores need to persist user state for the duration of a session (minutes to hours) and clean up automatically when the session ends. Traditional approaches use Redis with TTL, but failover is complex: sentinel election, replica promotion, split-brain risk. You want session data backed by fabric memory with automatic renewal while the user is active and automatic cleanup when they leave.

The lease model maps directly to session semantics. A session is a lease. An active user renews the lease. An idle user lets it expire. No explicit “delete session” call needed. No orphaned sessions accumulating in a database.

Cross-Domain Boundary

This recipe is a lease-local session store. It is resilient to ordinary session expiry and stale writes, but a session stored only in a local FabricKvStore is not automatically cross-region or cross-provider.

For session continuation across failure domains, model the session as a replicated resource:

  • append session events to a ReplicatedFabricLog;
  • project current session state into a ReplicatedMap;
  • store large session snapshots through replicated checkpoint/object resources;
  • use idempotency keys for externally visible effects;
  • authorize regions/providers through PlacementPolicy rather than relying on session affinity.

What You Build

A session store where:

  • Each session is backed by a FabricKvStore with per-key TTL.
  • A RenewalManager with an explicit RenewalPolicy (threshold, jitter, backoff) keeps active sessions alive on the happy path: while the user is active, the lease renews before its TTL elapses.
  • RenewalSummary.near_expiry triggers “session expiring soon” warnings to clients.
  • FenceGuard prevents TOCTOU on session reacquisition after expiry — a new session at the same key gets a higher epoch, and stale writes from the old session are rejected.
  • The session executor observes the typed RevokeState lifecycle on its session leases and runs the cooperative checkpoint path when the scheduler initiates a revoke (preemption, drain, fence): Active → RevokeWarning → GraceRunning → CheckpointReported → Torndown. A session that responds to the warning by persisting during the grace lets the scheduler take the cooperative path; one that ignores it gets force-torn-down with whatever wasn’t persisted lost.

Building Blocks

  • grafos_leasekit::{RenewalManager, RenewalPolicy, RenewalSummary, Backoff}
  • grafos_kv::{KvBuilder, FabricKvStore}
  • grafos_fence::{FenceGuard, FenceEpoch}
  • grafos_core::RevokeState — typed 9-state lifecycle the scheduler drives on every fenced/preempted lease. The session executor pattern-matches on transitions, not strings.

Design

Session as Lease

Each user session maps to a set of KV entries with a shared TTL. The TTL is the session timeout (e.g., 30 minutes). Every user action resets the TTL by renewing the backing lease. When the user stops interacting, renewals stop and the session data expires automatically.
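
The session-as-lease idea can be illustrated without the fabric at all. The following is a minimal sketch using only the standard library; ToySessionStore and its methods are hypothetical names invented for this illustration and are not part of grafos_kv. It shows the two behaviours described above: every write pushes the expiry forward, and a session that is not touched simply stops being readable.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a TTL-backed session store.
/// Times are plain seconds supplied by the caller.
pub struct ToySessionStore {
    ttl_secs: u64,
    // session id -> (payload, absolute expiry time in seconds)
    entries: HashMap<String, (Vec<u8>, u64)>,
}

impl ToySessionStore {
    pub fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, entries: HashMap::new() }
    }

    /// Create or update a session; every write resets the TTL.
    pub fn touch(&mut self, id: &str, data: &[u8], now: u64) {
        self.entries
            .insert(id.to_string(), (data.to_vec(), now + self.ttl_secs));
    }

    /// Return the payload only while the lease is still live.
    pub fn get(&self, id: &str, now: u64) -> Option<&[u8]> {
        match self.entries.get(id) {
            Some((data, expires_at)) if now < *expires_at => Some(data.as_slice()),
            _ => None,
        }
    }

    /// Drop expired entries and report how many were evicted.
    pub fn evict_expired(&mut self, now: u64) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, (_, expires_at)| now < *expires_at);
        before - self.entries.len()
    }
}
```

With a 1800-second TTL, a session touched at t=0 and again at t=1000 stays readable until t=2800, and no delete call is ever issued.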

Renewal Policy

RenewalPolicy controls when to renew:

  • threshold: the fraction of TTL remaining at which renewal triggers (e.g., 0.3 means renew when 30% of the TTL remains)
  • jitter: spreads renewal times to prevent thundering herds when many sessions have the same TTL
  • backoff: retries transient renewal failures with exponentially increasing delay
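
To make the three knobs concrete, here is a minimal sketch of the arithmetic they imply. PolicySketch and its fields are hypothetical and are not the grafos_leasekit types; the sketch assumes that jitter only ever moves a renewal earlier, and it takes the jitter source as a plain number so the result is reproducible.

```rust
/// Hypothetical policy mirroring the three knobs described above.
pub struct PolicySketch {
    /// Renew when this fraction of the TTL remains.
    pub threshold: f64,
    /// Upper bound of the per-lease jitter, in seconds.
    pub max_jitter_secs: u64,
    /// First retry delay, in seconds.
    pub backoff_base_secs: u64,
    /// Largest retry delay, in seconds.
    pub backoff_cap_secs: u64,
}

impl PolicySketch {
    /// Time at which renewal should start for a lease.
    /// `jitter_seed` stands in for a random number.
    pub fn renew_at(&self, expires_at: u64, ttl_secs: u64, jitter_seed: u64) -> u64 {
        let window = (ttl_secs as f64 * self.threshold).round() as u64;
        let jitter = jitter_seed % self.max_jitter_secs.saturating_add(1);
        // Assumption: jitter moves the renewal earlier, never later.
        expires_at.saturating_sub(window).saturating_sub(jitter)
    }

    /// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
    pub fn backoff_delay(&self, attempt: u32) -> u64 {
        let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
        self.backoff_base_secs
            .saturating_mul(factor)
            .min(self.backoff_cap_secs)
    }
}
```

For an 1800-second TTL and a threshold of 0.3, the renewal window is 540 seconds, so a lease expiring at t=1800 starts renewing at t=1260 before jitter is applied.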

Expiry Warning

RenewalSummary returned by tick() includes a near_expiry list of lease IDs approaching their TTL. The application can use this to push “session expiring” warnings to connected clients.
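
How a lease ends up on a near-expiry list can be sketched as a simple filter. The function below is a hypothetical illustration, not the RenewalManager implementation: it assumes leases are tracked as (lease_id, expires_at) pairs and that the warning window is a fixed number of seconds.

```rust
/// Hypothetical helper: return the ids of leases that are still live
/// but have no more than `warn_window_secs` left.
pub fn near_expiry(leases: &[(u64, u64)], now: u64, warn_window_secs: u64) -> Vec<u64> {
    leases
        .iter()
        .filter(|(_, expires_at)| {
            // Already-expired leases are excluded; they need no warning.
            *expires_at > now && *expires_at - now <= warn_window_secs
        })
        .map(|(lease_id, _)| *lease_id)
        .collect()
}
```

A lease with 250 seconds left is reported under a 300-second window, while an expired lease or one with an hour left is not.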

Fencing After Expiry

If a session expires and the same user immediately creates a new session, there is a TOCTOU window where a delayed write from the old session could corrupt the new session’s data. FenceGuard eliminates this: each session carries a FenceEpoch. On session creation, guard.advance() bumps the epoch. Writes tagged with the old epoch are rejected with StaleEpochError.
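
The epoch check itself is small. The sketch below models it with standard-library types only; Epoch, EpochGuard, and StaleEpoch are hypothetical stand-ins for illustration and do not reproduce the grafos_fence API. It assumes epochs are monotonically increasing integers and that a write is stale when its epoch is lower than the guard's current one.

```rust
/// Hypothetical epoch value; higher means newer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Epoch(pub u64);

/// Returned when a write carries an epoch older than the current one.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleEpoch {
    pub seen: Epoch,
    pub current: Epoch,
}

/// Hypothetical guard that tracks the current epoch for one session key.
pub struct EpochGuard {
    current: Epoch,
}

impl EpochGuard {
    pub fn new() -> Self {
        Self { current: Epoch(0) }
    }

    /// Called on session creation: bump and return the new epoch.
    pub fn advance(&mut self) -> Epoch {
        self.current = Epoch(self.current.0 + 1);
        self.current
    }

    /// Accept a write only if its epoch is not behind the current one.
    pub fn check(&self, epoch: Epoch) -> Result<(), StaleEpoch> {
        if epoch < self.current {
            Err(StaleEpoch { seen: epoch, current: self.current })
        } else {
            Ok(())
        }
    }
}
```

A write tagged by the first session passes until the second session advances the guard; after that the same tag is rejected, which closes the window described above.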

Walkthrough (Implementation Sketch)

1. Create the Session Store

use grafos_kv::{KvBuilder, FabricKvStore};
use grafos_leasekit::{RenewalManager, RenewalPolicy};
use grafos_fence::{FenceGuard, FenceEpoch};

let mut sessions: FabricKvStore = KvBuilder::new()
    .hot_buckets(256)
    .default_ttl_secs(1800) // 30-minute session timeout
    .build()?;
let mut renewals = RenewalManager::new();
let mut fence = FenceGuard::new(FenceEpoch::zero());

2. Create a Session

// `unix_time_secs` and `session_hash` are assumed to be application-supplied
// helpers (not shown).
let session_id = b"sess:abc123";
let user_data = b"{\"user_id\": 42, \"role\": \"admin\"}";
// Store session data with explicit TTL
sessions.put_with_ttl(session_id, user_data, 1800)?;
// Register for renewal tracking
let now = unix_time_secs();
let policy = RenewalPolicy::default().with_threshold(0.3);
renewals.register(session_hash(session_id), now + 1800, policy);
// Advance the fence and keep the returned epoch with the session;
// later writes are tagged with it (see step 5).
let epoch = fence.advance();

3. Renew on User Activity

// User makes a request: renew the session.
// `updated_data` is the new serialized session payload.
sessions.put_with_ttl(session_id, updated_data, 1800)?;
// Periodic tick drives lease renewal (here, 20 minutes after creation)
let summary = renewals.tick(now + 1200);
// Check for sessions approaching expiry
for lease_id in &summary.near_expiry {
    // Push "session expiring soon" to the client via websocket
    notify_client(*lease_id, "Session expires in 5 minutes");
}

4. Read Session Data

// Every request checks the session
match sessions.get(session_id)? {
    Some(data) => {
        // Session is valid: process the request
        let user: UserData = deserialize(&data)?;
    }
    None => {
        // Session expired: redirect to login
        return Err(SessionExpired);
    }
}

5. Fence Against Stale Writes

use grafos_fence::Fenced;

// Old session's delayed write arrives, tagged with the epoch that session
// was created under (`old_epoch`, captured before the fence advanced).
let stale_write = Fenced::new(old_epoch, b"stale data");
// Guard rejects it
match fence.check(stale_write.epoch()) {
    Ok(()) => {
        // Epoch is current: apply the write
        sessions.put(session_id, stale_write.value())?;
    }
    Err(_stale) => {
        // Epoch is behind: drop the write and count the rejection
        // (the stale-write metric under Observability)
    }
}

6. Session Expiry and Cleanup

// Periodic maintenance
sessions.tick()?; // evicts expired KV entries
// No explicit cleanup needed — expired sessions are gone.
// The backing memory lease returns to the fabric automatically.

7. Cooperative Checkpoint on Revoke

Idle expiry is one termination path. The other is revoke — the scheduler decides this session’s lease must end before its TTL (preemption for a higher-priority workload, node drain, fence on quota violation, etc.). The scheduler drives every revoked lease through the typed RevokeState machine:

Active ─warning→ RevokeWarning ─grace→ GraceRunning
  ┌─ checkpoint complete? ──no→ ForcedTeardown ──┐
  │                                              │
 yes                                             ▼
  ▼                                           Torndown
CheckpointReported ──────────────────────────────┘

A session executor that observes the transitions can take the cooperative path:

use cookbook_recipe_27_resilient_session_store::{
    action_for_revoke_transition, CheckpointAction, SessionLifecycle,
};
use grafos_core::RevokeState;

let mut session = SessionLifecycle::fresh();
// Pubsub adapter delivers a transition on the session's lease.
match session.observe(RevokeState::Active, RevokeState::RevokeWarning) {
    CheckpointAction::StopAcceptingWrites => {
        // Refuse new POSTs at the HTTP layer.
    }
    _ => unreachable!(),
}
// Grace begins. Persist NOW so the cooperative path can finish.
match session.observe(RevokeState::RevokeWarning, RevokeState::GraceRunning) {
    CheckpointAction::PersistCheckpointDuringGrace => {
        // sessions.checkpoint_to_block(...)
        // Then notify the scheduler the checkpoint is durable.
    }
    _ => unreachable!(),
}
// Scheduler advances to CheckpointReported because we reported.
// Yields AcknowledgeCooperativeTeardown: the cooperative path is winning.
let _ = session.observe(RevokeState::GraceRunning, RevokeState::CheckpointReported);
// Terminal transition. Yields SessionTerminated.
let _ = session.observe(RevokeState::CheckpointReported, RevokeState::Torndown);

A session executor that ignores the warning gets the forced path instead: the grace window ends with no checkpoint reported, and the scheduler moves to ForcedTeardown. Whatever wasn’t persisted is lost. The executor should surface this on the operator alert channel, because it is the distinct “this session’s data is gone” signal, as opposed to the routine “this session timed out idle” path.
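
The mapping from observed transition to executor action can be modelled as a single match on the (from, to) pair. The sketch below is a standalone illustration, not the implementation in the recipe crate: State covers only six of the lifecycle states, and the AlertUnpersistedStateLost and Unexpected variants are hypothetical additions so that the forced path and unknown transitions produce a value rather than a panic.

```rust
/// Subset of the revoke lifecycle used in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Active,
    RevokeWarning,
    GraceRunning,
    CheckpointReported,
    ForcedTeardown,
    Torndown,
}

/// What the session executor should do in response to a transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    StopAcceptingWrites,
    PersistCheckpointDuringGrace,
    AcknowledgeCooperativeTeardown,
    /// Hypothetical: raise the "session data is gone" operator alert.
    AlertUnpersistedStateLost,
    SessionTerminated,
    /// Hypothetical: a transition this sketch does not model.
    Unexpected,
}

/// Map an observed (from, to) transition to an executor action.
pub fn action_for(from: State, to: State) -> Action {
    use State::*;
    match (from, to) {
        (Active, RevokeWarning) => Action::StopAcceptingWrites,
        (RevokeWarning, GraceRunning) => Action::PersistCheckpointDuringGrace,
        (GraceRunning, CheckpointReported) => Action::AcknowledgeCooperativeTeardown,
        // Grace ended without a reported checkpoint.
        (GraceRunning, ForcedTeardown) => Action::AlertUnpersistedStateLost,
        (CheckpointReported, Torndown) | (ForcedTeardown, Torndown) => {
            Action::SessionTerminated
        }
        _ => Action::Unexpected,
    }
}
```

Matching on the pair rather than on the target state alone keeps the cooperative and forced routes into Torndown distinguishable at the point where the alert decision is made.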

Failure Modes

  • Renewal failure (fabric unreachable): Backoff retries with exponential delay. If all retries fail, the session expires on its TTL. The user sees “session expired” — which is the correct behavior when the infrastructure is degraded.
  • Node hosting session data departs: Disconnected error on next read. The session is lost. The user re-authenticates and gets a new session on a different node. The fence epoch prevents old-session writes from contaminating the new session.
  • Revoke during active session — cooperative path: scheduler signals RevokeWarning → GraceRunning. Session executor stops accepting writes, persists state during grace, reports CheckpointReported. Scheduler completes the cooperative teardown to Torndown.
  • Revoke during active session — forced path: session executor ignored the warning, or persistence took longer than the grace window. Scheduler skips to ForcedTeardown. In-memory state is lost. Surface this as a distinct alert from “session timed out idle”.
  • Fenced terminal state: a session reached Fenced (typically after ForcedTeardown if the teardown itself failed). The slot is quarantined — no new session may reuse it until the fence is cleared out-of-band by an operator.
  • Failed-closed terminal state: a session reached FailedClosed from Expired. Hard-rejected; the executor should NOT re-attempt creation under the same identity.
  • Clock skew: tick() depends on monotonic timestamps. If the clock jumps backward, sessions may appear to have more remaining TTL than they should. Use a monotonic source, not wall-clock time.
  • Thundering herd when many sessions share a TTL: jitter in RenewalPolicy spreads renewal requests over time. Each session renews independently.
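
For the clock-skew item, one way to obtain timestamps that cannot jump backward is to derive them from std::time::Instant. The sketch below is a hypothetical helper, not part of any grafos crate. Note the trade-off: Instant values are only meaningful inside one process, so seconds produced this way cannot be compared across nodes or restarts.

```rust
use std::time::Instant;

/// Hypothetical clock reporting whole seconds since it was created.
/// Backed by `Instant`, which is monotonic within a process.
pub struct MonotonicSecs {
    origin: Instant,
}

impl MonotonicSecs {
    pub fn start() -> Self {
        Self { origin: Instant::now() }
    }

    /// Seconds elapsed since `start()`; never decreases.
    pub fn now_secs(&self) -> u64 {
        self.origin.elapsed().as_secs()
    }

    /// Expiry time for a lease granted now with the given TTL.
    pub fn deadline(&self, ttl_secs: u64) -> u64 {
        self.now_secs() + ttl_secs
    }
}
```

Values from now_secs() can be fed to a tick-style loop in place of wall-clock seconds, provided registration and ticking use the same clock instance.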

Observability

  • Active session count (number of tracked leases in RenewalManager).
  • Renewal success/failure rate.
  • Near-expiry warnings issued per tick.
  • Session creation/expiry rate (helps size the KV store).
  • Stale-write rejections (count of StaleEpochError — should be rare; a spike indicates a bug).
  • Revoke-state transitions keyed on from → to per RevokeState::as_str(). SIEM rules alert on to == "forced_teardown" (cooperative path lost) and on to == "fenced" (slot quarantined). See SIEM vocabulary cookbook for the predicate library.
  • Cooperative vs forced teardown ratio per session class — a rising forced rate indicates either insufficient grace window or session executors not wired to observe the typed lifecycle.
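
The cooperative-versus-forced ratio needs only two counters. The following is a minimal sketch with hypothetical names; it assumes the executor calls one of the record methods each time a session reaches a terminal teardown, and it leaves export to a metrics system out of scope.

```rust
/// Hypothetical per-session-class teardown counters.
#[derive(Debug, Default)]
pub struct TeardownCounters {
    cooperative: u64,
    forced: u64,
}

impl TeardownCounters {
    pub fn record_cooperative(&mut self) {
        self.cooperative += 1;
    }

    pub fn record_forced(&mut self) {
        self.forced += 1;
    }

    /// Fraction of teardowns that were forced.
    /// Returns `None` until at least one teardown has been recorded.
    pub fn forced_ratio(&self) -> Option<f64> {
        let total = self.cooperative + self.forced;
        if total == 0 {
            None
        } else {
            Some(self.forced as f64 / total as f64)
        }
    }
}
```

Returning None for an empty window avoids reporting a misleading 0% forced rate for a session class that has seen no revokes at all.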

Variations

  • Persistent sessions: enable the persistence feature on grafos-kv to spill session data to block storage. Sessions survive node restarts at the cost of higher read latency.
  • Session affinity: use PlacementScorer to co-locate a user’s session with their compute leases, reducing cross-node reads.
  • Sliding window TTL: instead of a fixed 30-minute TTL, extend the TTL based on how long the user has been active. Active users get longer sessions; brief visitors expire quickly.
  • Multi-region sessions: use replicated log/map/checkpoint resources. Durable checkpoints can still be local snapshot artifacts, but they are not the cross-region replication contract by themselves.
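
One possible reading of the sliding-window variation is to set the TTL equal to the time the user has been active, bounded on both sides. The function below is a sketch under that assumption; the name and the choice of a plain clamp are illustrative, and a real policy might scale or step the value differently.

```rust
/// Hypothetical sliding-window rule: the TTL tracks how long the user
/// has been active, bounded by a floor and a ceiling.
/// Panics if `min_ttl_secs > max_ttl_secs` (behaviour of `clamp`).
pub fn sliding_ttl_secs(active_for_secs: u64, min_ttl_secs: u64, max_ttl_secs: u64) -> u64 {
    active_for_secs.clamp(min_ttl_secs, max_ttl_secs)
}
```

With a floor of 5 minutes and a ceiling of 2 hours, a visitor active for one minute gets the 5-minute floor, and a user active for an hour gets a one-hour TTL.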