Recipe 27: Resilient Session Store With Managed Lease Renewal
Situation
Web session stores need to persist user state for the duration of a session (minutes to hours) and clean up automatically when the session ends. Traditional approaches use Redis with TTL, but failover is complex: sentinel election, replica promotion, split-brain risk. You want session data backed by fabric memory with automatic renewal while the user is active and automatic cleanup when they leave.
The lease model maps directly to session semantics. A session is a lease. An active user renews the lease. An idle user lets it expire. No explicit “delete session” call needed. No orphaned sessions accumulating in a database.
Cross-Domain Boundary
This recipe is a lease-local session store. It is resilient to ordinary
session expiry and stale writes, but a session stored only in a local
FabricKvStore is not automatically cross-region or cross-provider.
For session continuation across failure domains, model the session as a replicated resource:
- append session events to a ReplicatedFabricLog;
- project current session state into a ReplicatedMap;
- store large session snapshots through replicated checkpoint/object resources;
- use idempotency keys for externally visible effects;
- authorize regions/providers through PlacementPolicy rather than relying on session affinity.
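The log-then-project shape in the list above can be illustrated with a small std-only sketch. Everything here is hypothetical: `SessionEvent`, its fields, and `project` are illustration names, not the ReplicatedFabricLog or ReplicatedMap APIs, and deduplicating by idempotency key is only one way such keys might be applied.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical event record; not a grafos type.
#[derive(Debug, Clone)]
pub struct SessionEvent {
    pub idempotency_key: u64,
    pub session_id: String,
    pub field: String,
    pub value: String,
}

/// Fold an append-only event log into current per-session state.
/// An event whose idempotency key was already applied is skipped, so a
/// duplicated or replayed append does not change the projection.
pub fn project(log: &[SessionEvent]) -> HashMap<String, HashMap<String, String>> {
    let mut seen = HashSet::new();
    let mut state: HashMap<String, HashMap<String, String>> = HashMap::new();
    for event in log {
        // `insert` returns false when the key was already present.
        if !seen.insert(event.idempotency_key) {
            continue;
        }
        state
            .entry(event.session_id.clone())
            .or_default()
            .insert(event.field.clone(), event.value.clone());
    }
    state
}
```

Because the projection is a pure function of the log, any region that holds the same log prefix can rebuild the same session state without relying on session affinity.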
What You Build
A session store where:
- Each session is backed by a FabricKvStore with per-key TTL.
- RenewalManager with explicit RenewalPolicy (threshold, jitter, backoff) keeps active sessions alive on the happy path (user is active, lease auto-renews before TTL).
- RenewalSummary.near_expiry triggers “session expiring soon” warnings to clients.
- FenceGuard prevents TOCTOU on session reacquisition after expiry: a new session at the same key gets a higher epoch, and stale writes from the old session are rejected.
- The session executor observes the typed RevokeState lifecycle on its session leases and runs the cooperative checkpoint path when the scheduler initiates a revoke (preemption, drain, fence): Active → RevokeWarning → GraceRunning → CheckpointReported → Torndown. A session that responds to the warning by persisting during the grace lets the scheduler take the cooperative path; one that ignores it gets force-torn-down with whatever wasn’t persisted lost.
Building Blocks
- grafos_leasekit::{RenewalManager, RenewalPolicy, RenewalSummary, Backoff}
- grafos_kv::{KvBuilder, FabricKvStore}
- grafos_fence::{FenceGuard, FenceEpoch}
- grafos_core::RevokeState: typed 9-state lifecycle the scheduler drives on every fenced/preempted lease. The session executor pattern-matches on transitions, not strings.
Design
Session as Lease
Each user session maps to a set of KV entries with a shared TTL. The TTL is the session timeout (e.g., 30 minutes). Every user action resets the TTL by renewing the backing lease. When the user stops interacting, renewals stop and the session data expires automatically.
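To make the session-as-lease idea concrete, here is a minimal std-only sketch. It assumes a hypothetical `TtlStore` type with caller-supplied timestamps; it is a stand-in for illustration, not the FabricKvStore API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a TTL-backed session store.
/// Times are plain seconds supplied by the caller.
pub struct TtlStore {
    entries: HashMap<Vec<u8>, (Vec<u8>, u64)>, // key -> (value, expires_at)
}

impl TtlStore {
    pub fn new() -> Self {
        TtlStore { entries: HashMap::new() }
    }

    /// Writing a key (re)starts its TTL, which is how user activity
    /// keeps a session alive.
    pub fn put_with_ttl(&mut self, key: &[u8], value: &[u8], ttl_secs: u64, now: u64) {
        self.entries.insert(key.to_vec(), (value.to_vec(), now + ttl_secs));
    }

    /// An entry at or past its deadline is treated as absent.
    pub fn get(&self, key: &[u8], now: u64) -> Option<&[u8]> {
        match self.entries.get(key) {
            Some((value, expires_at)) if now < *expires_at => Some(value.as_slice()),
            _ => None,
        }
    }

    /// Periodic maintenance: evict expired entries and report how many went.
    pub fn tick(&mut self, now: u64) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, (_, expires_at)| now < *expires_at);
        before - self.entries.len()
    }
}
```

No delete call appears anywhere: a session that stops being written simply ages out on the next `tick`.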
Renewal Policy
RenewalPolicy controls when to renew:
- threshold: fraction of TTL remaining before renewal triggers (e.g., 0.3 means renew when 30% of TTL remains)
- Jitter prevents thundering herds when many sessions have the same TTL
- Backoff handles transient renewal failures with exponential retry
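The three knobs above can be sketched as plain arithmetic. The `Policy` struct, its field names, and the modulo-based jitter are assumptions made for illustration; the real RenewalPolicy and Backoff types may compute these differently.

```rust
/// Hypothetical policy fields; not the grafos_leasekit RenewalPolicy.
pub struct Policy {
    pub threshold: f64,         // renew when this fraction of the TTL remains
    pub max_jitter_secs: u64,   // upper bound on the per-lease offset
    pub backoff_base_secs: u64, // delay before the first retry
    pub backoff_cap_secs: u64,  // ceiling on any retry delay
}

impl Policy {
    /// Time at which renewal becomes due for a lease that expires at
    /// `expires_at` and was granted `ttl_secs`.
    pub fn renew_at(&self, lease_id: u64, expires_at: u64, ttl_secs: u64) -> u64 {
        let window = (ttl_secs as f64 * self.threshold).round() as u64;
        // Deterministic per-lease jitter (an assumption): pull renewal a
        // little earlier so leases sharing a TTL do not renew together.
        let jitter = lease_id % (self.max_jitter_secs + 1);
        expires_at.saturating_sub(window + jitter)
    }

    /// Exponential backoff: base * 2^attempt, capped.
    pub fn backoff_delay(&self, attempt: u32) -> u64 {
        let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
        self.backoff_base_secs
            .saturating_mul(factor)
            .min(self.backoff_cap_secs)
    }
}
```

With a threshold of 0.3 and an 1800-second TTL, the renewal window is 540 seconds, so a lease with zero jitter becomes due 1260 seconds after it was granted.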
Expiry Warning
RenewalSummary returned by tick() includes a near_expiry list of lease IDs approaching their TTL.
The application can use this to push “session expiring” warnings to connected clients.
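A rough sketch of how a tick might sort leases into a near-expiry list follows. Only the `near_expiry` field name comes from the recipe; the `Summary` struct, the `expired` field, the `classify` function, and the warning-window parameter are hypothetical.

```rust
/// Hypothetical summary shape; only `near_expiry` is named in the recipe.
#[derive(Debug, Default, PartialEq)]
pub struct Summary {
    pub near_expiry: Vec<u64>, // lease IDs inside the warning window
    pub expired: Vec<u64>,     // lease IDs already past their deadline
}

/// Classify tracked leases, given as (lease_id, expires_at) pairs.
pub fn classify(leases: &[(u64, u64)], now: u64, warn_window_secs: u64) -> Summary {
    let mut summary = Summary::default();
    for &(lease_id, expires_at) in leases {
        if now >= expires_at {
            summary.expired.push(lease_id);
        } else if expires_at - now <= warn_window_secs {
            summary.near_expiry.push(lease_id);
        }
    }
    summary
}
```

The application would iterate `near_expiry` after each tick and push a warning to whichever client owns each lease.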
Fencing After Expiry
If a session expires and the same user immediately creates a new session, there is a TOCTOU window where
a delayed write from the old session could corrupt the new session’s data. FenceGuard eliminates this:
each session carries a FenceEpoch. On session creation, guard.advance() bumps the epoch. Writes tagged
with the old epoch are rejected with StaleEpochError.
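The epoch check can be sketched in a few lines of std-only Rust. The `Epoch`, `StaleEpoch`, and `Guard` types below are hypothetical mirrors of the names in this recipe, not the grafos_fence API, and the rule "reject anything older than the current epoch" is an assumption about how the guard compares epochs.

```rust
/// Hypothetical epoch type; ordering follows the wrapped counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Epoch(pub u64);

/// Hypothetical rejection carrying both epochs for logging.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleEpoch {
    pub seen: Epoch,
    pub current: Epoch,
}

pub struct Guard {
    current: Epoch,
}

impl Guard {
    pub fn new() -> Self {
        Guard { current: Epoch(0) }
    }

    /// Called on session creation: bump and return the new epoch.
    pub fn advance(&mut self) -> Epoch {
        self.current = Epoch(self.current.0 + 1);
        self.current
    }

    /// A write tagged with an epoch older than the current one is stale.
    pub fn check(&self, epoch: Epoch) -> Result<(), StaleEpoch> {
        if epoch < self.current {
            Err(StaleEpoch { seen: epoch, current: self.current })
        } else {
            Ok(())
        }
    }
}
```

The check has to happen at the point where the write is applied; checking earlier and writing later would reopen the same TOCTOU window.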
Walkthrough (Implementation Sketch)
1. Create the Session Store
    use grafos_kv::{KvBuilder, FabricKvStore};
    use grafos_leasekit::{RenewalManager, RenewalPolicy};
    use grafos_fence::{FenceGuard, FenceEpoch};

    let mut sessions: FabricKvStore = KvBuilder::new()
        .hot_buckets(256)
        .default_ttl_secs(1800) // 30-minute session timeout
        .build()?;

    let mut renewals = RenewalManager::new();
    let mut fence = FenceGuard::new(FenceEpoch::zero());

2. Create a Session
    let session_id = b"sess:abc123";
    let user_data = b"{\"user_id\": 42, \"role\": \"admin\"}";

    // Store session data with explicit TTL
    sessions.put_with_ttl(session_id, user_data, 1800)?;

    // Register for renewal tracking
    let now = unix_time_secs();
    let policy = RenewalPolicy::default().with_threshold(0.3);
    renewals.register(session_hash(session_id), now + 1800, policy);

    // Record the session epoch for fencing
    let epoch = fence.advance();

3. Renew on User Activity
    // User makes a request: renew the session
    sessions.put_with_ttl(session_id, updated_data, 1800)?;

    // Periodic tick drives lease renewal
    let summary = renewals.tick(now + 1200);

    // Check for sessions approaching expiry
    for lease_id in &summary.near_expiry {
        // Push "session expiring soon" to the client via websocket
        notify_client(*lease_id, "Session expires in 5 minutes");
    }

4. Read Session Data
    // Every request checks the session
    match sessions.get(session_id)? {
        Some(data) => {
            // Session is valid: process the request
            let user: UserData = deserialize(&data)?;
        }
        None => {
            // Session expired: redirect to login
            return Err(SessionExpired);
        }
    }

5. Fence Against Stale Writes
    use grafos_fence::Fenced;

    // Old session's delayed write arrives
    let stale_write = Fenced::new(old_epoch, b"stale data");

    // Guard rejects it
    match fence.check(stale_write.epoch()) {
        Ok(()) => {
            // Epoch is current: apply the write
            sessions.put(session_id, stale_write.value())?;
        }
        Err(_stale) => {
            // Epoch is behind: silently drop the write
        }
    }

6. Session Expiry and Cleanup
    // Periodic maintenance
    sessions.tick()?; // evicts expired KV entries

    // No explicit cleanup needed: expired sessions are gone.
    // The backing memory lease returns to the fabric automatically.

7. Cooperative Checkpoint on Revoke
Idle expiry is one termination path. The other is revoke — the
scheduler decides this session’s lease must end before its TTL
(preemption for a higher-priority workload, node drain, fence on
quota violation, etc.). The scheduler drives every revoked lease
through the typed RevokeState machine:
    Active ─warning→ RevokeWarning ─grace→ GraceRunning
                                                │
      ┌────────────────────────────── checkpoint complete? ──no→ ForcedTeardown ──┐
      │                                                                           │
     yes                                                                          ▼
      ▼                                                                       Torndown
    CheckpointReported ───────────────────────────────────────────────────────────┘

A session executor that observes the transitions can take the cooperative path:
    use cookbook_recipe_27_resilient_session_store::{
        action_for_revoke_transition, CheckpointAction, SessionLifecycle,
    };
    use grafos_core::RevokeState;

    let mut session = SessionLifecycle::fresh();

    // Pubsub adapter delivers a transition on the session's lease.
    match session.observe(RevokeState::Active, RevokeState::RevokeWarning) {
        CheckpointAction::StopAcceptingWrites => {
            // Refuse new POSTs at the HTTP layer.
        }
        _ => unreachable!(),
    }

    // Grace begins. Persist NOW so the cooperative path can finish.
    match session.observe(RevokeState::RevokeWarning, RevokeState::GraceRunning) {
        CheckpointAction::PersistCheckpointDuringGrace => {
            // sessions.checkpoint_to_block(...)
            // Then notify the scheduler the checkpoint is durable.
        }
        _ => unreachable!(),
    }

    // Scheduler advances to CheckpointReported because we reported.
    let _ = session.observe(RevokeState::GraceRunning, RevokeState::CheckpointReported);
    // AcknowledgeCooperativeTeardown: the cooperative path is winning.

    // Terminal.
    let _ = session.observe(RevokeState::CheckpointReported, RevokeState::Torndown);
    // SessionTerminated.

A session executor that ignores the warning gets the forced path
instead — the scheduler skips grace and goes directly to
ForcedTeardown. Whatever wasn’t persisted is lost. The executor
should surface this on the operator alert channel; it is the
distinct “this session’s data is gone” signal vs. the routine “this
session timed out idle” path.
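The mapping from a revoke transition to an executor action can be sketched as a pure function. The `State` and `Action` enums below are local, hypothetical mirrors: `State` covers only the six states in the diagram above rather than the full RevokeState type, and the `AlertStateLost` and `Unexpected` variants are illustration names that do not come from the recipe crate.

```rust
/// Local mirror of the states shown in the diagram; not grafos_core::RevokeState.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Active,
    RevokeWarning,
    GraceRunning,
    CheckpointReported,
    ForcedTeardown,
    Torndown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    StopAcceptingWrites,
    PersistCheckpointDuringGrace,
    AcknowledgeCooperativeTeardown,
    SessionTerminated,
    AlertStateLost, // hypothetical: raise the "data is gone" operator alert
    Unexpected,     // hypothetical: transition this sketch does not model
}

/// Decide what the session executor should do for one observed transition.
pub fn action_for(from: State, to: State) -> Action {
    use State::*;
    match (from, to) {
        (Active, RevokeWarning) => Action::StopAcceptingWrites,
        (RevokeWarning, GraceRunning) => Action::PersistCheckpointDuringGrace,
        (GraceRunning, CheckpointReported) => Action::AcknowledgeCooperativeTeardown,
        // Any entry into ForcedTeardown means unpersisted state was lost.
        (_, ForcedTeardown) => Action::AlertStateLost,
        (CheckpointReported, Torndown) | (ForcedTeardown, Torndown) => Action::SessionTerminated,
        _ => Action::Unexpected,
    }
}
```

Matching on the `(from, to)` pair rather than on `to` alone keeps the forced path distinguishable from the cooperative one even though both end in Torndown.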
Failure Modes
- Renewal failure (fabric unreachable): Backoff retries with exponential delay. If all retries fail, the session expires on its TTL. The user sees “session expired”, which is the correct behavior when the infrastructure is degraded.
- Node hosting session data departs: Disconnected error on next read. The session is lost. The user re-authenticates and gets a new session on a different node. The fence epoch prevents old-session writes from contaminating the new session.
- Revoke during active session, cooperative path: scheduler signals RevokeWarning → GraceRunning. Session executor stops accepting writes, persists state during grace, reports CheckpointReported. Scheduler completes the cooperative teardown to Torndown.
- Revoke during active session, forced path: session executor ignored the warning, or persistence took longer than the grace window. Scheduler skips to ForcedTeardown. In-memory state is lost. Surface this as a distinct alert from “session timed out idle”.
- Fenced terminal state: a session reached Fenced (typically after ForcedTeardown if the teardown itself failed). The slot is quarantined: no new session may reuse it until the fence is cleared out-of-band by an operator.
- Failed-closed terminal state: a session reached FailedClosed from Expired. Hard-rejected; the executor should NOT re-attempt creation under the same identity.
- Clock skew: tick() depends on monotonic timestamps. If the clock jumps backward, sessions may appear to have more remaining TTL than they should. Use a monotonic source, not wall-clock time.
- Thundering herd on popular sessions: jitter in RenewalPolicy spreads renewal requests. Each session renews independently.
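For the clock-skew item, one way to obtain monotonic seconds is to measure from a fixed `std::time::Instant`. The `MonoClock` wrapper below is a hypothetical helper, not part of any grafos crate.

```rust
use std::time::Instant;

/// Hypothetical monotonic seconds source for driving tick().
pub struct MonoClock {
    origin: Instant,
}

impl MonoClock {
    pub fn new() -> Self {
        MonoClock { origin: Instant::now() }
    }

    /// Whole seconds elapsed since this clock was created. `Instant` is
    /// monotonic, so wall-clock adjustments do not move this backward.
    pub fn now_secs(&self) -> u64 {
        self.origin.elapsed().as_secs()
    }
}
```

Values from such a clock are only meaningful within the process that created it, so lease deadlines derived from it should not be compared across nodes.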
Observability
- Active session count (number of tracked leases in RenewalManager).
- Renewal success/failure rate.
- Near-expiry warnings issued per tick.
- Session creation/expiry rate (helps size the KV store).
- Stale-write rejections (count of StaleEpochError; should be rare, and a spike indicates a bug).
- Revoke-state transitions keyed on from → to per RevokeState::as_str(). SIEM rules alert on to == "forced_teardown" (cooperative path lost) and on to == "fenced" (slot quarantined). See SIEM vocabulary cookbook for the predicate library.
- Cooperative vs forced teardown ratio per session class: a rising forced rate indicates either insufficient grace window or session executors not wired to observe the typed lifecycle.
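The transition counter and the cooperative-versus-forced ratio can be sketched with a plain map. `RevokeMetrics` is a hypothetical type. The string "forced_teardown" appears in this recipe, while "checkpoint_reported" is an assumed `as_str()` value used here to stand for the cooperative path.

```rust
use std::collections::HashMap;

/// Hypothetical counter keyed on (from, to) state names.
#[derive(Default)]
pub struct RevokeMetrics {
    transitions: HashMap<(String, String), u64>,
}

impl RevokeMetrics {
    /// Count one observed transition.
    pub fn record(&mut self, from: &str, to: &str) {
        *self
            .transitions
            .entry((from.to_string(), to.to_string()))
            .or_insert(0) += 1;
    }

    /// Total transitions that ended in `to`, from any state.
    fn entered(&self, to: &str) -> u64 {
        self.transitions
            .iter()
            .filter(|((_, t), _)| t.as_str() == to)
            .map(|(_, n)| *n)
            .sum()
    }

    /// Fraction of teardowns that were forced; None before any teardown
    /// of either kind has been recorded.
    pub fn forced_ratio(&self) -> Option<f64> {
        let forced = self.entered("forced_teardown");
        // Assumed state name for the cooperative path.
        let cooperative = self.entered("checkpoint_reported");
        let total = forced + cooperative;
        if total == 0 {
            None
        } else {
            Some(forced as f64 / total as f64)
        }
    }
}
```

Keeping one such counter per session class would give the per-class ratio described in the list above.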
Variations
- Persistent sessions: enable the persistence feature on grafos-kv to spill session data to block storage. Sessions survive node restarts at the cost of higher read latency.
- Session affinity: use PlacementScorer to co-locate a user’s session with their compute leases, reducing cross-node reads.
- Sliding window TTL: instead of fixed 30-minute TTL, extend by activity duration. Active users get longer sessions; brief visitors expire quickly.
- Multi-region sessions: use replicated log/map/checkpoint resources. Durable checkpoints can still be local snapshot artifacts, but they are not the cross-region replication contract by themselves.
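One possible reading of the sliding-window variation is a TTL that grows with how long the session has been active, bounded on both sides. The function name, its parameters, and the additive rule are all assumptions; the recipe does not specify a formula.

```rust
/// Hypothetical sliding-window rule: the next TTL is the minimum TTL plus
/// the time the session has been active so far, clamped to `max_ttl`.
/// Requires `min_ttl <= max_ttl` (clamp panics otherwise).
pub fn sliding_ttl_secs(created_at: u64, last_activity: u64, min_ttl: u64, max_ttl: u64) -> u64 {
    let active_for = last_activity.saturating_sub(created_at);
    min_ttl.saturating_add(active_for).clamp(min_ttl, max_ttl)
}
```

The result would be passed as the TTL on the next session write, so a brief visitor keeps the minimum while a long-running session earns up to the maximum.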