Recipe 13: Testing a Distributed System Without a Distributed System
Situation
Distributed systems are hard to test because their behavior emerges from:
- failure timing
- partitions
- retries
- timeouts
Testing that in real multi-node environments is slow and flaky.
If your application is written against the lease abstraction rather than raw sockets, you can exercise a large fraction of the distributed behavior in-process with deterministic control.
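To make the idea of coding against a lease abstraction concrete, here is a minimal sketch. The Lease trait, LeaseError enum, InMemoryLease fake, and increment_counter function are all hypothetical names invented for illustration; they are not the grafos API. The sketch only shows the shape of the seam that makes in-process testing possible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the error type the application handles.
#[derive(Debug, PartialEq)]
pub enum LeaseError {
    Disconnected,
    LeaseExpired,
}

/// Hypothetical seam: everything the application needs from a lease.
pub trait Lease {
    fn read(&self, key: &str) -> Result<Option<Vec<u8>>, LeaseError>;
    fn write(&mut self, key: &str, value: &[u8]) -> Result<(), LeaseError>;
}

/// In-memory fake for tests; the test flips `expired` to simulate expiry.
#[derive(Default)]
pub struct InMemoryLease {
    data: HashMap<String, Vec<u8>>,
    pub expired: bool,
}

impl Lease for InMemoryLease {
    fn read(&self, key: &str) -> Result<Option<Vec<u8>>, LeaseError> {
        if self.expired {
            return Err(LeaseError::LeaseExpired);
        }
        Ok(self.data.get(key).cloned())
    }

    fn write(&mut self, key: &str, value: &[u8]) -> Result<(), LeaseError> {
        if self.expired {
            return Err(LeaseError::LeaseExpired);
        }
        self.data.insert(key.to_string(), value.to_vec());
        Ok(())
    }
}

/// Application logic depends only on the trait, never on sockets.
/// Reads a little-endian u64 counter (missing or malformed means 0),
/// increments it, and writes it back.
pub fn increment_counter<L: Lease>(lease: &mut L, key: &str) -> Result<u64, LeaseError> {
    let current = match lease.read(key)? {
        Some(bytes) if bytes.len() == 8 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(&bytes);
            u64::from_le_bytes(buf)
        }
        _ => 0,
    };
    let next = current + 1;
    lease.write(key, &next.to_le_bytes())?;
    Ok(next)
}
```

Because increment_counter is generic over the trait, the same function can run against a real lease in production and against the fake in a unit test, where expiry is a field assignment rather than a network event.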
What You Build
A test strategy that:
- runs application logic against ScenarioRunner with step-based fault injection
- deterministically injects failures (Disconnected, LeaseExpired, etc.) at specific steps
- checks invariants using assertion helpers (assert_no_leaked_leases, assert_rebind_converges, etc.)
Building Blocks
- grafos_testkit::{ScenarioRunner, FaultConfig, FaultInjector}
- grafos_testkit::{assert_no_leaked_leases, assert_rebind_converges, assert_eventually, assert_stale_epoch_rejected}
- grafos_observe for asserting lease lifecycle events
Design
Declarative Fault Specification
Use FaultConfig to declare which operations fail at which step. No timers, no randomness,
no sleeps — everything is step-based and deterministic:
use grafos_testkit::FaultConfig;

let faults = FaultConfig::new()
    .read_fails_at(3)      // step 3: reads fail
    .write_fails_at(5)     // step 5: writes fail
    .lease_expires_at(7);  // step 7: lease operations fail

ScenarioRunner Drives the Steps
ScenarioRunner ticks the FaultInjector before each step, so step N sees the fault state
configured for tick N. Each step receives a &FaultInjector to query:
- fi.should_fail_read(): true if the current step is configured to fail reads
- fi.should_fail_write(): true if the current step is configured to fail writes
- fi.should_fail_lease(): true if the current step is configured to fail lease operations
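The tick-before-step behavior can be modeled in a few lines of plain Rust. The sketch below is not the grafos_testkit implementation: StepFaults, Injector, run_steps, and read_step are hypothetical names, and the choice to keep running after a failed step is an assumption read off the walkthrough's expected counts.

```rust
use std::collections::HashSet;

/// Hypothetical step-indexed fault plan (a model, not the grafos_testkit type).
#[derive(Default)]
pub struct StepFaults {
    read_steps: HashSet<u32>,
    write_steps: HashSet<u32>,
}

impl StepFaults {
    pub fn read_fails_at(mut self, step: u32) -> Self {
        self.read_steps.insert(step);
        self
    }

    pub fn write_fails_at(mut self, step: u32) -> Self {
        self.write_steps.insert(step);
        self
    }
}

/// Hypothetical injector: holds the plan plus the current tick.
pub struct Injector {
    faults: StepFaults,
    tick: u32,
}

impl Injector {
    pub fn new(faults: StepFaults) -> Self {
        Injector { faults, tick: 0 }
    }

    /// Advance to the next step; called by the runner before each step.
    pub fn tick(&mut self) {
        self.tick += 1;
    }

    pub fn should_fail_read(&self) -> bool {
        self.faults.read_steps.contains(&self.tick)
    }

    pub fn should_fail_write(&self) -> bool {
        self.faults.write_steps.contains(&self.tick)
    }
}

/// A step is a plain function that queries the injector.
pub type Step = fn(&Injector) -> Result<(), String>;

/// Example step: a read that fails exactly when the injector says so.
pub fn read_step(fi: &Injector) -> Result<(), String> {
    if fi.should_fail_read() {
        Err("read failed".to_string())
    } else {
        Ok(())
    }
}

/// Ticks before each step so step N sees the faults configured for tick N.
/// Assumption: the run continues after a failed step.
/// Returns (steps executed, steps failed).
pub fn run_steps(faults: StepFaults, steps: &[Step]) -> (usize, usize) {
    let mut injector = Injector::new(faults);
    let mut failed = 0;
    for step in steps {
        injector.tick();
        if step(&injector).is_err() {
            failed += 1;
        }
    }
    (steps.len(), failed)
}
```

Since the only input is the step counter, running the same plan against the same steps always yields the same result, which is what makes failures reproducible without timers or sleeps.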
Assertion Helpers
After running a scenario, use the built-in assertion helpers:
- assert_no_leaked_leases(): verify all leases were freed
- assert_rebind_converges(): verify convergence within a step budget
- assert_eventually(): generic polling assertion
- assert_stale_epoch_rejected(): verify stale-epoch operations are rejected
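To illustrate what "convergence within a step budget" could mean, here is a small self-contained sketch of a step-bounded polling check. The functions converges_within and assert_converges_within are hypothetical and say nothing about the real signatures of assert_eventually or assert_rebind_converges, which this recipe does not show.

```rust
/// Hypothetical helper: polls `condition` once per step, for at most
/// `max_steps` steps. Returns the 1-based step at which the condition
/// first held, or None if the budget ran out.
pub fn converges_within<F>(max_steps: u32, mut condition: F) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    (1..=max_steps).find(|&step| condition(step))
}

/// Hypothetical assertion wrapper: panics with a descriptive message when
/// the condition never holds within the budget; otherwise returns the step.
pub fn assert_converges_within<F>(label: &str, max_steps: u32, condition: F) -> u32
where
    F: FnMut(u32) -> bool,
{
    match converges_within(max_steps, condition) {
        Some(step) => step,
        None => panic!("{}: did not converge within {} steps", label, max_steps),
    }
}
```

Counting steps instead of wall-clock time keeps the check deterministic: a budget of 5 means exactly five polls, however fast or slow the machine is.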
Walkthrough (Test Sketches)
Test: Read Failure and Recovery
use grafos_testkit::{FaultConfig, ScenarioRunner};
use grafos_std::error::FabricError;

let runner = ScenarioRunner::new("read-failure-recovery")
    .with_faults(FaultConfig::new().read_fails_at(3))
    .step("step 1: read ok", |fi| {
        assert!(!fi.should_fail_read());
        Ok(())
    })
    .step("step 2: read ok", |fi| {
        assert!(!fi.should_fail_read());
        Ok(())
    })
    .step("step 3: read fails", |fi| {
        assert!(fi.should_fail_read());
        // Application detects failure and starts recovery
        Err(FabricError::Disconnected)
    })
    .step("step 4: read recovers", |fi| {
        assert!(!fi.should_fail_read());
        // Application has rebound to new lease
        Ok(())
    });

let result = runner.run();
assert_eq!(result.steps_executed, 4);
assert_eq!(result.steps_failed, 1);

Test: Lease Expiry Mid-Write
use grafos_testkit::{FaultConfig, ScenarioRunner};
use grafos_std::error::FabricError;

let runner = ScenarioRunner::new("lease-expiry-mid-write")
    .with_faults(
        FaultConfig::new()
            .write_fails_at(2)
            .lease_expires_at(2),
    )
    .step("step 1: write succeeds", |fi| {
        assert!(!fi.should_fail_write());
        Ok(())
    })
    .step("step 2: write + lease fail", |fi| {
        assert!(fi.should_fail_write());
        assert!(fi.should_fail_lease());
        Err(FabricError::LeaseExpired)
    })
    .step("step 3: rebuild and continue", |fi| {
        assert!(!fi.should_fail_write());
        assert!(!fi.should_fail_lease());
        Ok(())
    });

let result = runner.run();
assert_eq!(result.steps_executed, 3);
assert_eq!(result.steps_failed, 1);

Post-Scenario Assertions
use grafos_testkit::{assert_no_leaked_leases, assert_stale_epoch_rejected};

// After any scenario, verify cleanup
assert_no_leaked_leases();

// Verify fencing rejects stale epochs
assert_stale_epoch_rejected();

Variations
- property-based tests: randomize FaultConfig across many runs
- multi-fault scenarios: combine read_fails_at, write_fails_at, and lease_expires_at on different steps
- replay logs of events captured from production (if you build an event sink)
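For the property-based variation, one way to randomize fault placement while keeping every run reproducible is to derive the fault steps from a seed. The sketch below uses a small hand-rolled linear congruential generator so it needs no external crate. Lcg, FaultPlan, and plan_for_seed are hypothetical names; mapping a plan onto FaultConfig is left to the test.

```rust
/// Tiny deterministic generator (64-bit linear congruential generator).
/// Good enough for picking fault steps; not suitable for cryptography.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    /// Advance the state and return its upper 32 bits.
    pub fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    /// A step number in 1..=max. `max` must be non-zero.
    pub fn step_in(&mut self, max: u32) -> u32 {
        1 + self.next_u32() % max
    }
}

/// Hypothetical plain-data description of where each fault lands.
#[derive(Debug, Clone, PartialEq)]
pub struct FaultPlan {
    pub read_fails_at: u32,
    pub write_fails_at: u32,
    pub lease_expires_at: u32,
}

/// The same seed always yields the same plan, so a failing run can be
/// replayed by logging nothing more than its seed.
pub fn plan_for_seed(seed: u64, total_steps: u32) -> FaultPlan {
    let mut rng = Lcg::new(seed);
    FaultPlan {
        read_fails_at: rng.step_in(total_steps),
        write_fails_at: rng.step_in(total_steps),
        lease_expires_at: rng.step_in(total_steps),
    }
}
```

A test loop could iterate over a range of seeds, build a FaultConfig from each plan with read_fails_at, write_fails_at, and lease_expires_at, and include the seed in any assertion message so that the failing case can be rerun exactly.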