Recipe 38 — Anti-Affinity For Failure Domains

Problem: You’re running replicated service instances for availability. If all replicas land on the same rack, a single rack failure takes them all down.

Solution: Use anti-affinity to spread replicas across racks.

Anti-affinity is a scheduler-local topology constraint. Replicated-resource placement policy is the higher-level failure-domain contract for a logical replicated resource. Use placement to authorize providers, regions, availability zones, and quorum behavior; use anti-affinity to avoid specific racks or nodes inside those allowed domains.

Core grafOS API Path

The direct grafos-std path is: carry the topology coordinate of the existing replica, then ask the scheduler for a second lease that excludes that coordinate.

use grafos_std::cpu::CpuBuilder;
use grafos_std::affinity::{Affinity, Strength, Target};

// Replica A is already running on rack 3.
let replica_a_rack = 3;

// Replica B requests anti-affinity from rack 3.
let replica_b = CpuBuilder::new()
    .single_core()
    .anti_affinity(Strength::Required, Target::rack(replica_a_rack))
    .lease_secs(300)
    .acquire()?;

// The scheduler will NOT place replica B on any node in rack 3.
// If no other racks have capacity, the request fails (required = fail-closed).
# let _ = replica_b;
# Ok::<(), grafos_std::error::GrafosError>(())

Node-level anti-affinity:

If you don’t have rack metadata but want replicas on different nodes:

let replica_a_node: u128 = /* node ID of replica A */;

let replica_b = CpuBuilder::new()
    .single_core()
    .anti_affinity(Strength::Required, Target::node(replica_a_node))
    .lease_secs(300)
    .acquire()?;
# let _ = replica_b;
# Ok::<(), grafos_std::error::GrafosError>(())
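
The difference between a rack target and a node target is easiest to see as a filter over candidate nodes. The following is a toy model only: `ToyTarget`, `Node`, and `eligible` are hypothetical names invented for illustration and are not part of grafos-std, and the real scheduler may weigh many other factors.

```rust
/// Hypothetical stand-in for an anti-affinity target; not the grafos-std type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyTarget {
    Rack(u32),
    Node(u128),
}

/// Hypothetical candidate node with its topology coordinate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Node {
    pub id: u128,
    pub rack: u32,
}

impl ToyTarget {
    /// True when `node` must not host the new replica.
    pub fn excludes(&self, node: &Node) -> bool {
        match *self {
            // A rack target rules out every node in that rack.
            ToyTarget::Rack(rack) => node.rack == rack,
            // A node target rules out only that one node.
            ToyTarget::Node(id) => node.id == id,
        }
    }
}

/// IDs of the nodes that stay eligible under a required anti-affinity target.
pub fn eligible(nodes: &[Node], target: ToyTarget) -> Vec<u128> {
    nodes
        .iter()
        .filter(|n| !target.excludes(n))
        .map(|n| n.id)
        .collect()
}
```

In this model, a node target still allows the new replica to land on a different node in the same rack, so it protects against node failure but not rack failure.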

Preferred vs Required:

  • Required — placement fails if the anti-affinity can’t be satisfied. Use for critical availability requirements.
  • Preferred — scheduler tries to spread but will co-locate if no alternative exists. Use when spreading is nice-to-have but the service must start regardless.

// Soft spread: preferred anti-affinity.
// `replica_a_rack` is the rack hosting replica A, as in the first example.
let replica_b = CpuBuilder::new()
    .single_core()
    .anti_affinity(Strength::Preferred, Target::rack(replica_a_rack))
    .lease_secs(300)
    .acquire()?;
# let _ = replica_b;
# Ok::<(), grafos_std::error::GrafosError>(())
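
The two strengths differ only in what happens when no alternative rack has capacity. A minimal sketch of that decision, assuming a simplified scheduler that sees only a list of racks with free capacity; `ToyStrength` and `choose_rack` are hypothetical and do not reflect the actual grafos-std implementation:

```rust
/// Hypothetical stand-in for the anti-affinity strength; not the grafos-std type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyStrength {
    Required,
    Preferred,
}

/// Choose a rack from `racks_with_capacity`, trying to avoid `avoid_rack`.
/// Returns `None` when the request cannot be placed at all.
pub fn choose_rack(
    racks_with_capacity: &[u32],
    avoid_rack: u32,
    strength: ToyStrength,
) -> Option<u32> {
    // First choice in both modes: any rack other than the one to avoid.
    let spread = racks_with_capacity
        .iter()
        .copied()
        .find(|&r| r != avoid_rack);

    match (spread, strength) {
        (Some(rack), _) => Some(rack),
        // Required is fail-closed: no alternative rack means no placement.
        (None, ToyStrength::Required) => None,
        // Preferred falls back to co-location if that rack has capacity.
        (None, ToyStrength::Preferred) => racks_with_capacity
            .iter()
            .copied()
            .find(|&r| r == avoid_rack),
    }
}
```

With only rack 3 available, `choose_rack(&[3], 3, ToyStrength::Required)` yields `None`, while the same call with `ToyStrength::Preferred` yields `Some(3)`.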

When NOT to use anti-affinity:

  • If your service has exactly one instance — anti-affinity is meaningless with a single replica.
  • If your nodes are homogeneous and you just want spread — the scheduler’s Strategy::Spread already distributes evenly by load. Anti-affinity adds value when you need failure-domain awareness, not just load balancing.
  • If you need cross-region or cross-provider continuation — anti-affinity alone is not enough. Use replicated-resource placement policy for the logical resource and keep anti-affinity as a node/rack-level constraint inside that policy.
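
To illustrate the second point above, here is a toy comparison of load-only spread and rack-aware selection. It assumes that spreading by load means "pick the least-loaded node", which is a simplification and not a description of how Strategy::Spread is actually implemented; `Node`, `pick_by_load`, and `pick_avoiding_rack` are hypothetical names.

```rust
/// Hypothetical candidate node; `load` is an abstract utilization score.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Node {
    pub id: u128,
    pub rack: u32,
    pub load: u32,
}

/// Load-only spread: the least-loaded node wins, regardless of rack.
pub fn pick_by_load(nodes: &[Node]) -> Option<Node> {
    nodes.iter().copied().min_by_key(|n| n.load)
}

/// Rack-aware pick: the least-loaded node outside `avoid_rack`.
pub fn pick_avoiding_rack(nodes: &[Node], avoid_rack: u32) -> Option<Node> {
    nodes
        .iter()
        .copied()
        .filter(|n| n.rack != avoid_rack)
        .min_by_key(|n| n.load)
}
```

If replica A runs on rack 3 and the least-loaded free node is also on rack 3, `pick_by_load` co-locates the two replicas in one failure domain, while `pick_avoiding_rack` accepts a busier node on another rack.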

Multi-Cloud Placement Variant

For a logical resource that must survive provider failure, express the provider envelope in placement policy first:

use grafos_replicated::{
    FailureDomain, FailureDomainLevel, PlacementPolicy, PlacementPreference,
    ReplicatedResourceSpec, ResourceKind,
};
use grafos_std::affinity::{Strength, Target};

let placement = PlacementPolicy::new()
    .allow(FailureDomain::cloud_provider("aws"))
    .allow(FailureDomain::cloud_provider("gcp"))
    .require_distinct(FailureDomainLevel::CloudProvider)
    .prefer(PlacementPreference::PreferLowerEgressCost);

let replicated_queue =
    ReplicatedResourceSpec::builder("checkout/orders", ResourceKind::Queue)
        .placement(placement)
        .build()?;

// Inside each allowed cloud, anti-affinity can still prevent two local
// workers from landing on the same rack. It cannot add a new cloud to the
// placement envelope above.
let replica_a_rack: u32 = 3; // rack hosting replica A in this provider
let rack_spread = (Strength::Required, Target::rack(replica_a_rack));
# let _ = (replicated_queue, rack_spread);
# Ok::<(), grafos_replicated::PolicyError>(())

This gives the program two separate controls:

  • placement says AWS and GCP are the only authorized provider domains for this queue, with distinct-provider replica placement;
  • anti-affinity says local workers or replicas should avoid a rack collision inside an authorized provider domain.
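
The separation of the two controls can be sketched as two independent checks over a set of locations. This is a toy model under stated assumptions: `Location`, `within_envelope`, and `racks_distinct` are hypothetical, rack numbers are assumed to be scoped per provider, and neither function corresponds to a real grafos-replicated or grafos-std call.

```rust
use std::collections::HashSet;

/// Hypothetical location of one replica or worker.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Location {
    pub provider: &'static str,
    pub rack: u32,
}

/// Placement-level check: every replica sits in an allowed provider and
/// no two replicas share a provider.
pub fn within_envelope(replicas: &[Location], allowed: &[&str]) -> bool {
    let mut seen = HashSet::new();
    replicas
        .iter()
        .all(|r| allowed.contains(&r.provider) && seen.insert(r.provider))
}

/// Anti-affinity-level check: no two workers share a rack inside one
/// provider. Rack IDs are assumed to be meaningful only within a provider.
pub fn racks_distinct(workers: &[Location]) -> bool {
    let mut seen = HashSet::new();
    workers.iter().all(|w| seen.insert((w.provider, w.rack)))
}
```

Note that `racks_distinct` never consults the allowed-provider list: in this model, as in the recipe, the rack-level constraint cannot widen or narrow the provider envelope.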

See also:

  • docs/spec/affinity-request-model.md §2.6 — anti-affinity encoding
  • docs/grafos/affinity-taxonomy.md §5.3 — topology affinity
  • docs/spec/hierarchical-scheduler.md — cell/orchestrator failure model