RDMA Lease Revoke Semantics

Status: descriptive. Documents the currently-shipped behavior of RDMA lease revocation in fabricBIOS, with particular attention to the split between the authoritative control-plane contract (closed) and the additive dataplane-observability property for non-cooperating clients (investigated + baselined, not tuned further per reviewer guidance).

Scope: applies to the x86-uefi bare-metal firmware driving a ConnectX-5 over VFIO passthrough. The conclusions below are CX-5 firmware 16.35.4506 specific where noted; the control-plane contract is transport-agnostic.

Related:

  • docs/spec/fabricbios-wire-encoding-v0.md: LEASE_REVOKE op encoding (0x020A).
  • TODO.md items 10958 (control-plane contract, closed) and 10959 (dataplane property, baselined).
  • crates/fabricbios-x86-uefi/src/rdma_backend.rs::Mlx5RdmaBackend::lease_revoke_outcome.
  • crates/fabricbios-x86-uefi/src/mlx5_hw.rs::rdma_lease_revoke_fence, rdma_lease_sweep_fenced.

1. Two Distinct Properties

RDMA lease revocation has two separable observability paths. Keeping them distinct is critical — failing to do so historically caused the team to treat a hardware characteristic as a tuning problem.

1.1 Control-plane authoritative recall (CLOSED)

Property: after a LEASE_REVOKE (op 0x020A) request, the server returns a structured outcome code within 1 s, authoritatively declaring the lease recalled:

| Code | Meaning |
|---|---|
| TornDown (0) | Clean teardown initiated; slot will be released once deferred destroy sweeps |
| Fenced (1) | Partial teardown failure; slot permanently consumed (fail-closed) |
| NotFound (2) | No active lease with this id |

Observed cross-host on bare-metal x86-uefi firmware with ConnectX-5 (firmware 16.35.4506) via the dataplane verifier harness: ≈ 12 ms recall latency (includes a synchronous flow-steering FTE install; bare 2ERR_QP + DESTROY_MKEY is ≈ 1.5 ms).

A cooperating runtime acts on this outcome immediately — it must stop issuing RDMA ops to the revoked lease’s QP/rkey once the TornDown or Fenced outcome is received. This is the primary revoke semantic and is unchanged by anything below.
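As a minimal sketch of the cooperating-client rule above, the following models the three outcome codes and the "stop on TornDown or Fenced" behavior. The `u8` code width, the `ClientLease` type, and every method name are assumptions for illustration; they are not taken from the wire-encoding spec or the firmware source.

```rust
/// Outcome codes listed for the LEASE_REVOKE (op 0x020A) response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseRevokeOutcome {
    TornDown = 0,
    Fenced = 1,
    NotFound = 2,
}

impl LeaseRevokeOutcome {
    /// Decode a numeric outcome code. The u8 width is an assumption;
    /// unknown codes yield None.
    pub fn from_code(code: u8) -> Option<Self> {
        match code {
            0 => Some(Self::TornDown),
            1 => Some(Self::Fenced),
            2 => Some(Self::NotFound),
            _ => None,
        }
    }

    /// True for the outcomes after which a cooperating client must stop
    /// issuing RDMA ops to the revoked lease's QP/rkey.
    pub fn lease_recalled(self) -> bool {
        matches!(self, Self::TornDown | Self::Fenced)
    }
}

/// Hypothetical client-side view of a lease.
#[derive(Debug)]
pub struct ClientLease {
    pub lease_id: u64,
    pub usable: bool,
}

impl ClientLease {
    /// Apply a revoke outcome. NotFound leaves the handle unchanged here
    /// because the text above does not specify client behavior for it.
    pub fn apply(&mut self, outcome: LeaseRevokeOutcome) {
        if outcome.lease_recalled() {
            self.usable = false;
        }
    }
}
```

A client would check `usable` before every post; once it is false, no further work requests go to that lease.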

1.2 Dataplane observability for non-cooperating clients (BASELINED)

Property (desired): a sustained ibv_post_send(IBV_WR_RDMA_WRITE) loop by a client that IGNORES the control-plane outcome should see a completion error within a bounded sub-second interval after the server’s revoke.

Property (current baseline on CX-5 fw 16.35.4506): ≈ 1.4 s dataplane observation latency, status IBV_WC_REM_ACCESS_ERR (12). This is a hardware floor, not a tuning target. See §3 for the mechanism.

2. Teardown Pipeline

The revoke teardown is split into two phases to give clean resource lifecycle and to provide a hook point for any future stricter dataplane work (§5 option c).

2.1 Immediate phase (rdma_lease_revoke_fence)

Runs synchronously in the LEASE_REVOKE request handler:

  1. 2ERR_QP (opcode 0x0507) — transitions the QP to ERR state.
  2. DESTROY_MKEY (opcode 0x0202) — invalidates the rkey.
  3. (If RDMA_RX FT setup succeeded at boot) SET_FLOW_TABLE_ENTRY on the pre-configured misc.bth_opcode DROP group, indexed by the lease slot. Best-effort; does NOT buy latency on this firmware (§3) but is the skeleton for option (c).
  4. Mark slot pending_destroy_at_secs = now + GRACE_SECS (default 5 s).

Outcome: TornDown (all three of 1–3 ok) or Fenced (2ERR_QP or DESTROY_MKEY failed). Returns via LEASE_REVOKE wire response.
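The immediate phase can be sketched roughly as follows. This is not the body of rdma_lease_revoke_fence: the `HcaCmd` trait, `Slot` layout, and `FakeHca` are hypothetical stand-ins, and the sketch assumes that both 2ERR_QP and DESTROY_MKEY are attempted even if the first fails, that a failed FTE install is non-fatal (one reading of "best-effort"), and that a fenced slot gets no destroy deadline.

```rust
/// mlx5 command opcodes quoted in the steps above.
pub const OP_2ERR_QP: u16 = 0x0507;
pub const OP_DESTROY_MKEY: u16 = 0x0202;
/// Default grace period before the deferred destroy may run.
pub const GRACE_SECS: u64 = 5;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    TornDown,
    Fenced,
}

/// Hypothetical stand-in for the HCA command interface.
pub trait HcaCmd {
    /// Issue a command against an object id (QPN or mkey index).
    fn exec(&mut self, opcode: u16, object: u32) -> Result<(), u8>;
    /// Install the per-slot DROP flow-table entry.
    fn install_drop_fte(&mut self, slot: usize) -> Result<(), u8>;
}

#[derive(Debug, Default)]
pub struct Slot {
    pub qpn: u32,
    pub mkey_index: u32,
    pub fte_installed: bool,
    pub pending_destroy_at_secs: Option<u64>,
    pub fenced: bool,
}

pub fn revoke_fence<H: HcaCmd>(
    hca: &mut H,
    slot_idx: usize,
    slot: &mut Slot,
    rdma_rx_ft_ready: bool,
    now_secs: u64,
) -> Outcome {
    // Steps 1 and 2: both are attempted regardless of the first result (assumption).
    let qp_ok = hca.exec(OP_2ERR_QP, slot.qpn).is_ok();
    let mkey_ok = hca.exec(OP_DESTROY_MKEY, slot.mkey_index).is_ok();
    // Step 3: best-effort, only when the RDMA_RX flow table exists.
    if rdma_rx_ft_ready {
        slot.fte_installed = hca.install_drop_fte(slot_idx).is_ok();
    }
    if qp_ok && mkey_ok {
        // Step 4: hand the slot to the deferred sweep after the grace period.
        slot.pending_destroy_at_secs = Some(now_secs + GRACE_SECS);
        Outcome::TornDown
    } else {
        slot.fenced = true;
        Outcome::Fenced
    }
}

/// Fake HCA that fails every opcode listed in `fail`, for exercising the sketch.
pub struct FakeHca {
    pub fail: Vec<u16>,
    pub issued: Vec<u16>,
}

impl HcaCmd for FakeHca {
    fn exec(&mut self, opcode: u16, _object: u32) -> Result<(), u8> {
        self.issued.push(opcode);
        // The error value is an arbitrary nonzero status.
        if self.fail.contains(&opcode) { Err(1) } else { Ok(()) }
    }
    fn install_drop_fte(&mut self, _slot: usize) -> Result<(), u8> {
        Ok(())
    }
}
```

With an empty `fail` list the call returns `Outcome::TornDown` and sets the deadline to `now_secs + GRACE_SECS`; listing `OP_DESTROY_MKEY` in `fail` yields `Outcome::Fenced`.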

2.2 Deferred phase (rdma_lease_sweep_fenced)

Runs from the QUIC main-loop idle path (tick_rdma_sweep) on pending-destroy slots past their deadline:

  1. DELETE_FLOW_TABLE_ENTRY for the slot’s installed FTE.
  2. 2RST_QP (opcode 0x050A) — drains QP from ERR to RESET.
  3. DESTROY_QP (opcode 0x0501) — releases the QP slot on the HCA.
  4. Clear slot (reusable for new lease) OR fence (if 2RST_QP or DESTROY_QP failed).
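A rough sketch of the deferred sweep over pending-destroy slots, under stated assumptions: the `SlotState` enum, `HcaCmd` trait, and `FakeHca` are hypothetical, a slot counts as due when `now_secs >= at_secs`, the DELETE_FLOW_TABLE_ENTRY result is ignored, and DESTROY_QP is skipped if 2RST_QP fails. None of these choices is confirmed by the firmware source.

```rust
/// mlx5 command opcodes quoted in the steps above.
pub const OP_2RST_QP: u16 = 0x050A;
pub const OP_DESTROY_QP: u16 = 0x0501;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotState {
    Free,
    PendingDestroy { at_secs: u64 },
    Fenced,
}

#[derive(Debug)]
pub struct Slot {
    pub qpn: u32,
    pub fte_installed: bool,
    pub state: SlotState,
}

/// Hypothetical stand-in for the HCA command interface.
pub trait HcaCmd {
    fn exec(&mut self, opcode: u16, qpn: u32) -> Result<(), u8>;
    fn delete_fte(&mut self, slot: usize) -> Result<(), u8>;
}

/// Sweep all slots whose deadline has passed; returns how many were released.
pub fn sweep_fenced<H: HcaCmd>(hca: &mut H, slots: &mut [Slot], now_secs: u64) -> usize {
    let mut released = 0;
    for (idx, slot) in slots.iter_mut().enumerate() {
        let due = matches!(slot.state, SlotState::PendingDestroy { at_secs } if at_secs <= now_secs);
        if !due {
            continue;
        }
        if slot.fte_installed {
            // Step 1: remove the slot's FTE; the result is ignored here (assumption).
            let _ = hca.delete_fte(idx);
            slot.fte_installed = false;
        }
        // Steps 2 and 3: short-circuit, so DESTROY_QP is skipped if 2RST_QP fails.
        let destroyed = hca.exec(OP_2RST_QP, slot.qpn).is_ok()
            && hca.exec(OP_DESTROY_QP, slot.qpn).is_ok();
        // Step 4: clear the slot for reuse, or fence it (fail-closed).
        if destroyed {
            slot.state = SlotState::Free;
            released += 1;
        } else {
            slot.state = SlotState::Fenced;
        }
    }
    released
}

/// Fake HCA that fails DESTROY_QP for the listed QPNs, for exercising the sketch.
pub struct FakeHca {
    pub fail_destroy_for: Vec<u32>,
}

impl HcaCmd for FakeHca {
    fn exec(&mut self, opcode: u16, qpn: u32) -> Result<(), u8> {
        if opcode == OP_DESTROY_QP && self.fail_destroy_for.contains(&qpn) {
            Err(1) // arbitrary nonzero status
        } else {
            Ok(())
        }
    }
    fn delete_fte(&mut self, _slot: usize) -> Result<(), u8> {
        Ok(())
    }
}
```

Slots whose deadline has not yet passed are left untouched, so calling the sweep from an idle path on every tick is harmless.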

2.3 Fail-closed fencing

If any destroy command fails, the slot is permanently fenced: RdmaLeaseTable::fence(slot, failed_mask, origin) preserves the failure context on the slot for post-mortem. FENCE_ORIGIN_* constants distinguish LEGACY (synchronous path), REVOKE (immediate phase), and SWEEP (deferred phase). fenced_summary() exposes a diagnostic view. Fenced slots are not reused until firmware reboot.
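To make the fail-closed bookkeeping concrete, here is a small sketch in the shape of fence(slot, failed_mask, origin). It is not the real RdmaLeaseTable: the type names, the numeric values of the FENCE_ORIGIN_* constants, the failure-mask bit layout, the four-slot table size, and the tuple returned by the summary are all assumptions. The sketch also assumes that re-fencing an already fenced slot keeps the first origin and accumulates the mask.

```rust
// Hypothetical numeric values; the source names these constants but not their values.
pub const FENCE_ORIGIN_LEGACY: u8 = 1;
pub const FENCE_ORIGIN_REVOKE: u8 = 2;
pub const FENCE_ORIGIN_SWEEP: u8 = 3;

// Hypothetical failure-mask bits, one per destroy command.
pub const FAIL_2ERR_QP: u8 = 1 << 0;
pub const FAIL_DESTROY_MKEY: u8 = 1 << 1;
pub const FAIL_2RST_QP: u8 = 1 << 2;
pub const FAIL_DESTROY_QP: u8 = 1 << 3;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct SlotSketch {
    pub active: bool,
    pub fenced: bool,
    pub failed_mask: u8,
    pub origin: u8,
}

#[derive(Debug, Default)]
pub struct LeaseTableSketch {
    pub slots: [SlotSketch; 4],
    pub fenced_count: usize,
}

impl LeaseTableSketch {
    /// Permanently fence a slot, preserving which commands failed and where.
    pub fn fence(&mut self, slot: usize, failed_mask: u8, origin: u8) {
        let s = &mut self.slots[slot];
        if !s.fenced {
            s.fenced = true;
            s.origin = origin; // keep the first origin for post-mortem
            self.fenced_count += 1;
        }
        s.active = false;
        s.failed_mask |= failed_mask;
    }

    /// Allocate the first slot that is neither active nor fenced.
    pub fn alloc(&mut self) -> Option<usize> {
        let idx = self.slots.iter().position(|s| !s.active && !s.fenced)?;
        self.slots[idx].active = true;
        Some(idx)
    }

    /// Diagnostic view: (slot index, failed_mask, origin) for each fenced slot.
    pub fn fenced_summary(&self) -> Vec<(usize, u8, u8)> {
        self.slots
            .iter()
            .enumerate()
            .filter(|(_, s)| s.fenced)
            .map(|(i, s)| (i, s.failed_mask, s.origin))
            .collect()
    }
}
```

Because `alloc` skips fenced slots, a fenced slot stays out of circulation for the lifetime of the table, which mirrors the "not reused until firmware reboot" rule.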

Deterministic regression test: [FENCE-TEST] in main.rs (gated by the bringup-tests feature) corrupts mkey_index to 0x00FF_FFFF, which forces DESTROY_MKEY to return BAD_PARAM on the HCA, and asserts that the slot becomes fenced=true, active=false, and that table.fenced_count increases by 1.

3. CX-5 Dataplane Observability Floor

3.1 Empirical observation

Across every tested approach — bare 2ERR_QP + DESTROY_MKEY, the immediate-fence + deferred-destroy split, flow-table DROP on misc.bth_dst_qp, flow-table DROP on misc.bth_opcode=0x0A (RDMA WRITE Only) — the measured dataplane first-fail latency is ≈ 1.4 s. The failing completion status is consistently IBV_WC_REM_ACCESS_ERR (12), not IBV_WC_REM_OP_ERR (11) that a true QP-in-ERR NAK would produce.

3.2 Mechanism (hypothesis, consistent with all observations)

The HCA maintains a per-QP fast-path dispatch cache. Once traffic is flowing to a QP:

  • Installed flow rules do not invalidate the cache. Proven: the same bth_opcode=0x0A DROP rule that intercepts every WRITE when installed PRE-traffic does NOT intercept inflight WRITEs when installed MID-stream (during revoke).
  • Only QP state transitions through 2ERR → 2RST → DESTROY_QP invalidate the cache, and that sequence itself has the ≈ 1.4 s drain floor.
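The hypothesis can be illustrated with a deliberately crude toy model. It only encodes the two bullets above (rule installs do not touch the cache; QP teardown does) and says nothing about how the HCA is actually built. `ToyHca`, its fields, and the per-QP verdict cache are invented for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Deliver,
    Drop,
}

/// Toy model of the hypothesized per-QP fast-path dispatch cache.
pub struct ToyHca {
    /// BTH opcodes with a DROP steering rule installed.
    drop_opcodes: Vec<u8>,
    /// Verdict cached per QPN once traffic has flowed to that QP.
    cache: HashMap<u32, Verdict>,
}

impl ToyHca {
    pub fn new() -> Self {
        ToyHca { drop_opcodes: Vec::new(), cache: HashMap::new() }
    }

    /// Install a DROP rule. Deliberately does NOT invalidate the cache.
    pub fn install_drop(&mut self, bth_opcode: u8) {
        self.drop_opcodes.push(bth_opcode);
    }

    /// Dispatch one packet: a cached verdict wins over the steering rules.
    pub fn receive(&mut self, qpn: u32, bth_opcode: u8) -> Verdict {
        if let Some(v) = self.cache.get(&qpn) {
            return *v;
        }
        let v = if self.drop_opcodes.contains(&bth_opcode) {
            Verdict::Drop
        } else {
            Verdict::Deliver
        };
        self.cache.insert(qpn, v);
        v
    }

    /// Model of the QP teardown sequence: the only event that clears the cache entry.
    pub fn qp_teardown(&mut self, qpn: u32) {
        self.cache.remove(&qpn);
    }
}
```

In this model a rule installed before the first packet drops every WRITE, while the same rule installed after traffic has started has no effect until `qp_teardown` runs, which matches the pre-traffic versus mid-stream asymmetry described above.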

Additional confirming evidence:

  • misc.bth_dst_qp matching is non-functional for generic RoCEv2 dispatch despite the ft_field_support.bth_dst_qp=1 capability bit. Kernel mlx5_ib only uses this field when an IB_QP_CREATE_SOURCE_QPN underlay is configured (drivers/infiniband/hw/mlx5/fs.c:995-1003). Every match value tried (388 = actual qpn, 0, 0xAAAAAA) left traffic flowing unchanged.
  • SET_FLOW_TABLE_ENTRY on an already-occupied flow_index returns status=0x08 (BAD_INDEX) — no atomic modify-in-place.
  • Pre-installing an ALLOW rule and switching to DROP at revoke is expected to hit the same floor. This was not tested to completion, but the HCA-cache model predicts it, and the reviewer explicitly flagged this workaround pattern as not philosophically preferred.

3.3 Baseline acceptance, not tuning

Per reviewer guidance:

The honest conclusion is that this firmware/hardware path appears to impose a real floor for mid-stream dataplane observability. … Treat that as the baseline (a). If we decide the stricter non-cooperating-client property is worth the lift, pursue the real mlx5-native underlay / SOURCE_QPN path (c) rather than a workaround-first approach.

The cross-host regression test (lease_alloc_rdma_verify --revoke-test) now reports the dataplane latency informationally with a 2000 ms regression budget. Exceeding that budget is a signal of genuine regression (lost ack path, new cache behavior, misconfiguration); within it is the known baseline.
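The budget check amounts to a single comparison. The sketch below assumes "exceeding" means strictly greater than 2000 ms; the function names, the enum, and the report format are hypothetical and are not the output of lease_alloc_rdma_verify.

```rust
use std::time::Duration;

/// Regression budget for dataplane first-fail latency.
pub const DATAPLANE_REGRESSION_BUDGET: Duration = Duration::from_millis(2000);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataplaneVerdict {
    /// Within budget: consistent with the known baseline.
    KnownBaseline,
    /// Over budget: signal of a genuine regression.
    Regression,
}

/// Classify a measured first-fail latency against the budget.
pub fn classify_first_fail(latency: Duration) -> DataplaneVerdict {
    if latency > DATAPLANE_REGRESSION_BUDGET {
        DataplaneVerdict::Regression
    } else {
        DataplaneVerdict::KnownBaseline
    }
}

/// Informational report line (hypothetical format).
pub fn report_line(latency: Duration) -> String {
    format!(
        "dataplane first-fail latency: {} ms ({:?}, budget {} ms)",
        latency.as_millis(),
        classify_first_fail(latency),
        DATAPLANE_REGRESSION_BUDGET.as_millis()
    )
}
```

A measurement near the ≈ 1.4 s floor classifies as `KnownBaseline`; only values above the budget are flagged.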

4. Diagnostic Infrastructure

Four feature-gated probes are preserved in the firmware source so any future investigation can reproduce the key findings:

| Feature | Install | Purpose |
|---|---|---|
| nic-rx-drop-probe | NIC_RX DROP on UDP/4791 at boot | Proves NIC_RX is BYPASSED for RoCEv2 on this firmware (verifier WRITE still succeeds); eliminates that steering domain |
| rdma-rx-ft-probe | Standalone CREATE_FLOW_TABLE on RDMA_RX | Proves FT creation with table_type=0x07 succeeds (contradicts prior “silent null” claim) |
| rdma-rx-catchall-drop | RDMA_RX match-anything DROP at boot | Proves the FT IS in the active dispatch path (verifier WRITE fails immediately pre-traffic) |
| rdma-rx-bth-mismatch-probe | RDMA_RX misc.bth_opcode=0x0A DROP at boot | Proves MISC parsing works, bth_opcode extraction works, DROP action works (all PRE-traffic) |

Each probe is enabled by building the bare-metal firmware with the feature flag in the table above and running the dataplane verifier harness against the resulting image.
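As a sketch of how such feature gating typically looks in Rust, the helper below reports which of the four probe features were compiled in. The feature names come from the table above; the `enabled_probes` function and the `PROBE_FEATURES` constant are hypothetical, and the firmware may gate its probes differently (for example with `#[cfg(feature = "...")]` on whole functions).

```rust
/// Probe feature names from the table above.
pub const PROBE_FEATURES: [&str; 4] = [
    "nic-rx-drop-probe",
    "rdma-rx-ft-probe",
    "rdma-rx-catchall-drop",
    "rdma-rx-bth-mismatch-probe",
];

/// Names of the probe features enabled at compile time.
/// `cfg!` evaluates to a compile-time boolean, so disabled branches cost nothing.
pub fn enabled_probes() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(feature = "nic-rx-drop-probe") {
        enabled.push(PROBE_FEATURES[0]);
    }
    if cfg!(feature = "rdma-rx-ft-probe") {
        enabled.push(PROBE_FEATURES[1]);
    }
    if cfg!(feature = "rdma-rx-catchall-drop") {
        enabled.push(PROBE_FEATURES[2]);
    }
    if cfg!(feature = "rdma-rx-bth-mismatch-probe") {
        enabled.push(PROBE_FEATURES[3]);
    }
    enabled
}
```

Logging the returned list at boot would make it obvious from a serial capture which probe a given image was built with.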

5. Resolution Options (if Stricter Property Is Ever Justified)

(a) Accept baseline (current posture)

The documented ≈ 1.4 s dataplane floor on this firmware revision is a hardware characteristic. Cooperating clients are unaffected; the control-plane contract is the primary revoke channel. This is the current answer.

(b) Pre-provision at LEASE_ALLOC

Install an ALLOW flow rule at LEASE_ALLOC (before the QP gets hot), then DELETE+SET to DROP at revoke. This is theoretical: it may or may not bypass the HCA cache. The reviewer does not consider it philosophically preferred, so it is acceptable only as an explicit mitigation experiment.

(c) SOURCE_QPN / underlay-QPN native path

The mlx5-native architectural answer. Per-lease underlay QP allocation via IB_QP_CREATE_SOURCE_QPN; flow rules attached to the underlay QP BEFORE the main QP is hot; misc.bth_dst_qp matching then engages the cache-invalidation machinery the hardware was designed for. Substantial machinery (per-lease extra QP + TIR + RQ wrapper). See kernel drivers/infiniband/hw/mlx5/fs.c:1374 for the reference implementation. This is the real path if the stricter property is declared worth pursuing.

(d) Kernel-trace cache-invalidation deep dive

If options (a)-(c) are all unsatisfactory, capture kernel mlx5_ib flow-steering byte sequences during actual IPoIB / underlay setup via dump_command dynamic debug on mlx5_core cmd.c, and replicate the EXACT sequence including any cache-invalidation commands not yet identified. High-effort; last resort.