testcluster: fix CrashNode isolation using Partitioner #166174

Open
pav-kv wants to merge 3 commits into cockroachdb:master from pav-kv:partitioner-crashnode

pav-kv (Collaborator) commented Mar 19, 2026

CrashNode's circuit breaker isolation is insufficient: it only blocks outbound RPCs from the crashing node. Server-side responses on existing gRPC streams (e.g. MsgAppResp sent during raft snapshot application) can still escape after CrashClone, leaking false durability signals into the cluster.

This PR replaces circuit breakers with the Partitioner's bidirectional stream interceptors, which block both SendMsg and RecvMsg on client streams.

Commit 1 moves the Partitioner from kvnemesis into TestCluster:

  • New EnablePartitioner flag on TestClusterArgs
  • TestCluster handles interceptor registration (AddServer) and address mapping (Start, startServer)
  • kvnemesis no longer manages the Partitioner directly; it uses tc.Partitioner() instead

Commit 2 fixes CrashNode isolation:

  • Replaces isolateNodeFromPeers (circuit breakers) with bidirectional partitions via AddPartition/RemovePartition
  • Partitions are added before CrashClone and removed after stopServerLocked
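The ordering described above can be sketched as follows. Only the names `AddPartition`, `RemovePartition`, `CrashClone`, and `stopServerLocked` come from the PR text; the `partitioner` interface, its string-based signatures, the `recorder` type, and the `crashNode` helper are assumptions made for this illustration and do not reflect the real TestCluster code.

```go
package main

import "fmt"

// partitioner is a hypothetical interface; the method names match the PR
// description, but the (from, to string) signatures are assumed.
type partitioner interface {
	AddPartition(from, to string)
	RemovePartition(from, to string)
}

// recorder logs every call so the ordering can be inspected.
type recorder struct{ events []string }

func (r *recorder) AddPartition(from, to string) {
	r.events = append(r.events, "add "+from+"->"+to)
}

func (r *recorder) RemovePartition(from, to string) {
	r.events = append(r.events, "remove "+from+"->"+to)
}

// crashNode isolates victim from every peer in both directions, runs the
// crash-clone and stop steps, and only then lifts the partitions.
func crashNode(p partitioner, victim string, peers []string, crashClone, stopServer func()) {
	for _, peer := range peers {
		p.AddPartition(victim, peer)
		p.AddPartition(peer, victim)
	}
	crashClone() // stands in for CrashClone
	stopServer() // stands in for stopServerLocked
	for _, peer := range peers {
		p.RemovePartition(victim, peer)
		p.RemovePartition(peer, victim)
	}
}

func main() {
	rec := &recorder{}
	crashNode(rec, "n3", []string{"n1", "n2"},
		func() { rec.events = append(rec.events, "CrashClone") },
		func() { rec.events = append(rec.events, "stopServer") },
	)
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

Keeping the partitions in place until the server has stopped means no message can cross between the victim and its peers in the window between the clone and the shutdown.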

Fixes #166145

Development

Successfully merging this pull request may close these issues.

release-26.2: kv/kvnemesis: TestKVNemesisMultiNode_Crash failed
