Skip to content

[kv] Support elect standby replicas for primary key table#2829

Merged
wuchong merged 2 commits into
apache:mainfrom
swuferhong:leader-elect-for-standby
May 5, 2026
Merged

[kv] Support elect standby replicas for primary key table#2829
wuchong merged 2 commits into
apache:mainfrom
swuferhong:leader-elect-for-standby

Conversation

@swuferhong
Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #2828

Brief change log

Tests

API and Format

Documentation

@swuferhong swuferhong force-pushed the leader-elect-for-standby branch 2 times, most recently from 5dd4398 to 4087a19 Compare March 10, 2026 08:44
@swuferhong swuferhong force-pushed the leader-elect-for-standby branch 2 times, most recently from 7159e5c to fbbfdb8 Compare April 20, 2026 06:02
@wuchong wuchong requested a review from Copilot April 20, 2026 08:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds “standby replica” support for primary-key (KV) tables by extending LeaderAndIsr to carry standby replicas and wiring that through ZooKeeper JSON, RPC protos, election logic, and the related unit/integration tests.

Changes:

  • Extend LeaderAndIsr/IsrState to include standbyReplicas and propagate through coordinator/replica workflows.
  • Add standby replicas to ZK JSON serde (v2) and to RPC messages (NotifyLeaderAndIsr, AdjustIsr).
  • Update leader-election implementations/tests to elect & promote a standby for PK tables.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
fluss-server/src/test/java/org/apache/fluss/server/zk/data/LeaderAndIsrJsonSerdeTest.java Updates serde test vectors to v2 + adds backward-compat test for v1.
fluss-server/src/test/java/org/apache/fluss/server/zk/ZooKeeperClientTest.java Adjusts ZK client leader/isr tests to include standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/testutils/FlussClusterExtension.java Adds helper to fetch standby replica from LeaderAndIsr in tests.
fluss-server/src/test/java/org/apache/fluss/server/tablet/TabletServiceITCase.java Propagates standby replicas through notify/transition test flows.
fluss-server/src/test/java/org/apache/fluss/server/tablet/TabletServerShutdownITCase.java Adjusts shutdown/failover expectations for updated LeaderAndIsr.
fluss-server/src/test/java/org/apache/fluss/server/replica/fetcher/ReplicaFetcherThreadTest.java Updates fetcher tests to pass standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/replica/fetcher/ReplicaFetcherManagerTest.java Updates fetcher manager tests to pass standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/replica/fetcher/ReplicaFetcherITCase.java Updates IT case to pass standby replicas in notifications.
fluss-server/src/test/java/org/apache/fluss/server/replica/ReplicaTestBase.java Updates replica test scaffolding to include standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/replica/ReplicaTest.java Updates replica tests to include standby replicas in LeaderAndIsr.
fluss-server/src/test/java/org/apache/fluss/server/replica/ReplicaManagerTest.java Updates replica manager tests for new LeaderAndIsr shape.
fluss-server/src/test/java/org/apache/fluss/server/replica/KvRecoverFromRemoteLogITCase.java Updates recovery IT case to include standby replicas in updates.
fluss-server/src/test/java/org/apache/fluss/server/replica/HighWatermarkPersistenceTest.java Updates HW persistence test to include standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/replica/AdjustIsrTest.java Updates ISR adjustment tests + expected message formatting.
fluss-server/src/test/java/org/apache/fluss/server/replica/AdjustIsrManagerTest.java Updates adjust-ISR manager tests for standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/replica/AdjustIsrITCase.java Updates ISR expand/shrink IT to keep standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/metadata/ZkBasedMetadataProviderTest.java Updates metadata provider tests to include standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/log/remote/RemoteLogTestBase.java Updates remote log test base to create LeaderAndIsr with standby.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachineTest.java Adds PK-table init-election tests + ensures table info present in context.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/ReplicaStateMachineTest.java Updates replica state machine tests for standby replicas.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/statemachine/ReplicaLeaderElectionTest.java Adds extensive PK standby election/promotion test coverage.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/TestCoordinatorGateway.java Ensures standby replicas are included in adjusted LeaderAndIsr responses.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorHighAvailabilityITCase.java Updates HA IT case to use new LeaderAndIsr constructor.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessorTest.java Avoids PK leader-election semantics for a rebalance test by using LOG table.
fluss-server/src/main/java/org/apache/fluss/server/zk/data/LeaderAndIsrJsonSerde.java Bumps JSON version to 2 and serializes/deserializes standby replicas.
fluss-server/src/main/java/org/apache/fluss/server/zk/data/LeaderAndIsr.java Adds standbyReplicas field and updates equality/hash/toString/newLeaderAndIsr.
fluss-server/src/main/java/org/apache/fluss/server/utils/ServerRpcMessageUtils.java Wires standby replicas through notify/adjust-isr request/response conversions.
fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java Stores standby replicas in ISR state; adds standby flag handling on follower transition.
fluss-server/src/main/java/org/apache/fluss/server/replica/IsrState.java Extends ISR state model to include standby replicas across committed/pending states.
fluss-server/src/main/java/org/apache/fluss/server/entity/NotifyLeaderAndIsrData.java Exposes standby replicas/array for RPC serialization.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/statemachine/TableBucketStateMachine.java Threads PK-table flag into election paths + adds init-election standby selection.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/statemachine/ReplicaStateMachine.java Preserves standby replicas when creating new LeaderAndIsr with NO_LEADER.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/statemachine/ReplicaLeaderElection.java Adds common election path + PK standby promotion/selection logic.
fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java Ensures standby replicas are preserved when producing new LeaderAndIsr.
fluss-rpc/src/main/proto/FlussApi.proto Adds standby_replicas to adjust-isr + notify leader/isr messages.
Comments suppressed due to low confidence (1)

fluss-server/src/test/java/org/apache/fluss/server/testutils/FlussClusterExtension.java:1

  • waitAndGetStandby can throw IndexOutOfBoundsException when standbyReplicas is empty. Since the method name implies waiting for readiness, it should wait until standbyReplicas becomes non-empty (or fail with a clear assertion/timeout), instead of unconditionally doing get(0).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +594 to +608
boolean isNowStandby = (standbyReplica == localTabletServerId);
boolean wasLeader = isLeader();
boolean wasStandby = this.isStandbyReplica;
if (isNowStandby) {
// Mark as standby immediately to ensure coordinator's state is consistent.
isStandbyReplica = true;
} else {
if (wasStandby || wasLeader) {
// standby -> follower or leader -> follower
if (wasStandby) {
isStandbyReplica = false;
}
}
// follower -> follower: do nothing.
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big +1, the wasStandby and wasLeader flag are very confusing here and makes the code hard to understand.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep isNowStandby, wasLeader, and wasStandby here. They will be useful for follow-up PRs, because the behavior differs depending on the specific transition — for example, leader→standby, leader→follower, and follower→standby each require different actions: some need to delete data, others need to download data. So we need to distinguish between these cases.

Copy link
Copy Markdown
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swuferhong I left some comments.

Comment on lines +594 to +608
boolean isNowStandby = (standbyReplica == localTabletServerId);
boolean wasLeader = isLeader();
boolean wasStandby = this.isStandbyReplica;
if (isNowStandby) {
// Mark as standby immediately to ensure coordinator's state is consistent.
isStandbyReplica = true;
} else {
if (wasStandby || wasLeader) {
// standby -> follower or leader -> follower
if (wasStandby) {
isStandbyReplica = false;
}
}
// follower -> follower: do nothing.
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big +1, the wasStandby and wasLeader flag are very confusing here and makes the code hard to understand.

}

// it should be from leader to follower, we need to destroy the kv tablet
dropKv();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropKv() is called unconditionally in onBecomeNewFollower(), even when the replica is designated as a standby. IIUC, a standby replica should retain the local kv rather than dropped. Is this to be implemented in a follow-up PR, or missed in this PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic will be implements in next pr: #2835

localTabletServerId,
leaderEpoch,
isrToSend,
currentState.standbyReplicas(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] If a PK bucket currently has no standby and another follower later catches up, prepareIsrExpand() sends the enlarged ISR but reuses currentState.standbyReplicas() unchanged here. That persists the expanded ISR with an empty standby list, so these buckets never gain a standby on the normal replica-recovery path until some later leader election. If the standby replicas is updated, we should also send a notification to the replica to make it ready for standby.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit complex and can be implemented in a separate PR. I'll add a TODO for now.

@swuferhong swuferhong force-pushed the leader-elect-for-standby branch from fbbfdb8 to a7fd14b Compare April 21, 2026 04:15
@swuferhong swuferhong force-pushed the leader-elect-for-standby branch from a7fd14b to bc33f2d Compare April 21, 2026 05:36
@swuferhong
Copy link
Copy Markdown
Contributor Author

@wuchong comments addressed.

Copy link
Copy Markdown
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@wuchong wuchong merged commit 01c13e8 into apache:main May 5, 2026
7 checks passed
@wuchong
Copy link
Copy Markdown
Member

wuchong commented May 5, 2026

@swuferhong I have merged this PR. However, to ensure backward compatibility, we should introduce a table configuration option to explicitly enable standby replicas for legacy tables. For newly created tables, standby replicas can be enabled by default.

@swuferhong
Copy link
Copy Markdown
Contributor Author

@swuferhong I have merged this PR. However, to ensure backward compatibility, we should introduce a table configuration option to explicitly enable standby replicas for legacy tables. For newly created tables, standby replicas can be enabled by default.

The linked issue: #3253

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support elect standby replicas for primary key table

3 participants