PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client#2479
Conversation
…h existing ZK based client
ClusterRoleRecord's ctor canonicalizes url1/url2 by lexical order, so existing.getRole2() returns the role for whichever URL sorts larger - not necessarily the peer URL. When the local peer cache is empty and we were trying to preserve existing peer role, we'd half the time inherit the local role instead, then the equality check would see a "logical change" and trigger a redundant CAS write. Manifests as flakes (deployment-dependent on ZK URL alphabetical order) in HAGroupStoreClientIT.testLegacyCrrSyncPreservesPreSeededRole2WhenPeerMissing and testLegacyCrrSyncStateOnlyChangeDoesNotRewriteLegacy under the full IT suite; clean 40/40 across two consecutive full-verify runs with the fix. Co-authored-by: Cursor <cursoragent@cursor.com>
Pure formatting changes from `mvn spotless:apply` across the touched files; no behavior change. Generated-by: Cursor (Claude). Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in mechanism for region servers to keep the legacy ZooKeeper-based HA ClusterRoleRecord (/phoenix/ha/<group>) synchronized from the newer consistentHA state (/phoenix/consistentHA/<group>), preserving compatibility for older ZK-registry Phoenix clients.
Changes:
- Introduces feature-flagged legacy CRR sync in
HAGroupStoreClient, including event-driven updates plus optional periodic reconciliation on a dedicated single-thread executor. - Adds PhoenixHAAdmin helpers for atomic read of
(data, stat)and typed legacy CRR writes with explicit CAS modes and consistent stale-version exception handling. - Updates
ClusterRoleRecordJSON handling to round-tripregistryType, adds a logical-equality helper, and expands unit/integration test coverage with new JSON fixtures.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| phoenix-core/src/test/resources/json/test_role_record_no_registry_type.json | Adds JSON fixture for deserialization when registryType is absent. |
| phoenix-core/src/test/resources/json/test_role_record_explicit_zk.json | Adds JSON fixture for explicit registryType=ZK round-trip tests. |
| phoenix-core/src/test/resources/json/test_role_record_explicit_rpc.json | Adds JSON fixture for explicit registryType=RPC round-trip tests. |
| phoenix-core/src/test/java/org/apache/phoenix/jdbc/ClusterRoleRecordTest.java | Adds tests for registryType JSON behavior and logical equality helper. |
| phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreClientIT.java | Adds IT coverage for legacy /phoenix/ha CRR sync behavior and edge cases. |
| phoenix-core-client/src/main/java/org/apache/phoenix/schema/MetaDataClient.java | Comment reflow only. |
| phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java | Adds defaults for new legacy CRR sync configuration keys. |
| phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServices.java | Defines new config key constants for legacy CRR sync. |
| phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/PhoenixHAAdmin.java | Adds atomic read helper and CAS write API for legacy CRR; fixes HAGroupStoreRecord atomic read. |
| phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreClient.java | Implements opt-in legacy CRR sync, NodeCache, executor lifecycle handling, and instance deregistration. |
| phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/ClusterRoleRecord.java | Makes registryType round-trip via JSON, adds explicit ctor and logical equality helper. |
| phoenix-core-client/src/main/java/org/apache/phoenix/exception/StaleClusterRoleRecordVersionException.java | New exception type for stale legacy CRR CAS operations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Offload the legacy CRR sync (it does ZK + JDBC I/O) so we don't block | ||
| // Curator's per-namespace event dispatcher. | ||
| ScheduledExecutorService syncExec = legacyCrrSyncExecutor; | ||
| if (syncExec != null) { | ||
| syncExec.execute(this::syncLegacyCRRIfRoleChanged); |
| /** | ||
| * Shuts down the periodic sync executor gracefully. | ||
| */ | ||
| /** | ||
| * Remove this instance from the static {@link #instances} map. Idempotent. Uses value-based |
| /** | ||
| * Equality on the six identity/role fields ({@code haGroupName, policy, url1, url2, role1, | ||
| * role2}); ignores {@code version} (always bumps) and {@code registryType} (avoids RPC->ZK | ||
| * thrash). Returns {@code false} if {@code other} is {@code null}. | ||
| */ | ||
| public boolean isLogicallyEqualIgnoringVersionAndRegistry(ClusterRoleRecord other) { | ||
| if (other == null) { | ||
| return false; | ||
| } | ||
| return Objects.equals(haGroupName, other.haGroupName) && Objects.equals(policy, other.policy) | ||
| && Objects.equals(url1, other.url1) && Objects.equals(url2, other.url2) | ||
| && role1 == other.role1 && role2 == other.role2; | ||
| } |
| @@ -983,32 +1017,65 @@ private void closePeerConnection() { | |||
| /** | |||
| * Shuts down the periodic sync executor gracefully. | |||
| */ | |||
| if (bucket == null) { | ||
| return; | ||
| } | ||
| bucket.remove(this.haGroupName, this); |
There was a problem hiding this comment.
Removing the inner map introduces a race condition that could cause a NPE.
close() is not synchronized so it can run concurrently with getInstanceForZkUrl's synchronized block, which includes this code:
instances.putIfAbsent(localZkUrl, new ConcurrentHashMap<>());
instances.get(localZkUrl).put(haGroupName, result);Consider:
- Thread A (getInstanceForZkUrl, in synchronized): instances.putIfAbsent(localZkUrl, …) — bucket already present, no-op.
- Thread B (close on a sibling instance for the same zkUrl): bucket.remove(siblingHaGroupName, sibling) empties the bucket; computeIfPresent removes the bucket from the outer map.
- Thread A: instances.get(localZkUrl) → null.
- Thread A: Attempt to call method put(haGroupName, result) with a null reference throws NPE inside the static synchronized block.
The new client is created but never registered, the synchronized lookup is poisoned for the caller, and on the next call we leak ZK watchers + executors by recreating without knowing about the orphan.
| private boolean shouldWriteLegacyCrr(ClusterRoleRecord existing) { | ||
| // Refuse to overwrite a non-ZK (admin-managed RPC) legacy CRR; live readers use its | ||
| // registryType to build connection strings, so swapping form would break them. | ||
| if (existing != null && existing.getRegistryType() != RegistryType.ZK) { |
There was a problem hiding this comment.
Do pre-existing legacy CRRs without registryType get permanently locked out?
ClusterRoleRecord.fromJson deserializes pre-PR JSON without a registryType field as RegistryType.RPC. shouldWriteLegacyCrr then bails out here without updating anything.
| private void setupLegacyCrrSync() throws Exception { | ||
| this.legacyHaAdmin = new PhoenixHAAdmin(this.zkUrl, conf, PHOENIX_HA_ZOOKEEPER_ZNODE_NAMESPACE); | ||
| this.legacyCrrNodeCache = new NodeCache(this.legacyHaAdmin.getCurator(), toPath(haGroupName)); | ||
| this.legacyCrrNodeCache.start(true); |
There was a problem hiding this comment.
NodeCache.start() does a synchronous getData() call to ZK on this thread, which is called by the the constructor. All of this is synchronized on HAGroupStoreClient.class in getInstanceForZkUrl.
New instance creation across HA groups is serialized. If ZK is unhealthy or slow there will be a head-of-line blocking here.
Is this expected?
| existing != null ? existing.getVersion() : -1L, desired.getVersion()); | ||
| } catch (StaleClusterRoleRecordVersionException stale) { | ||
| // CAS lost; next event/periodic cycle reconverges. | ||
| LOGGER.info("Legacy CRR CAS lost for HA group {} at expected stat version {}", haGroupName, |
There was a problem hiding this comment.
Drop this to DEBUG level.
Each RS independently runs the sync, so during a state transition every RS computes the same desired CRR and races on CAS. Only one can win and the other N−1 RSes will log this line at INFO level.
On a 500 node cluster, that is 499 unnecessary, imho, INFO level log lines every time this happens.
What changes were proposed in this pull request?
Adds an opt-in sync from
/phoenix/consistentHA/<G>to the legacy/phoenix/ha/<G>znode ineach region server's
HAGroupStoreClient. Each client derives a combinedClusterRoleRecord(local role + peer role) and writes it to its local
/phoenix/havia ZK stat-version CAS, sopre-consistentHA ZK-registry clients keep reading a current view of the HA group.
New config keys (in
QueryServices/QueryServicesOptions):phoenix.ha.legacy.crr.sync.enabled— master switch (defaultfalse).phoenix.ha.legacy.crr.reconciliation.interval.seconds— periodic reconciliation cadence(default
60;0disables only the periodic loop, the event-driven path still runs).Convergence model:
CHILD_ADDED/CHILD_UPDATED(LOCAL or PEER), offloaded ontoa per-group single-thread
ScheduledExecutorServiceso Curator's per-namespace eventdispatcher never blocks on ZK or JDBC I/O.
NodeCachewatcher on the legacy znode itself, so per-sync reads are in-memoryrather than a ZK round-trip. The NodeCache is eventually consistent, so on apparent absence
the sync path falls back to one authoritative
getData()to disambiguate "absent" from"not-yet-cached" before deciding
CREATE_NEWvsCAS_WITH_VERSION.BadVersion: log at INFO and bail. The next event/periodic cycle reconverges. Noinline retry.
Code-level changes:
ClusterRoleRecord: new explicit-RegistryTypector;@JsonCreatornow round-tripsregistryType(legacy znode is always written withRegistryType.ZKfor backwardcompatibility, and must read back as ZK). New
isLogicallyEqualIgnoringVersionAndRegistryhelper used as the rewrite short-circuit.
PhoenixHAAdmin:getClusterRoleRecordAndStatInZooKeeper— atomic (data, stat) read viastoringStatIn.createOrUpdateClusterRoleRecordWithCAS(name, record, LegacyCrrWriteMode mode, int expectedStatVersion)— typed write API.LegacyCrrWriteModeis a Phoenix-internalenum with values
CREATE_NEW,FORCE_OVERWRITE(@VisibleForTesting),CAS_WITH_VERSION; theintargument is meaningful only forCAS_WITH_VERSION(validated
>= 0at entry). BothBadVersion(CAS lost) andNodeExists(createlost) surface as
StaleClusterRoleRecordVersionExceptionso callers retryuniformly.
getHAGroupStoreRecordInZooKeeper(replaces apre-existing 2-RPC pattern that could return a stat version not matching its data
bytes).
HAGroupStoreClient:legacyHaAdmin+legacyCrrNodeCache+ dedicated periodic executor.Listener hooks for
CHILD_ADDED/CHILD_UPDATED.CHILD_REMOVEDis intentionally ano-op — the legacy
/phoenix/haznode is never deleted by this client.StaleClusterRoleRecordVersionException(CRR-specific analog ofStaleHAGroupStoreRecordVersionException).Concurrency / lifecycle invariants:
executor, so they are serialized naturally; no per-instance lock is held across ZK I/O.
so a fresh
HAGroupStoreClientdoes not blockgetInstanceForZkUrl's static class monitoron ZK / JDBC I/O.
syncLegacyCRRIfRoleChangedsnapshotslegacyHaAdmin/legacyCrrNodeCacheto localfinals before touching them, so a racing
close()cannot NPE an in-flight call or pushwrites through a torn-down Curator client.
close()marks the instanceisHealthy=falseand removes it from the staticinstancesmap up front, so a concurrent
getInstanceForZkUrlcannot hand out a half-closed client.Back-compat and divergence guards:
RegistryType.RPCwithmaster-form URLs). Existing readers feed the stored URLs through
JDBCUtil.formatUrl(url, roleRecord.getRegistryType())to build connection strings;swapping
registryType+ URLs out from under them would silently change their target. Thesync skips with a WARN and requires an explicit admin migration to ZK form before
proceeding.
ClusterRoleRecord's ctor would otherwise NPE onnull url2viaJDBCUtil.formatUrl.existing.role2in the desired CRRrather than downgrading to
UNKNOWN.UNKNOWNis strictly less information than apreviously-recorded role, and downgrading would cause a multi-RS write storm where peers
with differing visibility CAS-overwrite each other.
Why are the changes needed?
The Consistent Cluster Failover work moved HA group state from
/phoenix/ha(which held aClusterRoleRecord) to/phoenix/consistentHA(which holds anHAGroupStoreRecordand usesan RPC-registry view of CRRs). Existing Phoenix clients built against the older ZK-registry
contract still read
/phoenix/ha/<G>and expect:registryTypeto beZK.Without this patch, after a CCF rollout the
/phoenix/haznode goes stale or absent, and oldZK-registry clients break. This change keeps
/phoenix/hain sync with the new authoritative/phoenix/consistentHArecords on a per-RS basis, gated by an opt-in flag so it only runswhere the operator needs it.
Does this PR introduce any user-facing change?
Yes, but the new behavior is gated entirely behind the new master switch
(
phoenix.ha.legacy.crr.sync.enabled), which defaults tofalse.When the flag is enabled:
HAGroupStoreClientwrites to (and maintains) the legacy/phoenix/ha/<G>znode onits local ZK.
/phoenix/hawill see fresh recordsagain.
documented in
HAGroupStoreClient's class Javadoc.registryTypefield. Pre-PR clients using the default Jackson
ObjectMapper(
FAIL_ON_UNKNOWN_PROPERTIES=true) will reject the new bytes. All readers should beupgraded to a tolerant build before enabling this flag on any writer.
One incidental visible change independent of the flag:
ClusterRoleRecordJSONdeserialization now respects the persisted
registryTypefield. Previously it was hard-codedto
RPCregardless of JSON content. Existing JSON without aregistryTypefield stilldefaults to
RPC, so deployments not exposed to legacy-sync writes see no behavior change.How was this patch tested?
HAGroupStoreClientIT(40 cases, all passing) covering: feature-offno-op, initial create, role-changing transition, state-only transition (no rewrite),
peer-driven role flip, peer-absent →
role2=UNKNOWN→ recovery, CAS error mapping +write-mode dispatch (
CREATE_NEW,FORCE_OVERWRITE,CAS_WITH_VERSIONnegative-versionrejection), LOCAL/PEER
CHILD_REMOVED(znode preserved),registryTypestays ZK acrossmultiple role cycles, reconciliation interval
=0(event-driven still works), periodic-looprecovery after an external divergence, and the two halves of the peer-view contract:
peer absent → existing
role2preserved (no write fires), peer present → stalerole2reconciles to live state.
ClusterRoleRecordTest(21 cases, all passing) covering JSON backward-compat(missing
registryTypedefaults to RPC, explicit RPC round-trips, explicit ZK round-trips,full ZK toJson/fromJson round-trip preserves
registryType) and theisLogicallyEqualIgnoringVersionAndRegistryhelper (null, same-modulo-version, ZK-vs-RPCURL form, role mismatch, name mismatch). Backed by three JSON fixtures under
phoenix-core/src/test/resources/json/.mvn spotless:checkclean across the touched modules.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Anthropic) and Cursor (Claude).