Skip to content

PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client#2479

Open
ritegarg wants to merge 3 commits into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7787-legacy-crr-sync
Open

PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client#2479
ritegarg wants to merge 3 commits into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7787-legacy-crr-sync

Conversation

@ritegarg
Copy link
Copy Markdown
Contributor

@ritegarg ritegarg commented May 16, 2026

What changes were proposed in this pull request?

Adds an opt-in sync from /phoenix/consistentHA/<G> to the legacy /phoenix/ha/<G> znode in
each region server's HAGroupStoreClient. Each client derives a combined ClusterRoleRecord
(local role + peer role) and writes it to its local /phoenix/ha via ZK stat-version CAS, so
pre-consistentHA ZK-registry clients keep reading a current view of the HA group.

New config keys (in QueryServices / QueryServicesOptions):

  • phoenix.ha.legacy.crr.sync.enabled — master switch (default false).
  • phoenix.ha.legacy.crr.reconciliation.interval.seconds — periodic reconciliation cadence
    (default 60; 0 disables only the periodic loop, the event-driven path still runs).

Convergence model:

  • Event-driven on consistentHA CHILD_ADDED / CHILD_UPDATED (LOCAL or PEER), offloaded onto
    a per-group single-thread ScheduledExecutorService so Curator's per-namespace event
    dispatcher never blocks on ZK or JDBC I/O.
  • Periodic reconciler on the same executor (0–30s jitter on first run).
  • Curator NodeCache watcher on the legacy znode itself, so per-sync reads are in-memory
    rather than a ZK round-trip. The NodeCache is eventually consistent, so on apparent absence
    the sync path falls back to one authoritative getData() to disambiguate "absent" from
    "not-yet-cached" before deciding CREATE_NEW vs CAS_WITH_VERSION.
  • On CAS BadVersion: log at INFO and bail. The next event/periodic cycle reconverges. No
    inline retry.

Code-level changes:

  • ClusterRoleRecord: new explicit-RegistryType ctor; @JsonCreator now round-trips
    registryType (legacy znode is always written with RegistryType.ZK for backward
    compatibility, and must read back as ZK). New isLogicallyEqualIgnoringVersionAndRegistry
    helper used as the rewrite short-circuit.
  • PhoenixHAAdmin:
    • Three new helpers on the legacy namespace:
      • getClusterRoleRecordAndStatInZooKeeper — atomic (data, stat) read via
        storingStatIn.
      • createOrUpdateClusterRoleRecordWithCAS(name, record, LegacyCrrWriteMode mode, int expectedStatVersion) — typed write API. LegacyCrrWriteMode is a Phoenix-internal
        enum with values CREATE_NEW, FORCE_OVERWRITE (@VisibleForTesting),
        CAS_WITH_VERSION; the int argument is meaningful only for CAS_WITH_VERSION
        (validated >= 0 at entry). Both BadVersion (CAS lost) and NodeExists (create
        lost) surface as StaleClusterRoleRecordVersionException so callers retry
        uniformly.
    • Atomic-read fix on the existing getHAGroupStoreRecordInZooKeeper (replaces a
      pre-existing 2-RPC pattern that could return a stat version not matching its data
      bytes).
  • HAGroupStoreClient: legacyHaAdmin + legacyCrrNodeCache + dedicated periodic executor.
    Listener hooks for CHILD_ADDED / CHILD_UPDATED. CHILD_REMOVED is intentionally a
    no-op — the legacy /phoenix/ha znode is never deleted by this client.
  • New StaleClusterRoleRecordVersionException (CRR-specific analog of
    StaleHAGroupStoreRecordVersionException).

Concurrency / lifecycle invariants:

  • All sync invocations (initial, event-driven, periodic) route through the single-thread
    executor, so they are serialized naturally; no per-instance lock is held across ZK I/O.
  • The initial sync is submitted to the executor rather than invoked inline in the constructor,
    so a fresh HAGroupStoreClient does not block getInstanceForZkUrl's static class monitor
    on ZK / JDBC I/O.
  • syncLegacyCRRIfRoleChanged snapshots legacyHaAdmin / legacyCrrNodeCache to local
    finals before touching them, so a racing close() cannot NPE an in-flight call or push
    writes through a torn-down Curator client.
  • close() marks the instance isHealthy=false and removes it from the static instances
    map up front, so a concurrent getInstanceForZkUrl cannot hand out a half-closed client.

Back-compat and divergence guards:

  • Refuse to overwrite a non-ZK admin-managed legacy CRR (typically RegistryType.RPC with
    master-form URLs). Existing readers feed the stored URLs through
    JDBCUtil.formatUrl(url, roleRecord.getRegistryType()) to build connection strings;
    swapping registryType + URLs out from under them would silently change their target. The
    sync skips with a WARN and requires an explicit admin migration to ZK form before
    proceeding.
  • Wait for a non-blank peer ZK URL on the local record before building the desired CRR;
    ClusterRoleRecord's ctor would otherwise NPE on null url2 via JDBCUtil.formatUrl.
  • When the local peer cache returns null, preserve existing.role2 in the desired CRR
    rather than downgrading to UNKNOWN. UNKNOWN is strictly less information than a
    previously-recorded role, and downgrading would cause a multi-RS write storm where peers
    with differing visibility CAS-overwrite each other.

Why are the changes needed?

The Consistent Cluster Failover work moved HA group state from /phoenix/ha (which held a
ClusterRoleRecord) to /phoenix/consistentHA (which holds an HAGroupStoreRecord and uses
an RPC-registry view of CRRs). Existing Phoenix clients built against the older ZK-registry
contract still read /phoenix/ha/<G> and expect:

  • The znode to exist for any live HA group.
  • The record's registryType to be ZK.
  • The roles to reflect the latest state of both clusters.

Without this patch, after a CCF rollout the /phoenix/ha znode goes stale or absent, and old
ZK-registry clients break. This change keeps /phoenix/ha in sync with the new authoritative
/phoenix/consistentHA records on a per-RS basis, gated by an opt-in flag so it only runs
where the operator needs it.

Does this PR introduce any user-facing change?

Yes, but the new behavior is gated entirely behind the new master switch
(phoenix.ha.legacy.crr.sync.enabled), which defaults to false.

When the flag is enabled:

  • Each HAGroupStoreClient writes to (and maintains) the legacy /phoenix/ha/<G> znode on
    its local ZK.
  • Old pre-consistentHA ZK-registry clients pointed at /phoenix/ha will see fresh records
    again.
  • The flag is read once at construction; toggling it off requires a process restart. This is
    documented in HAGroupStoreClient's class Javadoc.
  • Rollout constraint: the persisted legacy znode now carries the new registryType
    field. Pre-PR clients using the default Jackson ObjectMapper
    (FAIL_ON_UNKNOWN_PROPERTIES=true) will reject the new bytes. All readers should be
    upgraded to a tolerant build before enabling this flag on any writer.

One incidental visible change independent of the flag: ClusterRoleRecord JSON
deserialization now respects the persisted registryType field. Previously it was hard-coded
to RPC regardless of JSON content. Existing JSON without a registryType field still
defaults to RPC, so deployments not exposed to legacy-sync writes see no behavior change.

How was this patch tested?

  • Integration tests in HAGroupStoreClientIT (40 cases, all passing) covering: feature-off
    no-op, initial create, role-changing transition, state-only transition (no rewrite),
    peer-driven role flip, peer-absent → role2=UNKNOWN → recovery, CAS error mapping +
    write-mode dispatch (CREATE_NEW, FORCE_OVERWRITE, CAS_WITH_VERSION negative-version
    rejection), LOCAL/PEER CHILD_REMOVED (znode preserved), registryType stays ZK across
    multiple role cycles, reconciliation interval =0 (event-driven still works), periodic-loop
    recovery after an external divergence, and the two halves of the peer-view contract:
    peer absent → existing role2 preserved (no write fires), peer present → stale role2
    reconciles to live state.
  • Unit tests in ClusterRoleRecordTest (21 cases, all passing) covering JSON backward-compat
    (missing registryType defaults to RPC, explicit RPC round-trips, explicit ZK round-trips,
    full ZK toJson/fromJson round-trip preserves registryType) and the
    isLogicallyEqualIgnoringVersionAndRegistry helper (null, same-modulo-version, ZK-vs-RPC
    URL form, role mismatch, name mismatch). Backed by three JSON fixtures under
    phoenix-core/src/test/resources/json/.
  • mvn spotless:check clean across the touched modules.
  • Ran all ITs locally, no new failures due to this PR

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic) and Cursor (Claude).

Ritesh Garg and others added 3 commits May 16, 2026 16:34
ClusterRoleRecord's ctor canonicalizes url1/url2 by lexical order, so
existing.getRole2() returns the role for whichever URL sorts larger -
not necessarily the peer URL. When the local peer cache is empty and we
were trying to preserve existing peer role, we'd half the time inherit
the local role instead, then the equality check would see a "logical
change" and trigger a redundant CAS write.

Manifests as flakes (deployment-dependent on ZK URL alphabetical order)
in HAGroupStoreClientIT.testLegacyCrrSyncPreservesPreSeededRole2WhenPeerMissing
and testLegacyCrrSyncStateOnlyChangeDoesNotRewriteLegacy under the full
IT suite; clean 40/40 across two consecutive full-verify runs with the
fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pure formatting changes from `mvn spotless:apply` across the touched files;
no behavior change.

Generated-by: Cursor (Claude).
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in mechanism for region servers to keep the legacy ZooKeeper-based HA ClusterRoleRecord (/phoenix/ha/<group>) synchronized from the newer consistentHA state (/phoenix/consistentHA/<group>), preserving compatibility for older ZK-registry Phoenix clients.

Changes:

  • Introduces feature-flagged legacy CRR sync in HAGroupStoreClient, including event-driven updates plus optional periodic reconciliation on a dedicated single-thread executor.
  • Adds PhoenixHAAdmin helpers for atomic read of (data, stat) and typed legacy CRR writes with explicit CAS modes and consistent stale-version exception handling.
  • Updates ClusterRoleRecord JSON handling to round-trip registryType, adds a logical-equality helper, and expands unit/integration test coverage with new JSON fixtures.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
phoenix-core/src/test/resources/json/test_role_record_no_registry_type.json Adds JSON fixture for deserialization when registryType is absent.
phoenix-core/src/test/resources/json/test_role_record_explicit_zk.json Adds JSON fixture for explicit registryType=ZK round-trip tests.
phoenix-core/src/test/resources/json/test_role_record_explicit_rpc.json Adds JSON fixture for explicit registryType=RPC round-trip tests.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/ClusterRoleRecordTest.java Adds tests for registryType JSON behavior and logical equality helper.
phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreClientIT.java Adds IT coverage for legacy /phoenix/ha CRR sync behavior and edge cases.
phoenix-core-client/src/main/java/org/apache/phoenix/schema/MetaDataClient.java Comment reflow only.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java Adds defaults for new legacy CRR sync configuration keys.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServices.java Defines new config key constants for legacy CRR sync.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/PhoenixHAAdmin.java Adds atomic read helper and CAS write API for legacy CRR; fixes HAGroupStoreRecord atomic read.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreClient.java Implements opt-in legacy CRR sync, NodeCache, executor lifecycle handling, and instance deregistration.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/ClusterRoleRecord.java Makes registryType round-trip via JSON, adds explicit ctor and logical equality helper.
phoenix-core-client/src/main/java/org/apache/phoenix/exception/StaleClusterRoleRecordVersionException.java New exception type for stale legacy CRR CAS operations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +916 to +920
// Offload the legacy CRR sync (it does ZK + JDBC I/O) so we don't block
// Curator's per-namespace event dispatcher.
ScheduledExecutorService syncExec = legacyCrrSyncExecutor;
if (syncExec != null) {
syncExec.execute(this::syncLegacyCRRIfRoleChanged);
Comment on lines 1017 to +1021
/**
* Shuts down the periodic sync executor gracefully.
*/
/**
* Remove this instance from the static {@link #instances} map. Idempotent. Uses value-based
Comment on lines +244 to +256
/**
* Equality on the six identity/role fields ({@code haGroupName, policy, url1, url2, role1,
* role2}); ignores {@code version} (always bumps) and {@code registryType} (avoids RPC->ZK
* thrash). Returns {@code false} if {@code other} is {@code null}.
*/
public boolean isLogicallyEqualIgnoringVersionAndRegistry(ClusterRoleRecord other) {
if (other == null) {
return false;
}
return Objects.equals(haGroupName, other.haGroupName) && Objects.equals(policy, other.policy)
&& Objects.equals(url1, other.url1) && Objects.equals(url2, other.url2)
&& role1 == other.role1 && role2 == other.role2;
}
@@ -983,32 +1017,65 @@ private void closePeerConnection() {
/**
* Shuts down the periodic sync executor gracefully.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale javadoc ?

if (bucket == null) {
return;
}
bucket.remove(this.haGroupName, this);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the inner map introduces a race condition that could cause a NPE.

close() is not synchronized so it can run concurrently with getInstanceForZkUrl's synchronized block, which includes this code:

            instances.putIfAbsent(localZkUrl, new ConcurrentHashMap<>());
            instances.get(localZkUrl).put(haGroupName, result);

Consider:

  1. Thread A (getInstanceForZkUrl, in synchronized): instances.putIfAbsent(localZkUrl, …) — bucket already present, no-op.
  2. Thread B (close on a sibling instance for the same zkUrl): bucket.remove(siblingHaGroupName, sibling) empties the bucket; computeIfPresent removes the bucket from the outer map.
  3. Thread A: instances.get(localZkUrl) → null.
  4. Thread A: Attempt to call method put(haGroupName, result) with a null reference throws NPE inside the static synchronized block.

The new client is created but never registered, the synchronized lookup is poisoned for the caller, and on the next call we leak ZK watchers + executors by recreating without knowing about the orphan.

private boolean shouldWriteLegacyCrr(ClusterRoleRecord existing) {
// Refuse to overwrite a non-ZK (admin-managed RPC) legacy CRR; live readers use its
// registryType to build connection strings, so swapping form would break them.
if (existing != null && existing.getRegistryType() != RegistryType.ZK) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do pre-existing legacy CRRs without registryType get permanently locked out?

ClusterRoleRecord.fromJson deserializes pre-PR JSON without a registryType field as RegistryType.RPC. shouldWriteLegacyCrr then bails out here without updating anything.

private void setupLegacyCrrSync() throws Exception {
this.legacyHaAdmin = new PhoenixHAAdmin(this.zkUrl, conf, PHOENIX_HA_ZOOKEEPER_ZNODE_NAMESPACE);
this.legacyCrrNodeCache = new NodeCache(this.legacyHaAdmin.getCurator(), toPath(haGroupName));
this.legacyCrrNodeCache.start(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeCache.start() does a synchronous getData() call to ZK on this thread, which is called by the the constructor. All of this is synchronized on HAGroupStoreClient.class in getInstanceForZkUrl.

New instance creation across HA groups is serialized. If ZK is unhealthy or slow there will be a head-of-line blocking here.

Is this expected?

existing != null ? existing.getVersion() : -1L, desired.getVersion());
} catch (StaleClusterRoleRecordVersionException stale) {
// CAS lost; next event/periodic cycle reconverges.
LOGGER.info("Legacy CRR CAS lost for HA group {} at expected stat version {}", haGroupName,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop this to DEBUG level.

Each RS independently runs the sync, so during a state transition every RS computes the same desired CRR and races on CAS. Only one can win and the other N−1 RSes will log this line at INFO level.

On a 500 node cluster, that is 499 unnecessary, imho, INFO level log lines every time this happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants