Skip to content

Conversation

@zhaizhibo
Copy link
Contributor

@zhaizhibo zhaizhibo commented Feb 10, 2026

Motivation

When using PulsarZooKeeperClient with sessionWatcher enabled, a race condition during ZooKeeper session reconnection can cause operations to fail with SESSIONEXPIRED errors even after the session has been successfully re-established. This may also result in the loss of notifications for the SessionReestablished event.

In ZKSessionWatcher, currentStatus = SessionEvent.SessionLost. Now, we reconnect to zk,and trigger ZKSessionWatcher::checkState(SyncConnected).

thread 1 thread 2 thread 3
sessionListener.accept(SessionReestablished)
ZKMetadataStore::receivedSessionEvent
currentStatus = SessionEvent.SessionReestablished
addWatcher failed, cause old zk has expired
sessionWatcher.setSessionInvalid(), as currentStatus = SessionEvent.SessionLost
trigger PulsarZooKeeperClient.clientCreator, so we close the old zk client
sessionListener.accept(SessionReestablished)
ZKMetadataStore::receivedSessionEvent
addWatcher failed, cause old zk has bean closed, failed in the same thread.(org.apache.zookeeper.ClientCnxn#conLossPacket)
sessionWatcher.setSessionInvalid(),as currentStatus = SessionEvent.SessionLost
currentStatus = SessionEvent.SessionReestablished
create new zk client, and set new zk in PulsarZooKeeperClient::zk

After this, because the state has already been updated to SessionReestablished, the next time the session state is checked, sessionListener.accept(SessionEvent.SessionReestablished); will no longer be triggered.

Modifications

  1. set currentStatus to SessionReestablished before invoking sessionListener.accept(SessionEvent.SessionReestablished), ensuring this state update happens-before any subsequent modifications.
  2. introduce a read-write lock to synchronize updates to the ZooKeeper client.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Feb 10, 2026
};

private ZooKeeper getZkHandle() {
lock.readLock().lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this read lock is helping here. By the time we get the object and we release the lock it can be closed and reset. I would be the same as the zk.get()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary fix that actually works is in modification 1, which sets currentStatus before calling sessionListener.accept. Adding a read-write lock merely reduces the probability of operations using the old ZooKeeper client. Once a new client needs to be created, incoming operations must wait until the new client creation is complete.

@lhotari lhotari changed the title fix race condition in ZooKeeper session reconnection leading to SESSIONEXPIRED Errors [fix][meta] Fix race condition in ZooKeeper session reconnection leading to SESSIONEXPIRED Errors Feb 11, 2026
@lhotari lhotari added this to the 4.2.0 milestone Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs ready-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants