@ddanielr

Modifies ZooZap to use the same deletion logic as ServiceLock.

@ddanielr ddanielr added this to the 2.1.5 milestone Jan 10, 2026
@ddanielr ddanielr changed the title Consolidates the ServerLock deletion code Consolidate the ServerLock deletion code Jan 10, 2026
When the manager is gathering per-table info it will attempt to use the
existing tserver connection to request table information.

If an exception is thrown during this request then the manager will
record the failure by computing an entry in the badServers map. If a
tserver is unable to be reached by the manager for long periods of time,
the manager will attempt to communicate with the tserver 3 times
before attempting a halt action.

This commit adds a new property for the maximum number of halt requests.
Once that many halt requests have been attempted, the manager is allowed
to delete the tserver's zlock.
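
The escalation described in this commit message can be sketched roughly as below. This is an illustrative sketch, not the Manager code: the class `HaltEscalation`, the `Action` enum, and both constructor parameters are hypothetical names. It also assumes that the third failed status request is the one that triggers the first halt, and that a maximum of 0 disables lock deletion.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names do not come from the Accumulo code base.
class HaltEscalation {
  enum Action { NONE, REQUEST_HALT, DELETE_LOCK }

  private final int maxBadStatusCount; // failed status requests before a halt (3 in the text above)
  private final int maxHaltRequests;   // assumed: 0 means "never delete the lock"
  private final Map<String, AtomicInteger> badServers = new ConcurrentHashMap<>();
  private final Map<String, AtomicInteger> haltRequests = new ConcurrentHashMap<>();

  HaltEscalation(int maxBadStatusCount, int maxHaltRequests) {
    this.maxBadStatusCount = maxBadStatusCount;
    this.maxHaltRequests = maxHaltRequests;
  }

  // Called each time a status request to the given tserver throws.
  Action onStatusFailure(String server) {
    int failures =
        badServers.computeIfAbsent(server, k -> new AtomicInteger()).incrementAndGet();
    if (failures < maxBadStatusCount) {
      return Action.NONE; // still within the tolerated number of failures
    }
    AtomicInteger halts = haltRequests.computeIfAbsent(server, k -> new AtomicInteger());
    if (maxHaltRequests > 0 && halts.get() >= maxHaltRequests) {
      return Action.DELETE_LOCK; // halt requests exhausted, escalate to lock removal
    }
    halts.incrementAndGet();
    return Action.REQUEST_HALT;
  }

  // Called when the tserver answers again; all counters are reset.
  void onStatusSuccess(String server) {
    badServers.remove(server);
    haltRequests.remove(server);
  }
}
```

With `new HaltEscalation(3, 2)`, the first two failures are only recorded, the third and fourth each produce a halt request, and the fifth escalates to lock deletion.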
@ddanielr ddanielr force-pushed the feature/6044-force-halt-tserver branch from 5943be4 to 87ae926 Compare January 10, 2026 05:12
…er.java

Co-authored-by: Dave Marion <dlmarion@apache.org>
"The number of threads used to run fault-tolerant executions (FATE)."
+ " These are primarily table operations like merge.",
"1.4.3"),
MANAGER_MAX_TSERVER_HALTS("manager.max.tservers.halts", "0", PropertyType.COUNT,

@keith-turner keith-turner Jan 12, 2026

Counts can be hard to reason about because you have no idea how long it will take to do X counts. Like if this was set to 3, three attempts could happen in 50ms or take 30min. For the 50ms case, you would probably want to give it a bit more time before deleting the lock. For the 30min case, may want to delete the lock before doing 3 attempts. A combo of time and attempts seems best, but not sure how to do that.
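
One possible shape for a combined time-and-attempts rule, purely as a sketch of the idea in this comment and not something the PR adopted: require both a minimum wait and a minimum number of attempts, but let a hard upper time bound override the attempt count. `HaltEscalationPolicy` and all of its parameters are hypothetical.

```java
import java.time.Duration;

// Hypothetical policy object; not part of the PR.
class HaltEscalationPolicy {
  private final int maxAttempts;  // halt attempts normally required before deleting the lock
  private final Duration minWait; // never delete the lock sooner than this
  private final Duration maxWait; // always delete the lock after this, regardless of attempts

  HaltEscalationPolicy(int maxAttempts, Duration minWait, Duration maxWait) {
    this.maxAttempts = maxAttempts;
    this.minWait = minWait;
    this.maxWait = maxWait;
  }

  // attempts: halt requests sent so far; elapsed: time since the first halt request.
  boolean shouldDeleteLock(int attempts, Duration elapsed) {
    if (elapsed.compareTo(maxWait) >= 0) {
      return true; // slow attempts (the 30min case): do not wait for the full count
    }
    // fast attempts (the 50ms case): the count alone is not enough, minWait must also pass
    return attempts >= maxAttempts && elapsed.compareTo(minWait) >= 0;
  }
}
```

With `new HaltEscalationPolicy(3, Duration.ofSeconds(30), Duration.ofMinutes(10))`, three attempts within 50ms do not delete the lock, while a single attempt that has been outstanding for 30 minutes does.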

Contributor Author

Talked this over with @ctubbsii and settled on using a time-based max timeout for halt requests.

The number of attempts is already based on the max rpc request timeout and number of failed communication attempts to the tserver.

If the tserver is able to successfully return information after a halt is called on it, then this code path isn't followed.

What is unclear is whether connections can still be attempted to a tserver that is in the process of halting.

Moves the rpc halt request into an else statement so an RPC halt request
is not attempted on a tserver without a zlock.

Moved the server removal from the halted map into the try section so it
doesn't get removed if the zlock removal failed.

Changes the naming for the property and some vars to better describe
intent.

Switches the halt logic to be time-based since number of attempts is
based on general.rpc.timeout.
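
A rough sketch of the control flow these commits describe, under stated assumptions: the class `TimedHaltSketch`, the `LockRemover` and `HaltRequester` interfaces, and the `maxHaltWait` parameter are hypothetical stand-ins, and the exact branch conditions in the PR may differ. It shows the RPC halt living in the else branch and the halted-map entry being removed only inside the try, after the lock removal succeeded.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the interfaces stand in for ZooKeeper lock removal and the halt RPC.
class TimedHaltSketch {
  interface LockRemover { void deleteLock(String server) throws Exception; }
  interface HaltRequester { void requestHalt(String server); }

  private final Duration maxHaltWait; // assumed time-based limit for halt requests
  private final LockRemover lockRemover;
  private final HaltRequester haltRequester;
  // server -> time (ms) of the first halt request
  private final Map<String, Long> haltedSinceMillis = new HashMap<>();

  TimedHaltSketch(Duration maxHaltWait, LockRemover lockRemover, HaltRequester haltRequester) {
    this.maxHaltWait = maxHaltWait;
    this.lockRemover = lockRemover;
    this.haltRequester = haltRequester;
  }

  void handleUnresponsive(String server, long nowMillis) {
    Long firstHalt = haltedSinceMillis.get(server);
    if (firstHalt != null && nowMillis - firstHalt >= maxHaltWait.toMillis()) {
      try {
        lockRemover.deleteLock(server);
        // Removed only after the lock removal succeeded.
        haltedSinceMillis.remove(server);
      } catch (Exception e) {
        // Entry is kept so the removal is retried on the next pass.
      }
    } else {
      // The RPC halt is only sent on this branch, never after lock removal.
      haltedSinceMillis.putIfAbsent(server, nowMillis);
      haltRequester.requestHalt(server);
    }
  }

  boolean isTracked(String server) {
    return haltedSinceMillis.containsKey(server);
  }
}
```

If `deleteLock` throws, the server stays in the map and the next call retries the removal instead of sending another halt RPC.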
@ddanielr ddanielr changed the title Consolidate the ServerLock deletion code Manager now halts tservers via lock removal Jan 17, 2026