Manager now halts tservers via lock removal #6049

ddanielr · 2026-01-10T00:52:56Z

Modifies ZooZap to use the same deletion logic as ServiceLock.

When the manager gathers per-table info, it attempts to use the existing tserver connection to request table information. If an exception is thrown during this request, the manager records the failure by computing an entry in badServers. For a tserver that the manager is unable to reach for long periods of time, the manager will attempt to communicate with the tserver 3 times before attempting a halt action. This commit adds a new property, the maximum number of halt requests, that allows the manager to delete a tserver's zlock once that many halt requests have been attempted.
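The failure bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in Manager.java: the class `BadServerTracker`, its method names, and the choice to clear the count on a successful response are all assumptions made for the example. Only the idea of a badServers entry per unreachable tserver and the threshold of 3 come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from Manager.java. */
final class BadServerTracker {
  private final int failuresBeforeHalt;
  // Maps a tserver address to its number of consecutive failed requests.
  private final Map<String,Integer> badServers = new HashMap<>();

  BadServerTracker(int failuresBeforeHalt) {
    this.failuresBeforeHalt = failuresBeforeHalt;
  }

  /** Records one failed request. Returns true once a halt should be attempted. */
  boolean recordFailure(String server) {
    int failures = badServers.merge(server, 1, Integer::sum);
    return failures >= failuresBeforeHalt;
  }

  /** Assumption: a successful response clears the failure history. */
  void recordSuccess(String server) {
    badServers.remove(server);
  }

  int failureCount(String server) {
    return badServers.getOrDefault(server, 0);
  }
}
```

With a threshold of 3, the first two calls to `recordFailure` for a server return false and the third returns true.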

server/manager/src/main/java/org/apache/accumulo/manager/Manager.java

core/src/main/java/org/apache/accumulo/core/conf/Property.java

keith-turner · 2026-01-12T17:35:15Z

core/src/main/java/org/apache/accumulo/core/conf/Property.java

      "The number of threads used to run fault-tolerant executions (FATE)."
          + " These are primarily table operations like merge.",
      "1.4.3"),
+  MANAGER_MAX_TSERVER_HALTS("manager.max.tservers.halts", "0", PropertyType.COUNT,


Counts can be hard to reason about because you have no idea how long it will take to do X counts. For example, if this was set to 3, three attempts could happen in 50ms or take 30min. For the 50ms case, you would probably want to give it a bit more time before deleting the lock. For the 30min case, you may want to delete the lock before doing 3 attempts. A combo of time and attempts seems best, but I'm not sure how to do that.

Talked this over with @ctubbsii and settled on using a time-based max timeout for halt requests.

The number of attempts is already based on the max rpc request timeout and number of failed communication attempts to the tserver.

If the tserver is able to successfully return information after a halt is called on it, then this code path isn't followed.

What is unclear is whether connections can still be attempted to a tserver that is in the process of halting.
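A time-based limit of the kind described in this reply could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `HaltTimeoutTracker`, `shouldDeleteLock`, and `clear` are hypothetical names, and the caller is assumed to pass in a monotonic clock reading such as `System.nanoTime()`.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a time-based halt limit; names are illustrative only. */
final class HaltTimeoutTracker {
  private final long maxHaltNanos;
  // Maps a tserver address to the time the first halt request was recorded.
  private final Map<String,Long> firstHaltRequest = new HashMap<>();

  HaltTimeoutTracker(Duration maxHaltTime) {
    this.maxHaltNanos = maxHaltTime.toNanos();
  }

  /**
   * Remembers the first time a halt was requested for the server and returns
   * true once the configured duration has elapsed since that first request.
   */
  boolean shouldDeleteLock(String server, long nowNanos) {
    long first = firstHaltRequest.computeIfAbsent(server, s -> nowNanos);
    return nowNanos - first >= maxHaltNanos;
  }

  /** Forgets the server, for example after it responds successfully. */
  void clear(String server) {
    firstHaltRequest.remove(server);
  }
}
```

Because the decision depends only on elapsed time, it behaves the same whether three attempts take 50ms or 30min, which is the concern raised in the review comment.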

Consolidates the ServerLock deletion code

1a8e74b

Modifies ZooZap to use the same deletion logic as ServiceLock.

ddanielr added this to the 2.1.5 milestone Jan 10, 2026

ddanielr changed the title ~~Consolidates the ServerLock deletion code~~ Consolidate the ServerLock deletion code Jan 10, 2026

ddanielr force-pushed the feature/6044-force-halt-tserver branch from 5943be4 to 87ae926 Compare January 10, 2026 05:12

cks-code approved these changes Jan 12, 2026

dlmarion reviewed Jan 12, 2026

Update server/manager/src/main/java/org/apache/accumulo/manager/Manag…

ebf470d

…er.java Co-authored-by: Dave Marion <dlmarion@apache.org>

keith-turner reviewed Jan 12, 2026

ddanielr added 3 commits January 12, 2026 19:53

Moves traditional rpc halt to else statement

73ef5b6

Moves the rpc halt request into an else statement so an RPC halt request is not attempted on a tserver without a zlock. Moved the server removal from the halted map into the try section so it doesn't get removed if the zlock removal failed.
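The control flow this commit message describes might be sketched as below. Everything here is hypothetical: `HaltFlowSketch`, the `LockDeleter` and `HaltRpc` interfaces, and the `timeoutExceeded` flag are stand-ins invented for the example, and the real condition that selects lock deletion is not shown in the source. The sketch only illustrates two points from the message: the RPC halt lives in the else branch, and the entry in the halted map is removed inside the try, so it survives a failed lock deletion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the branch structure; not the code in Manager.java. */
final class HaltFlowSketch {
  /** Stand-in for deleting the tserver's zlock. */
  interface LockDeleter { void deleteLock(String server) throws Exception; }
  /** Stand-in for the traditional RPC halt request. */
  interface HaltRpc { void halt(String server) throws Exception; }

  enum Action { LOCK_DELETED, LOCK_DELETE_FAILED, HALT_REQUESTED, HALT_FAILED }

  // Maps a tserver address to the time its first halt request was made.
  final Map<String,Long> haltedServers = new HashMap<>();

  Action handle(String server, boolean timeoutExceeded, LockDeleter deleter, HaltRpc rpc) {
    if (timeoutExceeded) {
      try {
        deleter.deleteLock(server);
        // Removed only after the deletion succeeded.
        haltedServers.remove(server);
        return Action.LOCK_DELETED;
      } catch (Exception e) {
        // The entry stays in the map so a later pass can retry the deletion.
        return Action.LOCK_DELETE_FAILED;
      }
    } else {
      // The RPC halt is only attempted when the lock was not deleted.
      haltedServers.putIfAbsent(server, System.currentTimeMillis());
      try {
        rpc.halt(server);
        return Action.HALT_REQUESTED;
      } catch (Exception e) {
        return Action.HALT_FAILED;
      }
    }
  }
}
```

Keeping the map removal inside the try means a transient failure while deleting the lock does not lose track of a tserver that still needs to be halted.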

Apply PR naming feedback

2a85a39

Changes the naming for the property and some vars to better describe intent.

Switches halt logic to being time based

c37f1d2

Switches the halt logic to be time-based since number of attempts is based on general.rpc.timeout.

ddanielr changed the title ~~Consolidate the ServerLock deletion code~~ Manager now halts tservers via lock removal Jan 17, 2026

ddanielr mentioned this pull request Jan 17, 2026

Consolidates the ServerLock deletion code #6065

Open
