HDDS-14834. Fix race condition between DeadNodeHandler and HealthyReadOnlyNodeHandler on NetworkTopology by ivandika3 · Pull Request #9926 · apache/ozone

ivandika3 · 2026-03-15T02:41:45Z

What changes were proposed in this pull request?

DeadNodeHandler and HealthyReadOnlyNodeHandler run on separate SingleThreadExecutors, which can lead to a race condition where a resurrected datanode is removed from the NetworkTopology after being re-added. This leaves the node reachable but invisible to the placement policy.

Fix: DeadNodeHandler now checks the current node state before removing it from the topology, skipping removal if the node is no longer DEAD. HealthyReadOnlyNodeHandler uses unconditional add (idempotent) instead of a contains-then-add check, closing the TOCTOU gap.
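As a rough illustration of the two changes described above, here is a minimal sketch using simplified stand-ins. `TopologyRaceSketch`, `nodeHealth`, `topology` and the two `on...` methods are hypothetical names for this example only; they are not the SCM classes touched by the patch. It assumes that node state and topology membership can each be read and updated independently, as in the description.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for the two handlers; not the actual SCM code. */
class TopologyRaceSketch {
  enum Health { HEALTHY, HEALTHY_READONLY, STALE, DEAD }

  // Hypothetical stand-in for the NodeManager's view of node health.
  final Map<String, Health> nodeHealth = new ConcurrentHashMap<>();
  // Hypothetical stand-in for NetworkTopology membership.
  final Set<String> topology = ConcurrentHashMap.newKeySet();

  /** DeadNodeHandler side: re-check the current state before removing. */
  boolean onDeadNode(String dn) {
    if (nodeHealth.get(dn) != Health.DEAD) {
      return false; // node was resurrected; leave the topology alone
    }
    return topology.remove(dn);
  }

  /** HealthyReadOnlyNodeHandler side: unconditional, idempotent add. */
  void onHealthyReadOnlyNode(String dn) {
    topology.add(dn); // no-op if the node is already present
  }

  public static void main(String[] args) {
    TopologyRaceSketch scm = new TopologyRaceSketch();
    scm.topology.add("dn1");
    scm.nodeHealth.put("dn1", Health.DEAD);             // DEAD_NODE event is queued
    scm.nodeHealth.put("dn1", Health.HEALTHY_READONLY); // heartbeat resurrects the node
    scm.onHealthyReadOnlyNode("dn1");                   // re-add runs first
    boolean removed = scm.onDeadNode("dn1");            // stale DEAD event runs last
    System.out.println("removed=" + removed
        + ", inTopology=" + scm.topology.contains("dn1"));
  }
}
```

In this ordering the stale DEAD event no longer evicts the resurrected node. The state check and the removal are still two separate steps, so the sketch narrows the window rather than eliminating it.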

Made-with: Cursor

A small window for the race remains because there is no synchronization (e.g. a lock) between the two handlers, but it is much narrower than in the previous implementation.

Alternative approaches considered

Use a shared SingleThreadExecutor for both DeadNodeHandler and HealthyReadOnlyNodeHandler: this requires a large change in the SCM event framework and might delay event processing.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14834

How was this patch tested?

Unit tests (clean CI run: https://github.com/ivandika3/ozone/actions/runs/23047544006)

Gargi-jais11 · 2026-03-16T06:39:38Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java

+      // Only remove from topology if the node is still DEAD. Between the time
+      // the DEAD_NODE event was fired and now, the node may have been
+      // resurrected (DEAD -> HEALTHY_READONLY) via a heartbeat. Removing a
+      // resurrected node from the topology would leave it reachable but
+      // invisible to the placement policy.
+      NodeStatus currentStatus =
+          nodeManager.getNodeStatus(datanodeDetails);
+      if (currentStatus.getHealth() == HddsProtos.NodeState.DEAD) {


@ivandika3 I was wondering whether it would make sense to add an early check at the start of onMessage and return if the node is no longer DEAD. In the race where the node is resurrected before this handler runs, we would still run removeContainerReplicas, REPLICATION_MANAGER_NOTIFY, deletedBlockLog.onDatanodeDead, etc., which may not be appropriate for a resurrected node.

Good point, I added another check at the start. Technically, we would need to repeat the check before each of these actions, but that seems to be overkill.
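The trade-off discussed here can be sketched roughly as follows. `EarlyReturnSketch`, its `actions` list and the `Supplier`-based state lookup are assumptions made for illustration; the action names only echo the steps mentioned in the comment above and do not call any real SCM API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrative handler with an early state check; not the actual SCM code. */
class EarlyReturnSketch {
  enum Health { HEALTHY_READONLY, DEAD }

  // Records which steps ran, so the effect of each check is visible.
  final List<String> actions = new ArrayList<>();

  void onMessage(String dn, Supplier<Health> currentHealth) {
    // Early check: skip all cleanup if the node is no longer DEAD.
    if (currentHealth.get() != Health.DEAD) {
      actions.add("skip:" + dn);
      return;
    }
    actions.add("removeContainerReplicas:" + dn);
    actions.add("notifyReplicationManager:" + dn);
    // Second check right before the topology removal. The state may have
    // changed while the steps above were running.
    if (currentHealth.get() == Health.DEAD) {
      actions.add("removeFromTopology:" + dn);
    }
  }
}
```

Passing the state as a `Supplier` lets a caller simulate a resurrection between the two checks: the cleanup steps still run, but the topology removal is skipped. Checking before every single step would shrink the window further at the cost of more repetition, which is the overkill mentioned in the reply.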

priyeshkaratha

Changes overall LGTM. Please check a few minor comments.

priyeshkaratha · 2026-03-17T04:11:46Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java

+        NetworkTopology nt = nodeManager.getClusterNetworkTopologyMap();
+        if (nt.contains(datanodeDetails)) {
+          nt.remove(datanodeDetails);
+          Preconditions.checkState(


The call to nodeManager.getNode(datanodeDetails.getID()) could return null if the node is concurrently removed from the NodeManager while this handler is executing. This would lead to a NullPointerException when .getParent() is called, which could terminate the event handler thread.
It would be better to handle it like below:

    DatanodeDetails node = nodeManager.getNode(datanodeDetails.getID());
    if (node != null) {
      Preconditions.checkState(node.getParent() == null);
    }

Thanks. This was not introduced by this patch, but I updated it to guard against null.

Anyway, since SingleThreadExecutor#onEvent catches all exceptions, it should not cause the event handler thread to terminate.
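To illustrate the catch-all behaviour referred to here, the following is a generic sketch built on the JDK's `ExecutorService`. `CatchAllExecutorSketch` and its counters are hypothetical; this is not the SCM `SingleThreadExecutor` implementation, only an assumed shape of "catch the exception inside the task so later events still run".

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Generic catch-all event executor; not the actual SCM SingleThreadExecutor. */
class CatchAllExecutorSketch {
  final ExecutorService executor = Executors.newSingleThreadExecutor();
  final AtomicInteger handled = new AtomicInteger();
  final AtomicInteger failed = new AtomicInteger();

  /** Runs the handler on the single worker thread and absorbs failures. */
  void onEvent(Runnable handler) {
    executor.execute(() -> {
      try {
        handler.run();
        handled.incrementAndGet();
      } catch (Exception e) {
        // A real system would log here; the worker keeps processing events.
        failed.incrementAndGet();
      }
    });
  }

  /** Stops accepting events and waits briefly for queued ones to finish. */
  void shutdownAndWait() {
    executor.shutdown();
    try {
      executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

With this shape, a handler that hits a NullPointerException is counted as failed and the next queued event is still processed. The null guard suggested above remains useful, since it avoids the failed event in the first place.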

...dds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/HealthyReadOnlyNodeHandler.java

Gargi-jais11

Thanks @ivandika3 for updating the patch, it looks almost good. Just one minor change to make.

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DeadNodeHandler.java

Gargi-jais11

LGTM!

priyeshkaratha

Thanks @ivandika3 for updating the patch. Changes LGTM.

ivandika3 self-assigned this Mar 15, 2026

Gargi-jais11 reviewed Mar 16, 2026

priyeshkaratha reviewed Mar 17, 2026

ivandika3 added 4 commits March 17, 2026 18:34

Update comments

b4f1876

Reuse currentStatus

e4d4ad2

Standardize

2b0f233

Fix regression

78cd896

Gargi-jais11 reviewed Mar 18, 2026

Fix comment

acf7fa5

Gargi-jais11 approved these changes Mar 18, 2026

priyeshkaratha approved these changes Mar 18, 2026

adoroszlai requested a review from ChenSammi March 24, 2026 06:29

ivandika3 wants to merge 6 commits into apache:master from ivandika3:HDDS-14834

3 participants
