HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed. #9954

Open
devmadhuu wants to merge 2 commits into apache:master from devmadhuu:HDDS-14871

@devmadhuu
Contributor

What changes were proposed in this pull request?

This PR addresses the case where the health-check latch times out while some volumes have not yet reported a result.

StorageVolumeChecker.checkAllVolumes() waits on a single CountDownLatch for all volume health checks to complete. If the latch times out while checks are still pending, for example because of a transient stall, every pending volume is immediately marked FAILED with zero tolerance, producing false-positive volume failures.
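
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual StorageVolumeChecker code: the class `LatchTimeoutDemo`, the method `checkAll`, and all parameters are hypothetical and exist only for illustration. One simulated check stalls past the latch timeout, and the stalled volume ends up in the failed set even though no I/O error was observed.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the real StorageVolumeChecker.
class LatchTimeoutDemo {

  /**
   * Runs one simulated health check per volume and waits on a single latch.
   * Any volume whose check has not finished when the wait ends is returned
   * as failed, even though nothing was actually observed to be wrong.
   */
  static Set<String> checkAll(List<String> volumes, String slowVolume,
      long slowMillis, long timeoutMillis) {
    Set<String> healthy = ConcurrentHashMap.newKeySet();
    CountDownLatch latch = new CountDownLatch(volumes.size());
    ExecutorService pool =
        Executors.newFixedThreadPool(Math.max(1, volumes.size()));
    try {
      for (String v : volumes) {
        pool.submit(() -> {
          try {
            if (v.equals(slowVolume)) {
              Thread.sleep(slowMillis); // simulated transient stall
            }
            healthy.add(v);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            latch.countDown();
          }
        });
      }
      // A timed-out wait is not distinguished from a completed one here.
      latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow(); // interrupts the stalled check
    }
    // Zero tolerance: everything that did not report healthy is failed.
    Set<String> failed = new TreeSet<>(volumes);
    failed.removeAll(healthy);
    return failed;
  }
}
```

Calling `LatchTimeoutDemo.checkAll(List.of("vol1", "vol2", "vol3"), "vol2", 10_000L, 1_000L)` reports `vol2` as failed purely because its check was slow, which is the false positive this PR targets.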

The existing per-volume IO-failure sliding window in StorageVolume.check() does not address this because it only applies when a check completes, not when the latch times out.
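
One possible shape for tolerating latch timeouts is a per-volume counter of consecutive timeouts, sketched below. This is an assumption made for illustration and not necessarily the mechanism this PR implements: the class `LatchTimeoutTolerance`, its methods, and the `tolerated` threshold are all hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; names and policy are assumptions, not the PR's code.
class LatchTimeoutTolerance {
  // Number of consecutive latch timeouts a volume may have before failing.
  private final int tolerated;
  private final Map<String, Integer> consecutiveTimeouts =
      new ConcurrentHashMap<>();

  LatchTimeoutTolerance(int tolerated) {
    if (tolerated < 0) {
      throw new IllegalArgumentException("tolerated must be >= 0");
    }
    this.tolerated = tolerated;
  }

  /**
   * Records that the volume's check was still pending when the latch
   * timed out. Returns true once the streak exceeds the tolerance,
   * meaning the volume should now be treated as failed.
   */
  boolean onLatchTimeout(String volume) {
    int streak = consecutiveTimeouts.merge(volume, 1, Integer::sum);
    return streak > tolerated;
  }

  /** A check that completes, with any result, clears the timeout streak. */
  void onCheckCompleted(String volume) {
    consecutiveTimeouts.remove(volume);
  }
}
```

With `new LatchTimeoutTolerance(1)`, the first timeout for a volume is tolerated and only a second consecutive timeout marks it failed. Completed checks remain subject to whatever policy already applies to them, such as the existing IO-failure sliding window.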

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14871

How was this patch tested?

This patch was tested by extending three unit tests in the existing test class TestStorageVolumeHealthChecks.

@devmadhuu devmadhuu marked this pull request as ready for review March 23, 2026 06:20
