@Gargi-jais11
Contributor

What changes were proposed in this pull request?

  • EstimatedBytesToMove and EstimatedTimeLeft should not be shown when no container movement is happening.
  • Improve the threshold validation error message. When running the DiskBalancer update command with a threshold value of 100.0, the operation fails on all datanodes with the following error:
bash> ozone admin datanode diskbalancer update -t 100.0 --in-service-datanodes
Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100.

A threshold of 0 means that any deviation from ideal usage (even 0.01%) triggers container movement.

This leads to excessive, continuous balancing operations and results in unnecessary I/O overhead and resource consumption.
A threshold value can never be 100.0%, as that would mean allowing 100% of a disk's contents to be moved, effectively emptying one disk.
Suggested improvement:
The error message should instead clarify that both 0 and 100 are excluded. The validation is being updated to also exclude 0, requiring the threshold to be in the range (0, 100), exclusive.
New error message:

Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
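
To illustrate why the boundary values are problematic, here is a minimal sketch of how a threshold could gate container movement. It assumes a volume qualifies for balancing when its utilization deviates from the ideal utilization by more than the threshold. The class and method names (ThresholdGateSketch, needsBalancing) are hypothetical and are not taken from the DiskBalancer code.

```java
final class ThresholdGateSketch {

  // Assumption: balancing is triggered when the absolute deviation from the
  // ideal utilization (both in percent) exceeds the configured threshold.
  static boolean needsBalancing(double volumeUsedPercent,
      double idealUsedPercent, double thresholdPercent) {
    return Math.abs(volumeUsedPercent - idealUsedPercent) > thresholdPercent;
  }

  public static void main(String[] args) {
    // With a threshold of 0, even a 0.01% deviation triggers movement.
    System.out.println(needsBalancing(50.01, 50.0, 0.0));   // true
    // With a threshold of 100, no deviation can ever exceed it.
    System.out.println(needsBalancing(100.0, 0.0, 100.0));  // false
    // A value strictly between 0 and 100 tolerates small deviations.
    System.out.println(needsBalancing(50.01, 50.0, 10.0));  // false
  }
}
```

Under this assumption, 0 makes the balancer react to every deviation and 100 makes the check impossible to satisfy, which is why both ends are excluded.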

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14110

How was this patch tested?

Added a check for estimatedBytes in the unit test TestDiskBalancerService.
Tested manually.
Before the patch:

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-5.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               638                2                   
ozone-datanode-3.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-4.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-2.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               698                2                   
ozone-datanode-1.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               3                  1                   

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
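
The note above states only that the estimate is derived from the estimated bytes to move and the configured bandwidth. As a rough sketch, a ceiling-based formula is consistent with the values in the table (for example, 638 MB at a bandwidth of 10 giving 2 minutes), assuming BandwidthInMB means MB per second. The unit, the rounding, and the names below (EstimatedTimeSketch, estimateMinutesLeft) are assumptions, not the project's implementation.

```java
final class EstimatedTimeSketch {

  // Assumption: bandwidth is in MB per second and the result is rounded up
  // to whole minutes, so any non-zero amount of work reports at least 1.
  static long estimateMinutesLeft(long estBytesToMoveMB, long bandwidthMBPerSec) {
    if (estBytesToMoveMB <= 0 || bandwidthMBPerSec <= 0) {
      return 0L; // nothing to move, or no usable bandwidth value
    }
    double seconds = (double) estBytesToMoveMB / bandwidthMBPerSec;
    return (long) Math.ceil(seconds / 60.0);
  }

  public static void main(String[] args) {
    System.out.println(estimateMinutesLeft(638, 10)); // 2
    System.out.println(estimateMinutesLeft(1, 10));   // 1
    System.out.println(estimateMinutesLeft(0, 10));   // 0
  }
}
```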

After the code changes, the output is fixed:

bash-5.1$ ozone admin datanode diskbalancer report --in-service-datanodes
Report result:
Datanode                                           VolumeDensity
ozone-datanode-2.ozone_default                     8.413243594594944E-4
ozone-datanode-5.ozone_default                     8.296842069073773E-4
ozone-datanode-3.ozone_default                     7.682500684380311E-4
ozone-datanode-1.ozone_default                     7.585499413112762E-4
ozone-datanode-4.ozone_default                     7.507898396098833E-4

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-1.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               0                  0                   
ozone-datanode-4.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               0                  0                   
ozone-datanode-3.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               0                  0                   
ozone-datanode-5.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               0                  0                   
ozone-datanode-2.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               0                  0                   

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.

Threshold error output:

bash-5.1$ ozone admin datanode diskbalancer start -t 0 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 100 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 0.001 --in-service-datanodes
Started DiskBalancer on all IN_SERVICE nodes.

Gargi-jais11 marked this pull request as ready for review December 9, 2025 08:37
ChenSammi requested a review from Copilot December 9, 2025 09:03
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses two issues in the DiskBalancer service: (1) preventing the display of EstimatedBytesToMove and EstimatedTimeLeft when no container movement is occurring, and (2) improving threshold validation to exclude boundary values 0 and 100 with a clearer error message.

Key Changes:

  • Updated threshold validation to exclude 0 and 100 (changed from < 0d to <= 0d), preventing edge cases that would cause excessive or meaningless balancing
  • Modified getDiskBalancerInfo() to only calculate and report bytesToMove when containers are actively being balanced (RUNNING state AND non-empty inProgressContainers)
  • Enhanced error message to clarify that the threshold range is exclusive: "Threshold must be a percentage(double) in the range 0 to 100 both exclusive."
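
The second change above can be sketched as follows. This is a simplified, standalone illustration of the gating condition only. The names (BytesToMoveSketch, Status, reportedBytesToMove) and the use of a LongSupplier for the calculation are hypothetical and do not mirror the actual getDiskBalancerInfo() code.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.LongSupplier;

final class BytesToMoveSketch {

  enum Status { RUNNING, STOPPED }

  // Report bytesToMove only while containers are actively being balanced:
  // the service must be RUNNING and have at least one in-progress container.
  // Otherwise the calculation is skipped and 0 is reported.
  static long reportedBytesToMove(Status status,
      Set<Long> inProgressContainers, LongSupplier calculateBytesToMove) {
    boolean activelyBalancing =
        status == Status.RUNNING && !inProgressContainers.isEmpty();
    return activelyBalancing ? calculateBytesToMove.getAsLong() : 0L;
  }

  public static void main(String[] args) {
    LongSupplier calc = () -> 638L; // stand-in for the real estimate
    // RUNNING but idle: nothing is reported.
    System.out.println(
        reportedBytesToMove(Status.RUNNING, Collections.<Long>emptySet(), calc));
    // RUNNING with a container in progress: the estimate is reported.
    System.out.println(
        reportedBytesToMove(Status.RUNNING, Collections.singleton(1L), calc));
  }
}
```

With this gating, an idle but RUNNING balancer reports 0 for both the estimated bytes and, consequently, the estimated time left, matching the "after" status output in the description.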

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File and description:

  • hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java: Updated threshold validation to exclude 0 and 100, and improved error message clarity
  • hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java: Added check for non-empty inProgressContainers before calculating bytesToMove and updated comments
  • hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java: Added test coverage for the new inProgressContainers check in getDiskBalancerInfo()

Comments suppressed due to low confidence (1)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java:724

  public Set<ContainerID> getInProgressContainers() {

Comment on lines +203 to 206
+    if (threshold <= 0d || threshold >= 100d) {
       throw new IllegalArgumentException(
-          "Threshold must be a percentage(double) in the range 0 to 100.");
+          "Threshold must be a percentage(double) in the range 0 to 100 both exclusive.");
     }

Copilot AI Dec 9, 2025

The threshold validation logic has been changed to exclude both 0 and 100 (using <= 0d instead of < 0d). However, there are no tests validating this boundary condition. Consider adding test cases to verify that:

  1. Threshold values of 0 and 100 are rejected with the appropriate error message
  2. Valid threshold values like 0.001 and 99.999 are accepted

This will ensure the validation logic works correctly and prevent regressions.
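
The suggested boundary checks could look roughly like the sketch below. It is a standalone illustration that copies the condition and message from the diff into a hypothetical validateThreshold method. It does not use the project's DiskBalancerConfiguration class or its test framework, so an actual test would need to be adapted to TestDiskBalancerService or a similar test class.

```java
final class ThresholdBoundarySketch {

  static final String MESSAGE =
      "Threshold must be a percentage(double) in the range 0 to 100 both exclusive.";

  // Hypothetical standalone copy of the validation shown in the diff.
  static void validateThreshold(double threshold) {
    if (threshold <= 0d || threshold >= 100d) {
      throw new IllegalArgumentException(MESSAGE);
    }
  }

  // Returns true only if the value is rejected with the expected message.
  static boolean isRejected(double threshold) {
    try {
      validateThreshold(threshold);
      return false;
    } catch (IllegalArgumentException e) {
      return MESSAGE.equals(e.getMessage());
    }
  }

  public static void main(String[] args) {
    double[] candidates = {0d, 100d, 0.001d, 99.999d};
    for (double t : candidates) {
      System.out.println(t + " rejected: " + isRejected(t));
    }
  }
}
```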

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
