HDDS-14862. Log volume failures and database errors as errors#9950

Draft
ptlrs wants to merge 1 commit into apache:master from ptlrs:HDDS-14862-Log-volume-failures-as-errors

@ptlrs ptlrs commented Mar 19, 2026

What changes were proposed in this pull request?

Volume failures are currently logged at INFO level. They should be logged at ERROR level: a failed volume is a real problem in the system, and searching the log files for ERROR entries should surface it.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14862

How was this patch tested?

CI: https://github.com/ptlrs/ozone/actions/runs/23277693312

@adoroszlai adoroszlai left a comment

Thanks @ptlrs for working on this.

  dbLoaded.set(false);
  dbLoadFailure.set(false);
- LOG.info("SchemaV3 db is stopped at {} for volume {}", containerDBPath,
+ LOG.warn("SchemaV3 db is stopped at {} for volume {}", containerDBPath,

closeDbStore() is also executed during normal shutdown, so this shouldn't be a warning. In the failure case, a warning is already logged by the callers of failVolume.
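The point above can be illustrated with a minimal sketch. It is not the Ozone code: the class `VolumeLogLevelSketch` and the methods `shutdown()`, `onCheckFailed()` and `capture()` are hypothetical, and it uses `java.util.logging` (where `WARNING` plays the role of `warn`) instead of the SLF4J logger seen in the diff, purely to stay self-contained. A shared cleanup method runs on both paths, so it logs at INFO, while the failure path's caller adds the higher-level message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical stand-in for a volume; not the actual Ozone HddsVolume class.
public class VolumeLogLevelSketch {
  private static final Logger LOG = Logger.getLogger("VolumeLogLevelSketch");

  private final String volumeRoot;

  public VolumeLogLevelSketch(String volumeRoot) {
    this.volumeRoot = volumeRoot;
  }

  // Shared cleanup: runs on normal shutdown and on failure, so it stays at INFO.
  void closeDbStore() {
    LOG.log(Level.INFO, "db is stopped for volume {0}", volumeRoot);
  }

  // Normal shutdown: nothing is wrong, so only the INFO line is emitted.
  public void shutdown() {
    closeDbStore();
  }

  // Failure path: the caller knows the cause and logs it at a higher level
  // before running the shared cleanup.
  public void onCheckFailed(Exception cause) {
    LOG.log(Level.WARNING, "Volume " + volumeRoot + " failed health check", cause);
    closeDbStore();
  }

  // Runs the action and returns the levels of the log records it emitted.
  public static List<Level> capture(Runnable action) {
    List<Level> levels = new ArrayList<>();
    Handler handler = new Handler() {
      @Override public void publish(LogRecord logRecord) { levels.add(logRecord.getLevel()); }
      @Override public void flush() { }
      @Override public void close() { }
    };
    LOG.setUseParentHandlers(false); // keep the demo output free of console log lines
    LOG.addHandler(handler);
    try {
      action.run();
    } finally {
      LOG.removeHandler(handler);
    }
    return levels;
  }

  public static void main(String[] args) {
    VolumeLogLevelSketch volume = new VolumeLogLevelSketch("/data/disk1");
    // prints [INFO]
    System.out.println(capture(volume::shutdown));
    // prints [WARNING, INFO]
    System.out.println(capture(() -> volume.onCheckFailed(new java.io.IOException("disk error"))));
  }
}
```

Raising the level inside `closeDbStore()` itself would turn every clean shutdown into a warning, which is the concern raised in this comment.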

  volumeHealthMetrics.decrementHealthyVolumes();
  volumeHealthMetrics.incrementFailedVolumes();
- LOG.info("Moving Volume : {} to failed Volumes", volumeRoot);
+ LOG.error("Moving Volume : {} to failed Volumes", volumeRoot);

This is not an error. Callers of failVolume already log at a higher level.

@errose28 errose28 self-requested a review March 23, 2026 16:23

@errose28 errose28 left a comment

Thanks for looking into this @ptlrs.

  Time.monotonicNowNanos() - start);
  } catch (Exception e) {
- LOG.warn("compact rocksdb error in {}", dbFilePath, e);
+ LOG.error("compact rocksdb error in {}", dbFilePath, e);

This could just be a transient I/O error, which happens sometimes. Since we don't act on the failure, I think the original warning level makes sense.

I think anything that results in a failed volume check should be logged at the error level. There are two such cases in HddsVolume#check that we can elevate from warn to error.
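The rule suggested in this thread (error when the volume check fails, warning when the problem is not acted on) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `VolumeCheckSketch`, `levelFor()`, `check()` and `compact()` are hypothetical names that do not mirror the real `HddsVolume#check`, and `java.util.logging` is used in place of SLF4J, with `SEVERE` standing in for `error` and `WARNING` for `warn`.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; names here do not mirror the real HddsVolume API.
public class VolumeCheckSketch {
  private static final Logger LOG = Logger.getLogger("VolumeCheckSketch");

  public enum Result { HEALTHY, FAILED }

  // Chooses the level from the consequence: a problem that fails the volume
  // check is an error, a problem that is tolerated is a warning.
  public static Level levelFor(boolean failsVolumeCheck) {
    return failsVolumeCheck ? Level.SEVERE : Level.WARNING;
  }

  // Hypothetical check: a DB that failed to load fails the whole volume.
  public static Result check(String volumeRoot, boolean dbLoadFailed) {
    if (dbLoadFailed) {
      LOG.log(levelFor(true), "Volume {0} failed health check: DB failed to load", volumeRoot);
      return Result.FAILED;
    }
    return Result.HEALTHY;
  }

  // Hypothetical compaction wrapper: the error may be transient and nothing
  // acts on it, so it stays at the warning level.
  public static boolean compact(String dbFilePath, Runnable compaction) {
    try {
      compaction.run();
      return true;
    } catch (RuntimeException e) {
      LOG.log(levelFor(false), "compact rocksdb error in " + dbFilePath, e);
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(check("/data/disk1", true));
    System.out.println(compact("/data/disk1/container.db", () -> {
      throw new IllegalStateException("simulated transient I/O error");
    }));
  }
}
```

Tying the level to the outcome of the check keeps a search for errors in the logs aligned with volumes that were actually marked failed, without promoting tolerated problems such as a failed compaction.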

3 participants