Skip to content

Conversation

@sreejasahithi
Copy link
Contributor

@sreejasahithi sreejasahithi commented Dec 1, 2025

What changes were proposed in this pull request?

This patch introduces a new dashboard "SCM Safemode" in Grafana which contains a chart for each safemode rule displaying its target and actual value. It also displays if the SCM is in safemode or not by a binary value i.e 1 if SCM is in safemode, and 0 if it is out of safemode.

What is the link to the Apache JIRA

HDDS-14039

How was this patch tested?

Green CI : https://github.com/sreejasahithi/ozone/actions/runs/19803545209

Tested over docker cluster:
Screenshot 2025-12-05 at 11 26 10 PM

Screenshot 2025-12-05 at 11 45 19 PM

@jojochuang jojochuang self-requested a review December 1, 2025 17:22
@jojochuang
Copy link
Contributor

@rnblough

@jojochuang
Copy link
Contributor

IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1.

@sreejasahithi
Copy link
Contributor Author

@errose28 could you please review this PR.

@rnblough
Copy link
Contributor

rnblough commented Dec 3, 2025

While clear to me, I expect a little confusion from the graph being labelled Binary but going up to 2. Can the Binary axis be limited to 1?

@sumitagrawl
Copy link
Contributor

sumitagrawl commented Dec 5, 2025

@sreejasahithi These metrics are applicable only during startup for safemode exit info. May be we do not need a dashboard for this. While debug, we can do JMX query to know the status.
cc: @errose28

@errose28
Copy link
Contributor

errose28 commented Dec 5, 2025

This is essential for Ozone cluster admins/operators to monitor how long it takes their SCM to come out of safemode, especially when doing a rolling restart. Raw jmx queries are poor for usability and do not show trends over time. The dashboard is in its own file and can be ignored without harm when it is not needed.

Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @sreejasahithi, left a few suggestions below.

private @Metric MutableCounterLong
currentPipelinesWithAtleastOneReplicaReportedCount;

@Metric private MutableGaugeLong scmInSafeMode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a description for the metric? There are many instances in the repo where you can find (@Metric("explanation"))
As we are converting boolean to numeric here, a description would be better
Also can we change it to MutableGaugeInt instead?

"links": [],
"panels": [
{
"collapsed": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line makes it such that the row (or section) is collapsed, i.e., the panels are not visible you would have to click on the row's heading for it to open up and show the panels.
I would suggest this to be set to false.


private void emitSafeModeStatus() {
final SafeModeStatus safeModeStatus = status.get();
safeModeMetrics.setScmInSafeMode(safeModeStatus.isInSafeMode());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some assert lines in the TestSCMSafeModeManager.java for this metric for test coverage?

registeredDns = registeredDnSet.size();

if (added) {
getSafeModeMetrics().incCurrentRegisteredDatanodesCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add test coverage for this metric in TestDataNodeSafeModeRule.java? In the existing tests itself you can just assert the right value.

@Tejaskriya
Copy link
Contributor

Also, could you check why the CI seems to be failing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants