-
Notifications
You must be signed in to change notification settings - Fork 581
HDDS-14039. Create Grafana dashboard for Ozone SCM safemode rules and exit #9400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1. |
|
@errose28 could you please review this PR. |
|
While clear to me, I expect a little confusion from the graph being labelled Binary but going up to 2. Can the Binary axis be limited to 1? |
|
@sreejasahithi These metrics are applicable only during startup for safemode exit info. May be we do not need a dashboard for this. While debug, we can do JMX query to know the status. |
|
This is essential for Ozone cluster admins/operators to monitor how long it takes their SCM to come out of safemode, especially when doing a rolling restart. Raw jmx queries are poor for usability and do not show trends over time. The dashboard is in its own file and can be ignored without harm when it is not needed. |
Tejaskriya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @sreejasahithi, left a few suggestions below.
| private @Metric MutableCounterLong | ||
| currentPipelinesWithAtleastOneReplicaReportedCount; | ||
|
|
||
| @Metric private MutableGaugeLong scmInSafeMode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a description for the metric? There are many instances in the repo where you can find (@Metric("explanation"))
As we are converting boolean to numeric here, a description would be better
Also can we change it to MutableGaugeInt instead?
| "links": [], | ||
| "panels": [ | ||
| { | ||
| "collapsed": true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line makes it such that the row (or section) is collapsed, i.e., the panels are not visible you would have to click on the row's heading for it to open up and show the panels.
I would suggest this to be set to false.
|
|
||
| private void emitSafeModeStatus() { | ||
| final SafeModeStatus safeModeStatus = status.get(); | ||
| safeModeMetrics.setScmInSafeMode(safeModeStatus.isInSafeMode()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some assert lines in the TestSCMSafeModeManager.java for this metric for test coverage?
| registeredDns = registeredDnSet.size(); | ||
|
|
||
| if (added) { | ||
| getSafeModeMetrics().incCurrentRegisteredDatanodesCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add test coverage for this metric in TestDataNodeSafeModeRule.java? In the existing tests itself you can just assert the right value.
|
Also, could you check why the CI seems to be failing? |
What changes were proposed in this pull request?
This patch introduces a new dashboard "SCM Safemode" in Grafana which contains a chart for each safemode rule displaying its target and actual value. It also displays if the SCM is in safemode or not by a binary value i.e 1 if SCM is in safemode, and 0 if it is out of safemode.
What is the link to the Apache JIRA
HDDS-14039
How was this patch tested?
Green CI : https://github.com/sreejasahithi/ozone/actions/runs/19803545209
Tested over docker cluster:
