Fix smartmon health status reporting by technowhizz · Pull Request #2322 · stackhpc/stackhpc-kayobe-config

technowhizz · 2026-05-22T13:05:37Z

The smartmon exporter now treats historical SMART temperature and airflow threshold breaches as non-critical when calculating the smartmon_device_smart_healthy metric. This prevents disks with only a past over-temperature event from being reported as actively unhealthy.

gemini-code-assist

Code Review

This pull request updates the smartmon exporter to treat historical SMART temperature and airflow threshold breaches as non-critical when calculating the device health metric. It also introduces a new metric, smartmon_device_historical_temperature_failure, to allow these events to be monitored separately. The review feedback suggests optimizing the implementation by passing pre-computed failed attributes to the health evaluation function to avoid redundant processing and simplifying the logic for deriving the new metric.

gemini-code-assist · 2026-05-22T13:07:15Z

+def smart_health_value(device):
+    """
+    Convert pySMART assessment into the exported healthy metric.
+
+    PASS is healthy. WARN is also treated as healthy only when every failed
+    attribute is a historical temperature/airflow threshold breach. Other WARN
+    states, FAIL states, current failures, and non-temperature historical
+    failures remain unhealthy.
+    """
+    assessment = str(device.assessment or "").strip().upper()
+
+    if assessment == "PASS":
+        return 1
+
+    if assessment != "WARN":
+        return 0
+
+    failed_attrs = get_failed_smart_attributes(device)
+    if not failed_attrs:
+        return 0
+
+    if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs):
+        return 1
+
+    return 0
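
The hunk above calls two helpers, get_failed_smart_attributes and is_historical_temperature_attr_failure, whose bodies are not shown in this excerpt. Below is a minimal sketch of what they might look like, assuming each attribute exposes a `name` and a `when_failed` field in the style of smartctl's attribute table. The `Attr` and `FakeDevice` stand-ins, the set of attribute names, and the `"In_the_past"` marker are illustrative assumptions, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for one SMART attribute row; not the pySMART class.
@dataclass
class Attr:
    name: str
    when_failed: str = "-"  # assumed values: "-", "In_the_past", "FAILING_NOW"


# Hypothetical stand-in for a device object with an attribute table.
@dataclass
class FakeDevice:
    assessment: Optional[str]
    attributes: List[Optional[Attr]]


# Assumed names for the temperature/airflow attributes (lower-cased).
TEMPERATURE_ATTR_NAMES = {"temperature_celsius", "airflow_temperature_cel"}


def get_failed_smart_attributes(device):
    """Return the attributes whose WHEN_FAILED value is set (sketch)."""
    return [
        attr
        for attr in (device.attributes or [])
        if attr is not None and attr.when_failed not in ("", "-")
    ]


def is_historical_temperature_attr_failure(attribute):
    """True only for a past, not current, temperature/airflow breach (sketch)."""
    return (
        attribute.name.lower() in TEMPERATURE_ATTR_NAMES
        and attribute.when_failed.lower() == "in_the_past"
    )
```

Under these assumptions, a temperature attribute that is failing now, or a non-temperature attribute that failed in the past, does not count as a historical temperature failure, so the device stays unhealthy.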


The smart_health_value function can be optimized by accepting failed_attrs as an argument. This avoids redundant calls to get_failed_smart_attributes when this function is used in conjunction with other logic that also requires the list of failed attributes, as seen in parse_device_info.

def smart_health_value(device, failed_attrs):
    """
    Convert pySMART assessment into the exported healthy metric.

    PASS is healthy. WARN is also treated as healthy only when every failed
    attribute is a historical temperature/airflow threshold breach. Other WARN
    states, FAIL states, current failures, and non-temperature historical
    failures remain unhealthy.
    """
    assessment = str(device.assessment or "").strip().upper()

    if assessment == "PASS":
        return 1

    if assessment != "WARN":
        return 0

    if not failed_attrs:
        return 0

    if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs):
        return 1

    return 0
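
To make the intended behaviour of the two-argument form concrete, here is a small table-driven sketch. It uses `SimpleNamespace` objects in place of real devices and attributes, and a stub predicate keyed on hypothetical `kind` and `when` fields. None of these stand-ins come from the PR; they exist only so the example runs on its own.

```python
from types import SimpleNamespace


def is_historical_temperature_attr_failure(attribute):
    # Stub predicate for illustration only; the real helper is not shown here.
    return attribute.kind == "temp" and attribute.when == "past"


def smart_health_value(device, failed_attrs):
    """Same decision logic as the suggestion, with failed_attrs passed in."""
    assessment = str(device.assessment or "").strip().upper()
    if assessment == "PASS":
        return 1
    if assessment != "WARN" or not failed_attrs:
        return 0
    return int(all(is_historical_temperature_attr_failure(a) for a in failed_attrs))


past_temp = SimpleNamespace(kind="temp", when="past")
past_realloc = SimpleNamespace(kind="realloc", when="past")

# (assessment, failed attributes, expected healthy value)
cases = [
    ("PASS", [], 1),                         # healthy outright
    ("WARN", [past_temp], 1),                # only a historical temperature breach
    ("WARN", [past_temp, past_realloc], 0),  # mixed failures stay unhealthy
    ("WARN", [], 0),                         # WARN with nothing to explain it
    ("FAIL", [past_temp], 0),                # FAIL is never healthy
    (None, [], 0),                           # missing assessment
]

results = [
    smart_health_value(SimpleNamespace(assessment=assessment), failed)
    for assessment, failed, _expected in cases
]
```

Because the caller supplies `failed_attrs`, the list is computed once and can be reused for any other metric derived from it.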

gemini-code-assist · 2026-05-22T13:07:15Z

+            is_healthy = smart_health_value(device)
            metrics.append(
                f'smartmon_device_smart_healthy{{{metric_labels}}} {float(is_healthy)}'
            )
+            failed_attrs = get_failed_smart_attributes(device)
+            historical_temperature_attr_failure = 1 if failed_attrs and all(
+                is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs
+            ) else 0
+            metrics.append(
+                f'smartmon_device_historical_temperature_failure{{{metric_labels}}} {float(historical_temperature_attr_failure)}'
+            )


The logic here can be simplified and optimized. By passing the already computed failed_attrs to the updated smart_health_value function, we avoid a redundant call. Additionally, the historical_temperature_attr_failure metric can be derived directly from the is_healthy status and the assessment string, as is_healthy is only 1 for a WARN state if all failures are historical temperature/airflow breaches.

assessment_upper = str(device.assessment or "").strip().upper()
failed_attrs = get_failed_smart_attributes(device)
is_healthy = smart_health_value(device, failed_attrs)
metrics.append(
    f'smartmon_device_smart_healthy{{{metric_labels}}} {float(is_healthy)}'
)
historical_temperature_attr_failure = 1 if is_healthy == 1 and assessment_upper == "WARN" else 0
metrics.append(
    f'smartmon_device_historical_temperature_failure{{{metric_labels}}} {float(historical_temperature_attr_failure)}'
)
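
Two details of this suggestion are easy to get wrong, so a short sketch may help. First, inside an f-string `{{` and `}}` emit literal braces, so `{{{metric_labels}}}` renders the labels wrapped in braces, whereas `{{metric_labels}}` would emit the literal text `{metric_labels}`. Second, the historical-temperature flag is derived from the health value and the assessment instead of re-scanning the attributes. The helper names and the label string below are hypothetical and chosen only for illustration.

```python
def render_metric(name, metric_labels, value):
    # "{{" and "}}" produce literal braces; the inner {metric_labels} is interpolated.
    return f'{name}{{{metric_labels}}} {float(value)}'


def historical_temperature_flag(assessment, is_healthy):
    # Derived form from the review: a WARN device is only reported healthy when
    # every failed attribute was a historical temperature/airflow breach.
    assessment_upper = str(assessment or "").strip().upper()
    return 1 if is_healthy == 1 and assessment_upper == "WARN" else 0


labels = 'disk="/dev/sda",type="sat"'  # hypothetical label set

line = render_metric("smartmon_device_smart_healthy", labels, 1)

# With only double braces the labels are never interpolated.
broken = f'smartmon_device_smart_healthy{{metric_labels}} {float(1)}'

flags = [
    historical_temperature_flag("WARN", 1),  # healthy WARN: historical breach only
    historical_temperature_flag("PASS", 1),  # healthy PASS: nothing to flag
    historical_temperature_flag("WARN", 0),  # unhealthy WARN: other failures present
]
```

One design consequence worth noting: with the derived form the flag can only be 1 when the assessment is WARN, so it follows the health decision exactly and no longer depends on a second pass over the attributes.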

Fix smartmon health status reporting

f480e4a

technowhizz requested a review from dougszumski May 22, 2026 13:05

technowhizz self-assigned this May 22, 2026

technowhizz requested a review from a team as a code owner May 22, 2026 13:05

gemini-code-assist Bot reviewed May 22, 2026


technowhizz temporarily deployed to SMS Lab May 22, 2026 13:08 — with GitHub Actions Inactive

dougszumski approved these changes May 22, 2026


Alex-Welsh added the waiting-author-response label May 22, 2026

technowhizz wants to merge 1 commit into stackhpc/2025.1 from smartmon-historical-fails

3 participants
