Skip to content

Fix smartmon health status reporting#2322

Open
technowhizz wants to merge 1 commit into
stackhpc/2025.1from
smartmon-historical-fails
Open

Fix smartmon health status reporting#2322
technowhizz wants to merge 1 commit into
stackhpc/2025.1from
smartmon-historical-fails

Conversation

@technowhizz
Copy link
Copy Markdown
Contributor

@technowhizz technowhizz commented May 22, 2026

The smartmon exporter now treats historical SMART temperature and airflow threshold breaches as non-critical when calculating the smartmon_device_smart_healthy metric. This prevents disks with only a past over-temperature event from being reported as actively unhealthy.

@technowhizz technowhizz requested a review from dougszumski May 22, 2026 13:05
@technowhizz technowhizz self-assigned this May 22, 2026
@technowhizz technowhizz requested a review from a team as a code owner May 22, 2026 13:05
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the smartmon exporter to treat historical SMART temperature and airflow threshold breaches as non-critical when calculating the device health metric. It also introduces a new metric, smartmon_device_historical_temperature_failure, to allow these events to be monitored separately. The review feedback suggests optimizing the implementation by passing pre-computed failed attributes to the health evaluation function to avoid redundant processing and simplifying the logic for deriving the new metric.

Comment on lines +189 to +213
def smart_health_value(device):
"""
Convert pySMART assessment into the exported healthy metric.

PASS is healthy. WARN is also treated as healthy only when every failed
attribute is a historical temperature/airflow threshold breach. Other WARN
states, FAIL states, current failures, and non-temperature historical
failures remain unhealthy.
"""
assessment = str(device.assessment or "").strip().upper()

if assessment == "PASS":
return 1

if assessment != "WARN":
return 0

failed_attrs = get_failed_smart_attributes(device)
if not failed_attrs:
return 0

if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs):
return 1

return 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The smart_health_value function can be optimized by accepting failed_attrs as an argument. This avoids redundant calls to get_failed_smart_attributes when this function is used in conjunction with other logic that also requires the list of failed attributes, as seen in parse_device_info.

def smart_health_value(device, failed_attrs):
    """
    Convert pySMART assessment into the exported healthy metric.

    PASS is healthy. WARN is also treated as healthy only when every failed
    attribute is a historical temperature/airflow threshold breach. Other WARN
    states, FAIL states, current failures, and non-temperature historical
    failures remain unhealthy.
    """
    assessment = str(device.assessment or "").strip().upper()

    if assessment == "PASS":
        return 1

    if assessment != "WARN":
        return 0

    if not failed_attrs:
        return 0

    if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs):
        return 1

    return 0

Comment on lines +303 to +313
is_healthy = smart_health_value(device)
metrics.append(
f'smartmon_device_smart_healthy{{{metric_labels}}} {float(is_healthy)}'
)
failed_attrs = get_failed_smart_attributes(device)
historical_temperature_attr_failure = 1 if failed_attrs and all(
is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs
) else 0
metrics.append(
f'smartmon_device_historical_temperature_failure{{{metric_labels}}} {float(historical_temperature_attr_failure)}'
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic here can be simplified and optimized. By passing the already computed failed_attrs to the updated smart_health_value function, we avoid a redundant call. Additionally, the historical_temperature_attr_failure metric can be derived directly from the is_healthy status and the assessment string, as is_healthy is only 1 for a WARN state if all failures are historical temperature/airflow breaches.

            assessment_upper = str(device.assessment).strip().upper()
            failed_attrs = get_failed_smart_attributes(device)
            is_healthy = smart_health_value(device, failed_attrs)
            metrics.append(
                f'smartmon_device_smart_healthy{{metric_labels}} {float(is_healthy)}'
            )
            historical_temperature_attr_failure = 1 if is_healthy == 1 and assessment_upper == "WARN" else 0
            metrics.append(
                f'smartmon_device_historical_temperature_failure{{metric_labels}} {float(historical_temperature_attr_failure)}'
            )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants