Fix smartmon health status reporting#2322
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the smartmon exporter to treat historical SMART temperature and airflow threshold breaches as non-critical when calculating the device health metric. It also introduces a new metric, smartmon_device_historical_temperature_failure, to allow these events to be monitored separately. The review feedback suggests optimizing the implementation by passing pre-computed failed attributes to the health evaluation function to avoid redundant processing and simplifying the logic for deriving the new metric.
| def smart_health_value(device): | ||
| """ | ||
| Convert pySMART assessment into the exported healthy metric. | ||
|
|
||
| PASS is healthy. WARN is also treated as healthy only when every failed | ||
| attribute is a historical temperature/airflow threshold breach. Other WARN | ||
| states, FAIL states, current failures, and non-temperature historical | ||
| failures remain unhealthy. | ||
| """ | ||
| assessment = str(device.assessment or "").strip().upper() | ||
|
|
||
| if assessment == "PASS": | ||
| return 1 | ||
|
|
||
| if assessment != "WARN": | ||
| return 0 | ||
|
|
||
| failed_attrs = get_failed_smart_attributes(device) | ||
| if not failed_attrs: | ||
| return 0 | ||
|
|
||
| if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs): | ||
| return 1 | ||
|
|
||
| return 0 |
There was a problem hiding this comment.
The smart_health_value function can be optimized by accepting failed_attrs as an argument. This avoids redundant calls to get_failed_smart_attributes when this function is used in conjunction with other logic that also requires the list of failed attributes, as seen in parse_device_info.
def smart_health_value(device, failed_attrs):
"""
Convert pySMART assessment into the exported healthy metric.
PASS is healthy. WARN is also treated as healthy only when every failed
attribute is a historical temperature/airflow threshold breach. Other WARN
states, FAIL states, current failures, and non-temperature historical
failures remain unhealthy.
"""
assessment = str(device.assessment or "").strip().upper()
if assessment == "PASS":
return 1
if assessment != "WARN":
return 0
if not failed_attrs:
return 0
if all(is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs):
return 1
return 0| is_healthy = smart_health_value(device) | ||
| metrics.append( | ||
| f'smartmon_device_smart_healthy{{{metric_labels}}} {float(is_healthy)}' | ||
| ) | ||
| failed_attrs = get_failed_smart_attributes(device) | ||
| historical_temperature_attr_failure = 1 if failed_attrs and all( | ||
| is_historical_temperature_attr_failure(attribute) for attribute in failed_attrs | ||
| ) else 0 | ||
| metrics.append( | ||
| f'smartmon_device_historical_temperature_failure{{{metric_labels}}} {float(historical_temperature_attr_failure)}' | ||
| ) |
There was a problem hiding this comment.
The logic here can be simplified and optimized. By passing the already computed failed_attrs to the updated smart_health_value function, we avoid a redundant call. Additionally, the historical_temperature_attr_failure metric can be derived directly from the is_healthy status and the assessment string, as is_healthy is only 1 for a WARN state if all failures are historical temperature/airflow breaches.
assessment_upper = str(device.assessment).strip().upper()
failed_attrs = get_failed_smart_attributes(device)
is_healthy = smart_health_value(device, failed_attrs)
metrics.append(
f'smartmon_device_smart_healthy{{metric_labels}} {float(is_healthy)}'
)
historical_temperature_attr_failure = 1 if is_healthy == 1 and assessment_upper == "WARN" else 0
metrics.append(
f'smartmon_device_historical_temperature_failure{{metric_labels}} {float(historical_temperature_attr_failure)}'
)
The smartmon exporter now treats historical SMART temperature and airflow threshold breaches as non-critical when calculating the smartmon_device_smart_healthy metric. This prevents disks with only a past over-temperature event from being reported as actively unhealthy.