disk-smart: rewrite with JSON parsing, NVMe support, and split into specialized checks #1050

@markuslf

The current disk-smart plugin has several limitations:

  • Text-only parsing of smartctl output, fragile and dependent on exact formatting
  • No NVMe support
  • No JSON support (smartctl has offered --json output since 7.0, released 2018)
  • Hardcoded attribute checks via if/elif chains instead of a data-driven attribute database
  • ~900 lines in a single file with high cyclomatic complexity
  • When the error log triggers CRIT (which can happen quite often), additional SMART attribute issues go unnoticed because the check is already in CRIT status
  • Missing values in output, e.g. an empty model family: "* sda (, Samsung SSD 883 DCT 1.92TB, SerNo 12345678)" (see #291, "disk-smart: Missing value in output")
  • Unhelpful error messages on smartctl failures, and no option to show the smartctl output for debugging (see #671, "disk-smart: Show smartctl output on failure")

Proposed changes

Split into specialized checks, e.g.:

  • disk-smart-attributes
  • disk-smart-error-log
  • disk-smart-health
  • disk-smart-self-tests
  • disk-smart-stats
  • disk-smart-temperature

This allows administrators to acknowledge or silence individual aspects (e.g. a noisy error log) without ignoring the overall disk health.

Provide an Icinga Director service set for the new split checks, including Journald Query and Systemd Unit checks for smartd.service (#604).

Shared smartctl cache via lib/cache: Each plugin checks if cached smartctl data is older than 8 hours. If so, it calls smartctl --xall --json and updates the cache. Otherwise it reads from cache. No separate collector plugin needed.
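
The refresh logic could look roughly like the sketch below. It is illustrative only: `store` is a plain dict standing in for lib/cache (whose real API is not shown here), and `get_smart_data`, `is_stale` and `run_smartctl` are hypothetical names.

```python
import json
import subprocess
import time

CACHE_MAX_AGE = 8 * 60 * 60  # 8 hours in seconds, as proposed above


def is_stale(cached, now, max_age=CACHE_MAX_AGE):
    """Return True if there is no cache entry or it is older than max_age."""
    return cached is None or now - cached['timestamp'] > max_age


def run_smartctl(device):
    """Call smartctl and return its parsed JSON output."""
    result = subprocess.run(
        ['smartctl', '--xall', '--json', device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def get_smart_data(device, store, run=None, now=None):
    """Return smartctl data for device, refreshing the cache entry when stale.

    store: dict-like stand-in for lib/cache (assumption, not the real API).
    run:   injectable smartctl runner, so the logic can be tested without a disk.
    """
    now = time.time() if now is None else now
    run = run or run_smartctl
    cached = store.get(device)
    if is_stale(cached, now):
        cached = {'timestamp': now, 'data': run(device)}
        store[device] = cached
    return cached['data']
```

Because every split check goes through the same function, whichever check runs first after the 8 hours have elapsed pays for the smartctl call, and the others read the result.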

Shared helper library lib/disk_smart.py: Common functions across all disk-smart plugins (cache handling, smartctl invocation, JSON/text parsing, attribute database) are implemented in a shared library to avoid code duplication.

JSON parsing (smartctl 7.3+) with text fallback for older versions.
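
One possible way to pick the parser is to read the version from the first line of `smartctl --version`. The sketch below assumes that banner starts with `smartctl <major>.<minor>`; `parse_version`, `choose_parser` and `MIN_JSON_VERSION` are hypothetical names, and the 7.3 minimum is taken from the proposal above.

```python
import re

MIN_JSON_VERSION = (7, 3)  # minimum version for JSON parsing, per this proposal


def parse_version(banner):
    """Extract (major, minor) from the smartctl version banner, or None."""
    match = re.search(r'^smartctl (\d+)\.(\d+)', banner, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_parser(banner):
    """Return 'json' if smartctl is new enough, otherwise 'text'."""
    version = parse_version(banner)
    if version is not None and version >= MIN_JSON_VERSION:
        return 'json'
    # Unknown or older versions fall back to the text parser.
    return 'text'
```

Comparing tuples rather than strings avoids ordering mistakes such as "7.10" sorting before "7.3".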

NVMe support, modeled after GSmartControl 2.0, which implements a comprehensive multi-device parser architecture. NVMe drives expose additional health data that should be evaluated, including:

  • Critical Warning flags
  • Available Spare / Available Spare Threshold
  • Percentage Used (device life estimate)
  • Unsafe Shutdowns
  • Media and Data Integrity Errors
  • Warning/Critical Composite Temperature Time
  • Temperature Sensors
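
A minimal sketch of how some of these values might be evaluated from the smartctl JSON output. It assumes the data sits under `nvme_smart_health_information_log` with the key names used below; `check_nvme_health` is a hypothetical name, and the thresholds and the chosen states are placeholders, not part of this proposal.

```python
STATE_OK, STATE_WARN, STATE_CRIT = 0, 1, 2


def check_nvme_health(data, percentage_used_warn=90):
    """Return (state, findings) for the NVMe health log in smartctl JSON data."""
    # Key names are assumed from smartctl's JSON output for NVMe devices.
    log = data.get('nvme_smart_health_information_log', {})
    findings = []

    critical_warning = log.get('critical_warning', 0)
    if critical_warning:
        findings.append((
            STATE_CRIT,
            'Critical Warning flags set: 0x{:02x}'.format(critical_warning),
        ))

    spare = log.get('available_spare')
    threshold = log.get('available_spare_threshold')
    if spare is not None and threshold is not None and spare < threshold:
        findings.append((
            STATE_CRIT,
            'Available Spare {}% is below threshold {}%'.format(spare, threshold),
        ))

    used = log.get('percentage_used')
    if used is not None and used >= percentage_used_warn:
        findings.append((STATE_WARN, 'Percentage Used is {}%'.format(used)))

    media_errors = log.get('media_errors', 0)
    if media_errors:
        findings.append((
            STATE_WARN,
            '{} Media and Data Integrity Errors'.format(media_errors),
        ))

    # The worst individual finding determines the overall state.
    state = max((s for s, _ in findings), default=STATE_OK)
    return state, findings
```

Collecting all findings before computing the state keeps secondary problems visible even when one value is already critical.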

Data-driven attribute database instead of hardcoded if/elif chains, inspired by GSmartControl's storage_property_descr_ata_attribute.cpp (~200 attributes with HDD/SSD distinction).
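
To illustrate the idea, a data-driven lookup could look something like this. The three entries, the `raw_warn` limits and the names `ATTRIBUTE_DB` and `evaluate_attributes` are placeholders for illustration, not the proposed database. The input is assumed to be the `ata_smart_attributes.table` list from the smartctl JSON output.

```python
# Hypothetical excerpt of an attribute database, keyed by SMART attribute ID.
ATTRIBUTE_DB = {
    5: {'name': 'Reallocated Sector Count', 'disk_types': ('hdd', 'ssd'), 'raw_warn': 1},
    10: {'name': 'Spin Retry Count', 'disk_types': ('hdd',), 'raw_warn': 1},
    197: {'name': 'Current Pending Sector Count', 'disk_types': ('hdd', 'ssd'), 'raw_warn': 1},
}


def evaluate_attributes(table, disk_type):
    """Return warning messages for attributes whose raw value reaches raw_warn.

    table:     list of attribute dicts with 'id' and 'raw': {'value': ...}.
    disk_type: 'hdd' or 'ssd'; rules that do not apply to it are skipped.
    """
    findings = []
    for entry in table:
        rule = ATTRIBUTE_DB.get(entry['id'])
        if rule is None or disk_type not in rule['disk_types']:
            continue  # unknown attribute, or rule not meant for this disk type
        raw = entry['raw']['value']
        if raw >= rule['raw_warn']:
            findings.append(
                '{} is {} (warn at {})'.format(rule['name'], raw, rule['raw_warn'])
            )
    return findings
```

Supporting a new attribute then means adding one dictionary entry instead of another if/elif branch.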

Improved error handling: Show smartctl output on failure for easier debugging.
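
As a rough sketch, the failure message could combine smartctl's exit status with its raw output. According to the smartctl man page, bits 0 to 2 of the exit status indicate invocation problems, while the higher bits describe disk status. `describe_failure` and `FATAL_BITS` are hypothetical names.

```python
# Exit status bits 0-2 as documented in the smartctl man page (paraphrased).
FATAL_BITS = {
    0: 'command line did not parse',
    1: 'device open failed',
    2: 'a SMART or ATA command failed, or a checksum error occurred',
}


def describe_failure(returncode, output):
    """Return an error message including smartctl's output, or None.

    None means that none of the invocation-failure bits (0-2) is set;
    higher bits report disk status and are left to the individual checks.
    """
    reasons = [text for bit, text in FATAL_BITS.items() if returncode & (1 << bit)]
    if not reasons:
        return None
    return 'smartctl failed ({}). Output:\n{}'.format('; '.join(reasons), output.strip())
```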

Closes #4
Closes #222
Closes #291
Closes #568
Closes #604
Closes #671

Labels: enhancement (New feature or request)