pmda: add NIXL Performance Metrics Domain Agent#2599
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a new NIXL PMDA (domain 39): telemetry parsing and ring-buffer reader, metric registration and fetch callbacks, build/install scripts, config and manpage, QA telemetry generator and test, QA filter/group integration, and RPM subpackage packaging/install/uninstall logic. ChangesNIXL PMDA Implementation and Integration
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
qa/src/nixl_telemetry.python (1)
120-120:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a single trailing newline at EOF.
This file appears to end without a final newline; please add exactly one trailing newline.
As per coding guidelines "Ensure all text and source files end with a single trailing newline character".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@qa/src/nixl_telemetry.python` at line 120, The file nixl_telemetry.python ends without a final newline; open the file and add exactly one trailing newline character at EOF so it ends with a single newline (no extra blank lines), then save; this ensures the file conforms to the "single trailing newline" rule.qa/1999 (1)
114-114:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a single trailing newline at EOF.
This file appears to end without a final newline; please add exactly one trailing newline.
As per coding guidelines "Ensure all text and source files end with a single trailing newline character".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@qa/1999` at line 114, Add exactly one trailing newline character at the end of this file so it ends with a single newline (ensure there are no extra blank lines); modify the file's EOF to include one '\n' character so editors and tools recognize the final newline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@qa/1999`:
- Line 77: The failure message uses the wrong variable name ($seq.full) for the
log path; update the echo statement that prints "Arrgh! PMDA install failed ...
see $seq.full" to reference the correct variable $seq_full so it matches the
rest of the script's log file usage.
In `@qa/src/nixl_telemetry.python`:
- Around line 16-17: Update the usage/help text to reference the actual script
filename: replace occurrences of "create_telemetry.python" with
"nixl_telemetry.python" in the file (the user-facing usage string near the top
and the later usage line around the end); ensure any error or help messages that
print the script name use "nixl_telemetry.python" so the displayed usage matches
the real filename.
In `@src/pmdas/nixl/pmdanixl.1`:
- Around line 27-30: The man page currently lists two different default
telemetry directories; pick one canonical default (either "/tmp" or "/var/tmp")
and update every occurrence to match it: replace the default shown next to the
NIXL_TELEMETRY_DIR mention and the other instances around the page (the
occurrences near lines referencing the NIXL_TELEMETRY_DIR and its description).
Ensure the same string is used consistently for all mentions (e.g., update the
text fragments that currently say ".BR /tmp" or "/var/tmp" to the single chosen
default) so all references to NIXL_TELEMETRY_DIR and its default are identical
across the file.
In `@src/pmdas/nixl/pmdanixl.python`:
- Around line 21-24: The module docstring currently states the default telemetry
directory as "/tmp" but the code uses "/var/tmp"; update the top-level module
docstring in src/pmdas/nixl/pmdanixl.python to state the default
NIXL_TELEMETRY_DIR is "/var/tmp" (or change the code to use "/tmp" if you prefer
the docstring) so the documented default matches the actual value used by the
NIXL_TELEMETRY_DIR constant/usage in the module.
- Around line 303-345: The _agents dict is never pruned so stale/vanished agents
remain; update _discover_agents and _drain_all so that when an agent is removed
or its ring buffer fails it is evicted from both caches. Specifically: in
_discover_agents, detect when an existing key in self._buffers no longer has a
backing file and close and delete both self._buffers[name] and
self._agents[name]; in _drain_all, when an exception marks an agent dead, close
and remove entries from both self._buffers and self._agents (not just
self._buffers); ensure _fetch continues to call _discover_agents and _drain_all
so replace_indom(self._agent_indom, self._agents) only publishes
currently-existing agents.
- Around line 38-39: The telemetry discovery/open path must reject symlinks and
validate files before mmap/open: in _discover_agents() use os.lstat to skip any
entry where stat.S_ISLNK(mode) is true (or call os.open with os.O_NOFOLLOW) and
only consider files where stat.S_ISREG(mode); in NixlRingBuffer.__init__ open
the file with os.open flags including O_NOFOLLOW (and use O_RDONLY if write not
needed), then fstat the returned fd and verify S_ISREG and expected uid/gid and
mode bits before proceeding to mmap; if any check fails, close the fd and
skip/report the file.
---
Outside diff comments:
In `@qa/1999`:
- Line 114: Add exactly one trailing newline character at the end of this file
so it ends with a single newline (ensure there are no extra blank lines); modify
the file's EOF to include one '\n' character so editors and tools recognize the
final newline.
In `@qa/src/nixl_telemetry.python`:
- Line 120: The file nixl_telemetry.python ends without a final newline; open
the file and add exactly one trailing newline character at EOF so it ends with a
single newline (no extra blank lines), then save; this ensures the file conforms
to the "single trailing newline" rule.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 662f5ac2-9eb2-450e-9b4b-8408a48f9b2d
⛔ Files ignored due to path filters (1)
qa/1999.outis excluded by!**/*.out
📒 Files selected for processing (13)
build/rpm/pcp.spec.inqa/1999qa/common.filterqa/groupqa/src/nixl_telemetry.pythonsrc/pmdas/GNUmakefilesrc/pmdas/nixl/GNUmakefilesrc/pmdas/nixl/Installsrc/pmdas/nixl/Removesrc/pmdas/nixl/nixl.confsrc/pmdas/nixl/pmdanixl.1src/pmdas/nixl/pmdanixl.pythonsrc/pmns/stdpmid.pcp
pmdanixl reads telemetry from NIXL (NVIDIA Inference Xfer Library) agents via their shared-memory ring buffers and exposes accumulated per-agent counters to PCP. Each running NIXL agent that has written a telemetry file to the configured telemetry_dir becomes a separate instance in the nixl instance domain. Twenty metrics are provided across four clusters: transfer (TX/RX bytes and request counts), memory (registered and deregistered sizes), performance (transfer times in microseconds), and errors (twelve error event types). The PMDA is configured via /etc/pcp/nixl/nixl.conf which supports setting the telemetry_dir and the user account under which the PMDA runs, allowing permissions to align with the NIXL application user without requiring world-writable telemetry files. Domain 39 (FREE_39) is assigned to NIXL in stdpmid. A QA test (1999) uses a synthetic ring-buffer file created by qa/src/nixl_telemetry.python to verify metric values without requiring a live NIXL application. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/pmdas/nixl/pmdanixl.python`:
- Around line 314-327: The current try/except around
os.listdir(self._telemetry_dir) swallows OSError and sets entries = set(), which
causes the loop over self._buffers to think all files are gone and evict every
cached agent (_buffers and _agents), losing state on transient FS errors;
instead, if os.listdir raises OSError, do not set entries to empty—log or record
the error and skip the purge step (i.e., only run the for fname in
list(self._buffers) purge when a valid entries set was obtained). Concretely:
change the except branch so it does not assign entries = set(), set a flag like
entries_available=False (or return/continue), and only perform the missing-file
cleanup when entries_available is True, referencing self._telemetry_dir,
_buffers and _agents in the fix.
- Around line 196-215: NixlRingBuffer.__init__ can leak file descriptors and
mmap objects if an error occurs after the first mmap; wrap the initialization
logic that creates hdr_map, the second mmap (self._map) and binds
self._hdr/self._data in a try/except/finally (or use context managers) so any
partial state is cleaned up: ensure hdr_map is closed and self._fd is closed on
failure, and if self._map was successfully created it is closed and any
self._hdr/self._data references are deleted before re-raising the exception;
reference BufferHeader, NixlTelemetryEvent, self._fd, hdr_map, self._map,
self._hdr and self._data when adding the cleanup.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 5a3a625c-0486-48ef-990e-c4903375e07f
⛔ Files ignored due to path filters (1)
qa/1999.outis excluded by!**/*.out
📒 Files selected for processing (13)
build/rpm/pcp.spec.inqa/1999qa/common.filterqa/groupqa/src/nixl_telemetry.pythonsrc/pmdas/GNUmakefilesrc/pmdas/nixl/GNUmakefilesrc/pmdas/nixl/Installsrc/pmdas/nixl/Removesrc/pmdas/nixl/nixl.confsrc/pmdas/nixl/pmdanixl.1src/pmdas/nixl/pmdanixl.pythonsrc/pmns/stdpmid.pcp
✅ Files skipped from review due to trivial changes (2)
- src/pmdas/nixl/nixl.conf
- src/pmdas/nixl/pmdanixl.1
| hdr_size = ctypes.sizeof(BufferHeader) | ||
|
|
||
| hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED, | ||
| mmap.PROT_READ | mmap.PROT_WRITE) | ||
| tmp_hdr = BufferHeader.from_buffer(hdr_map) | ||
| if tmp_hdr.version != TELEMETRY_VERSION: | ||
| del tmp_hdr | ||
| hdr_map.close() | ||
| os.close(self._fd) | ||
| raise ValueError("version mismatch: not a NIXL telemetry file") | ||
| capacity = tmp_hdr.capacity | ||
| del tmp_hdr | ||
| hdr_map.close() | ||
|
|
||
| total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity | ||
| self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED, | ||
| mmap.PROT_READ | mmap.PROT_WRITE) | ||
| self._hdr = BufferHeader.from_buffer(self._map) | ||
| self._data = (NixlTelemetryEvent * capacity).from_buffer( | ||
| self._map, hdr_size) |
There was a problem hiding this comment.
Add cleanup for late initialization failures in NixlRingBuffer.__init__.
If setup fails after Line 196 (e.g., second mmap or buffer binding), the fd/map path is not fully guarded, so malformed files can repeatedly leak resources and eventually block new opens.
Suggested fix
hdr_map.close()
- total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
- self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
- mmap.PROT_READ | mmap.PROT_WRITE)
- self._hdr = BufferHeader.from_buffer(self._map)
- self._data = (NixlTelemetryEvent * capacity).from_buffer(
- self._map, hdr_size)
+ self._map = None
+ try:
+ total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
+ self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
+ mmap.PROT_READ | mmap.PROT_WRITE)
+ self._hdr = BufferHeader.from_buffer(self._map)
+ self._data = (NixlTelemetryEvent * capacity).from_buffer(
+ self._map, hdr_size)
+ except Exception:
+ if hasattr(self, "_data"):
+ del self._data
+ if hasattr(self, "_hdr"):
+ del self._hdr
+ if self._map is not None:
+ self._map.close()
+ self._map = None
+ os.close(self._fd)
+ self._fd = -1
+ raise📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| hdr_size = ctypes.sizeof(BufferHeader) | |
| hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED, | |
| mmap.PROT_READ | mmap.PROT_WRITE) | |
| tmp_hdr = BufferHeader.from_buffer(hdr_map) | |
| if tmp_hdr.version != TELEMETRY_VERSION: | |
| del tmp_hdr | |
| hdr_map.close() | |
| os.close(self._fd) | |
| raise ValueError("version mismatch: not a NIXL telemetry file") | |
| capacity = tmp_hdr.capacity | |
| del tmp_hdr | |
| hdr_map.close() | |
| total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity | |
| self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED, | |
| mmap.PROT_READ | mmap.PROT_WRITE) | |
| self._hdr = BufferHeader.from_buffer(self._map) | |
| self._data = (NixlTelemetryEvent * capacity).from_buffer( | |
| self._map, hdr_size) | |
| hdr_size = ctypes.sizeof(BufferHeader) | |
| hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED, | |
| mmap.PROT_READ | mmap.PROT_WRITE) | |
| tmp_hdr = BufferHeader.from_buffer(hdr_map) | |
| if tmp_hdr.version != TELEMETRY_VERSION: | |
| del tmp_hdr | |
| hdr_map.close() | |
| os.close(self._fd) | |
| raise ValueError("version mismatch: not a NIXL telemetry file") | |
| capacity = tmp_hdr.capacity | |
| del tmp_hdr | |
| hdr_map.close() | |
| self._map = None | |
| try: | |
| total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity | |
| self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED, | |
| mmap.PROT_READ | mmap.PROT_WRITE) | |
| self._hdr = BufferHeader.from_buffer(self._map) | |
| self._data = (NixlTelemetryEvent * capacity).from_buffer( | |
| self._map, hdr_size) | |
| except Exception: | |
| if hasattr(self, "_data"): | |
| del self._data | |
| if hasattr(self, "_hdr"): | |
| del self._hdr | |
| if self._map is not None: | |
| self._map.close() | |
| self._map = None | |
| os.close(self._fd) | |
| self._fd = -1 | |
| raise |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/pmdas/nixl/pmdanixl.python` around lines 196 - 215,
NixlRingBuffer.__init__ can leak file descriptors and mmap objects if an error
occurs after the first mmap; wrap the initialization logic that creates hdr_map,
the second mmap (self._map) and binds self._hdr/self._data in a
try/except/finally (or use context managers) so any partial state is cleaned up:
ensure hdr_map is closed and self._fd is closed on failure, and if self._map was
successfully created it is closed and any self._hdr/self._data references are
deleted before re-raising the exception; reference BufferHeader,
NixlTelemetryEvent, self._fd, hdr_map, self._map, self._hdr and self._data when
adding the cleanup.
| try: | ||
| entries = set(os.listdir(self._telemetry_dir)) | ||
| except OSError: | ||
| entries = set() | ||
|
|
||
| for fname in list(self._buffers): | ||
| if fname not in entries: | ||
| try: | ||
| self._buffers[fname].close() | ||
| except Exception: | ||
| pass | ||
| del self._buffers[fname] | ||
| del self._agents[fname] | ||
|
|
There was a problem hiding this comment.
Avoid purging all agents on a transient directory read failure.
At Line 316, converting OSError into entries = set() causes Line 319-Line 327 to evict every cached agent, which drops visible instances and resets accumulated counters on temporary FS/permission glitches.
Suggested fix
def _discover_agents(self):
"""Scan telemetry dir, open ring buffers for new agents, evict any removed."""
try:
entries = set(os.listdir(self._telemetry_dir))
- except OSError:
- entries = set()
+ except OSError:
+ return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/pmdas/nixl/pmdanixl.python` around lines 314 - 327, The current
try/except around os.listdir(self._telemetry_dir) swallows OSError and sets
entries = set(), which causes the loop over self._buffers to think all files are
gone and evict every cached agent (_buffers and _agents), losing state on
transient FS errors; instead, if os.listdir raises OSError, do not set entries
to empty—log or record the error and skip the purge step (i.e., only run the for
fname in list(self._buffers) purge when a valid entries set was obtained).
Concretely: change the except branch so it does not assign entries = set(), set
a flag like entries_available=False (or return/continue), and only perform the
missing-file cleanup when entries_available is True, referencing
self._telemetry_dir, _buffers and _agents in the fix.
Summary
pmdanixlPython PMDA reads telemetry from NIXL (NVIDIA Inference Xfer Library) agents via their shared-memory ring bufferstransfer,memory,performance,errors/etc/pcp/nixl/nixl.conf— supportstelemetry_diranduseroptions so PMDA permissions align with the NIXL application userFREE_39) assigned instdpmidpcp-pmda-nixlRPM sub-package added topcp.spec.inqa/src/nixl_telemetry.pythonto synthesise a ring-buffer file with known values and verifies per-agent metric values without requiring a live NIXL applicationnixladded toqa/common.filter(domain 39, pmda name, top-pmns entries)🤖 Generated with Claude Code