Skip to content

pmda: add NIXL Performance Metrics Domain Agent#2599

Open
natoscott wants to merge 1 commit into
performancecopilot:mainfrom
natoscott:nixl-pmda
Open

pmda: add NIXL Performance Metrics Domain Agent#2599
natoscott wants to merge 1 commit into
performancecopilot:mainfrom
natoscott:nixl-pmda

Conversation

@natoscott
Copy link
Copy Markdown
Member

Summary

  • New pmdanixl Python PMDA reads telemetry from NIXL (NVIDIA Inference Xfer Library) agents via their shared-memory ring buffers
  • Each running NIXL agent becomes a separate instance in the nixl instance domain; 20 metrics across 4 clusters: transfer, memory, performance, errors
  • Configured via /etc/pcp/nixl/nixl.conf — supports telemetry_dir and user options so PMDA permissions align with the NIXL application user
  • Domain 39 (FREE_39) assigned in stdpmid
  • pcp-pmda-nixl RPM sub-package added to pcp.spec.in
  • QA test 1999 added; uses qa/src/nixl_telemetry.python to synthesise a ring-buffer file with known values and verifies per-agent metric values without requiring a live NIXL application
  • nixl added to qa/common.filter (domain 39, pmda name, top-pmns entries)

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Adds NVIDIA NIXL PMDA to collect telemetry via shared-memory ring buffers and expose per-agent transfer, memory, performance, and error metrics
    • Installation/uninstallation support via packaged PMDA and configuration options; PMDA now occupies domain 39
  • Tests

    • End-to-end QA test and synthetic telemetry generator for NIXL validation
  • Documentation

    • Added man page and configuration documentation for the NIXL PMDA

Walkthrough

Adds a new NIXL PMDA (domain 39): telemetry parsing and ring-buffer reader, metric registration and fetch callbacks, build/install scripts, config and manpage, QA telemetry generator and test, QA filter/group integration, and RPM subpackage packaging/install/uninstall logic.

Changes

NIXL PMDA Implementation and Integration

Layer / File(s) Summary
Domain registration
src/pmns/stdpmid.pcp
Domain 39 is allocated to NIXL in the PCP namespace registry.
Telemetry structures and counter mappings
src/pmdas/nixl/pmdanixl.python
Ctypes structures for telemetry events and buffer headers, per-agent counter groupings, and event-type-to-field mappings are defined for accumulation and fetch routing.
Ring buffer reader and PMDA metric registration
src/pmdas/nixl/pmdanixl.python
NixlRingBuffer opens and mmaps telemetry files with O_NOFOLLOW, validates versions, drains events into per-agent stats; NIXLPMDA scans telemetry_dir, registers indom/metrics, wires fetch callbacks, and refreshes indom.
Build system, installation, and configuration
src/pmdas/GNUmakefile, src/pmdas/nixl/GNUmakefile, src/pmdas/nixl/Install, src/pmdas/nixl/Remove, src/pmdas/nixl/nixl.conf
Makefiles add nixl to PYPMDAS and provide HAVE_PYTHON-gated build/install targets; Install/Remove scripts run pmdaInstall/pmdaRemove; nixl.conf documents runtime settings.
Manual page
src/pmdas/nixl/pmdanixl.1
pmdanixl(1) documents PMDA purpose, exported metric namespaces, configuration, installation/removal, environment variables, and file locations.
QA test infrastructure and scripts
qa/src/nixl_telemetry.python, qa/1999, qa/common.filter, qa/group
Deterministic telemetry generator and end-to-end QA script test PMDA install/registration and sample metrics; QA filters and group entry integrate the test into the QA harness.
RPM subpackage definition
build/rpm/pcp.spec.in
RPM spec adds %package pmda-nixl with dependencies, includes nixl in manifest/packages.list, adds %preun removal scriptlet, and a %files pmda-nixl -f ... stanza for package payload lifecycle.

Poem

A rabbit wrote a tiny test,
A ring buffer baked and placed to rest,
Domain thirty-nine hums soft and bright,
Metrics counted through the night,
Install, remove, then metrics gleam—hooray! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new NIXL Performance Metrics Domain Agent (PMDA) to the project.
Description check ✅ Passed The description is directly related to the changeset, providing a comprehensive summary of the NIXL PMDA implementation, configuration, packaging, and QA additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
qa/src/nixl_telemetry.python (1)

120-120: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a single trailing newline at EOF.

This file appears to end without a final newline; please add exactly one trailing newline.

As per coding guidelines "Ensure all text and source files end with a single trailing newline character".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@qa/src/nixl_telemetry.python` at line 120, The file nixl_telemetry.python
ends without a final newline; open the file and add exactly one trailing newline
character at EOF so it ends with a single newline (no extra blank lines), then
save; this ensures the file conforms to the "single trailing newline" rule.
qa/1999 (1)

114-114: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a single trailing newline at EOF.

This file appears to end without a final newline; please add exactly one trailing newline.

As per coding guidelines "Ensure all text and source files end with a single trailing newline character".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@qa/1999` at line 114, Add exactly one trailing newline character at the end
of this file so it ends with a single newline (ensure there are no extra blank
lines); modify the file's EOF to include one '\n' character so editors and tools
recognize the final newline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@qa/1999`:
- Line 77: The failure message uses the wrong variable name ($seq.full) for the
log path; update the echo statement that prints "Arrgh! PMDA install failed ...
see $seq.full" to reference the correct variable $seq_full so it matches the
rest of the script's log file usage.

In `@qa/src/nixl_telemetry.python`:
- Around line 16-17: Update the usage/help text to reference the actual script
filename: replace occurrences of "create_telemetry.python" with
"nixl_telemetry.python" in the file (the user-facing usage string near the top
and the later usage line around the end); ensure any error or help messages that
print the script name use "nixl_telemetry.python" so the displayed usage matches
the real filename.

In `@src/pmdas/nixl/pmdanixl.1`:
- Around line 27-30: The man page currently lists two different default
telemetry directories; pick one canonical default (either "/tmp" or "/var/tmp")
and update every occurrence to match it: replace the default shown next to the
NIXL_TELEMETRY_DIR mention and the other instances around the page (the
occurrences near lines referencing the NIXL_TELEMETRY_DIR and its description).
Ensure the same string is used consistently for all mentions (e.g., update the
text fragments that currently say ".BR /tmp" or "/var/tmp" to the single chosen
default) so all references to NIXL_TELEMETRY_DIR and its default are identical
across the file.

In `@src/pmdas/nixl/pmdanixl.python`:
- Around line 21-24: The module docstring currently states the default telemetry
directory as "/tmp" but the code uses "/var/tmp"; update the top-level module
docstring in src/pmdas/nixl/pmdanixl.python to state the default
NIXL_TELEMETRY_DIR is "/var/tmp" (or change the code to use "/tmp" if you prefer
the docstring) so the documented default matches the actual value used by the
NIXL_TELEMETRY_DIR constant/usage in the module.
- Around line 303-345: The _agents dict is never pruned so stale/vanished agents
remain; update _discover_agents and _drain_all so that when an agent is removed
or its ring buffer fails it is evicted from both caches. Specifically: in
_discover_agents, detect when an existing key in self._buffers no longer has a
backing file and close and delete both self._buffers[name] and
self._agents[name]; in _drain_all, when an exception marks an agent dead, close
and remove entries from both self._buffers and self._agents (not just
self._buffers); ensure _fetch continues to call _discover_agents and _drain_all
so replace_indom(self._agent_indom, self._agents) only publishes
currently-existing agents.
- Around line 38-39: The telemetry discovery/open path must reject symlinks and
validate files before mmap/open: in _discover_agents() use os.lstat to skip any
entry where stat.S_ISLNK(mode) is true (or call os.open with os.O_NOFOLLOW) and
only consider files where stat.S_ISREG(mode); in NixlRingBuffer.__init__ open
the file with os.open flags including O_NOFOLLOW (and use O_RDONLY if write not
needed), then fstat the returned fd and verify S_ISREG and expected uid/gid and
mode bits before proceeding to mmap; if any check fails, close the fd and
skip/report the file.

---

Outside diff comments:
In `@qa/1999`:
- Line 114: Add exactly one trailing newline character at the end of this file
so it ends with a single newline (ensure there are no extra blank lines); modify
the file's EOF to include one '\n' character so editors and tools recognize the
final newline.

In `@qa/src/nixl_telemetry.python`:
- Line 120: The file nixl_telemetry.python ends without a final newline; open
the file and add exactly one trailing newline character at EOF so it ends with a
single newline (no extra blank lines), then save; this ensures the file conforms
to the "single trailing newline" rule.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 662f5ac2-9eb2-450e-9b4b-8408a48f9b2d

📥 Commits

Reviewing files that changed from the base of the PR and between 3c60b37 and f689c20.

⛔ Files ignored due to path filters (1)
  • qa/1999.out is excluded by !**/*.out
📒 Files selected for processing (13)
  • build/rpm/pcp.spec.in
  • qa/1999
  • qa/common.filter
  • qa/group
  • qa/src/nixl_telemetry.python
  • src/pmdas/GNUmakefile
  • src/pmdas/nixl/GNUmakefile
  • src/pmdas/nixl/Install
  • src/pmdas/nixl/Remove
  • src/pmdas/nixl/nixl.conf
  • src/pmdas/nixl/pmdanixl.1
  • src/pmdas/nixl/pmdanixl.python
  • src/pmns/stdpmid.pcp

Comment thread qa/1999 Outdated
Comment thread qa/src/nixl_telemetry.python Outdated
Comment thread src/pmdas/nixl/pmdanixl.1 Outdated
Comment thread src/pmdas/nixl/pmdanixl.python
Comment thread src/pmdas/nixl/pmdanixl.python
Comment thread src/pmdas/nixl/pmdanixl.python
pmdanixl reads telemetry from NIXL (NVIDIA Inference Xfer Library)
agents via their shared-memory ring buffers and exposes accumulated
per-agent counters to PCP.

Each running NIXL agent that has written a telemetry file to the
configured telemetry_dir becomes a separate instance in the nixl
instance domain.  Twenty metrics are provided across four clusters:
transfer (TX/RX bytes and request counts), memory (registered and
deregistered sizes), performance (transfer times in microseconds),
and errors (twelve error event types).

The PMDA is configured via /etc/pcp/nixl/nixl.conf which supports
setting the telemetry_dir and the user account under which the PMDA
runs, allowing permissions to align with the NIXL application user
without requiring world-writable telemetry files.

Domain 39 (FREE_39) is assigned to NIXL in stdpmid.  A QA test
(1999) uses a synthetic ring-buffer file created by
qa/src/nixl_telemetry.python to verify metric values without
requiring a live NIXL application.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@natoscott
Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pmdas/nixl/pmdanixl.python`:
- Around line 314-327: The current try/except around
os.listdir(self._telemetry_dir) swallows OSError and sets entries = set(), which
causes the loop over self._buffers to think all files are gone and evict every
cached agent (_buffers and _agents), losing state on transient FS errors;
instead, if os.listdir raises OSError, do not set entries to empty—log or record
the error and skip the purge step (i.e., only run the for fname in
list(self._buffers) purge when a valid entries set was obtained). Concretely:
change the except branch so it does not assign entries = set(), set a flag like
entries_available=False (or return/continue), and only perform the missing-file
cleanup when entries_available is True, referencing self._telemetry_dir,
_buffers and _agents in the fix.
- Around line 196-215: NixlRingBuffer.__init__ can leak file descriptors and
mmap objects if an error occurs after the first mmap; wrap the initialization
logic that creates hdr_map, the second mmap (self._map) and binds
self._hdr/self._data in a try/except/finally (or use context managers) so any
partial state is cleaned up: ensure hdr_map is closed and self._fd is closed on
failure, and if self._map was successfully created it is closed and any
self._hdr/self._data references are deleted before re-raising the exception;
reference BufferHeader, NixlTelemetryEvent, self._fd, hdr_map, self._map,
self._hdr and self._data when adding the cleanup.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 5a3a625c-0486-48ef-990e-c4903375e07f

📥 Commits

Reviewing files that changed from the base of the PR and between f689c20 and a53a9e2.

⛔ Files ignored due to path filters (1)
  • qa/1999.out is excluded by !**/*.out
📒 Files selected for processing (13)
  • build/rpm/pcp.spec.in
  • qa/1999
  • qa/common.filter
  • qa/group
  • qa/src/nixl_telemetry.python
  • src/pmdas/GNUmakefile
  • src/pmdas/nixl/GNUmakefile
  • src/pmdas/nixl/Install
  • src/pmdas/nixl/Remove
  • src/pmdas/nixl/nixl.conf
  • src/pmdas/nixl/pmdanixl.1
  • src/pmdas/nixl/pmdanixl.python
  • src/pmns/stdpmid.pcp
✅ Files skipped from review due to trivial changes (2)
  • src/pmdas/nixl/nixl.conf
  • src/pmdas/nixl/pmdanixl.1

Comment on lines +196 to +215
hdr_size = ctypes.sizeof(BufferHeader)

hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
tmp_hdr = BufferHeader.from_buffer(hdr_map)
if tmp_hdr.version != TELEMETRY_VERSION:
del tmp_hdr
hdr_map.close()
os.close(self._fd)
raise ValueError("version mismatch: not a NIXL telemetry file")
capacity = tmp_hdr.capacity
del tmp_hdr
hdr_map.close()

total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
self._hdr = BufferHeader.from_buffer(self._map)
self._data = (NixlTelemetryEvent * capacity).from_buffer(
self._map, hdr_size)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add cleanup for late initialization failures in NixlRingBuffer.__init__.

If setup fails after Line 196 (e.g., second mmap or buffer binding), the fd/map path is not fully guarded, so malformed files can repeatedly leak resources and eventually block new opens.

Suggested fix
         hdr_map.close()

-        total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
-        self._map  = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
-                               mmap.PROT_READ | mmap.PROT_WRITE)
-        self._hdr  = BufferHeader.from_buffer(self._map)
-        self._data = (NixlTelemetryEvent * capacity).from_buffer(
-            self._map, hdr_size)
+        self._map = None
+        try:
+            total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
+            self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
+                                  mmap.PROT_READ | mmap.PROT_WRITE)
+            self._hdr = BufferHeader.from_buffer(self._map)
+            self._data = (NixlTelemetryEvent * capacity).from_buffer(
+                self._map, hdr_size)
+        except Exception:
+            if hasattr(self, "_data"):
+                del self._data
+            if hasattr(self, "_hdr"):
+                del self._hdr
+            if self._map is not None:
+                self._map.close()
+                self._map = None
+            os.close(self._fd)
+            self._fd = -1
+            raise
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
hdr_size = ctypes.sizeof(BufferHeader)
hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
tmp_hdr = BufferHeader.from_buffer(hdr_map)
if tmp_hdr.version != TELEMETRY_VERSION:
del tmp_hdr
hdr_map.close()
os.close(self._fd)
raise ValueError("version mismatch: not a NIXL telemetry file")
capacity = tmp_hdr.capacity
del tmp_hdr
hdr_map.close()
total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
self._hdr = BufferHeader.from_buffer(self._map)
self._data = (NixlTelemetryEvent * capacity).from_buffer(
self._map, hdr_size)
hdr_size = ctypes.sizeof(BufferHeader)
hdr_map = mmap.mmap(self._fd, hdr_size, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
tmp_hdr = BufferHeader.from_buffer(hdr_map)
if tmp_hdr.version != TELEMETRY_VERSION:
del tmp_hdr
hdr_map.close()
os.close(self._fd)
raise ValueError("version mismatch: not a NIXL telemetry file")
capacity = tmp_hdr.capacity
del tmp_hdr
hdr_map.close()
self._map = None
try:
total = hdr_size + ctypes.sizeof(NixlTelemetryEvent) * capacity
self._map = mmap.mmap(self._fd, total, mmap.MAP_SHARED,
mmap.PROT_READ | mmap.PROT_WRITE)
self._hdr = BufferHeader.from_buffer(self._map)
self._data = (NixlTelemetryEvent * capacity).from_buffer(
self._map, hdr_size)
except Exception:
if hasattr(self, "_data"):
del self._data
if hasattr(self, "_hdr"):
del self._hdr
if self._map is not None:
self._map.close()
self._map = None
os.close(self._fd)
self._fd = -1
raise
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pmdas/nixl/pmdanixl.python` around lines 196 - 215,
NixlRingBuffer.__init__ can leak file descriptors and mmap objects if an error
occurs after the first mmap; wrap the initialization logic that creates hdr_map,
the second mmap (self._map) and binds self._hdr/self._data in a
try/except/finally (or use context managers) so any partial state is cleaned up:
ensure hdr_map is closed and self._fd is closed on failure, and if self._map was
successfully created it is closed and any self._hdr/self._data references are
deleted before re-raising the exception; reference BufferHeader,
NixlTelemetryEvent, self._fd, hdr_map, self._map, self._hdr and self._data when
adding the cleanup.

Comment on lines +314 to +327
try:
entries = set(os.listdir(self._telemetry_dir))
except OSError:
entries = set()

for fname in list(self._buffers):
if fname not in entries:
try:
self._buffers[fname].close()
except Exception:
pass
del self._buffers[fname]
del self._agents[fname]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid purging all agents on a transient directory read failure.

At Line 316, converting OSError into entries = set() causes Line 319-Line 327 to evict every cached agent, which drops visible instances and resets accumulated counters on temporary FS/permission glitches.

Suggested fix
     def _discover_agents(self):
         """Scan telemetry dir, open ring buffers for new agents, evict any removed."""
         try:
             entries = set(os.listdir(self._telemetry_dir))
-        except OSError:
-            entries = set()
+        except OSError:
+            return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pmdas/nixl/pmdanixl.python` around lines 314 - 327, The current
try/except around os.listdir(self._telemetry_dir) swallows OSError and sets
entries = set(), which causes the loop over self._buffers to think all files are
gone and evict every cached agent (_buffers and _agents), losing state on
transient FS errors; instead, if os.listdir raises OSError, do not set entries
to empty—log or record the error and skip the purge step (i.e., only run the for
fname in list(self._buffers) purge when a valid entries set was obtained).
Concretely: change the except branch so it does not assign entries = set(), set
a flag like entries_available=False (or return/continue), and only perform the
missing-file cleanup when entries_available is True, referencing
self._telemetry_dir, _buffers and _agents in the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant