Darwin PMDA Phase 2: Apple Silicon metrics expansion #2466
natoscott merged 58 commits into performancecopilot:main
Focus on M-series Macs with graceful degradation for Intel. Catalogs ~100 additional metrics across thermal monitoring, GPU utilization, battery/power, enhanced process I/O, IPv6, disk queues, and system limits.
Expose system resource limits via sysctl-based metrics providing visibility into kernel-enforced process and file descriptor limits.
New metrics:
- kernel.limits.maxproc (kern.maxproc)
- kernel.limits.maxprocperuid (kern.maxprocperuid)
- kernel.limits.maxfiles (kern.maxfiles)
- kernel.limits.maxfilesperproc (kern.maxfilesperproc)
- vfs.vnodes.recycled (kern.num_recycledvnodes)
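For cross-checking the values these metrics should report, the backing sysctl keys can be read directly from a shell. The following is a minimal sketch, not the PMDA's C implementation: the helper name read_limit is hypothetical, and the "unavailable" fallback is an assumption made so the script also runs on hosts where the keys do not exist.

```shell
# Print the value of one sysctl key, or "unavailable" when it cannot be read
# (for example on a non-macOS host, or when sysctl is not installed).
# read_limit is a hypothetical helper name, not part of the PMDA.
read_limit() {
    if command -v sysctl >/dev/null 2>&1 \
        && value=$(sysctl -n "$1" 2>/dev/null) \
        && [ -n "$value" ]; then
        echo "$value"
    else
        echo "unavailable"
    fi
}

# Metric name and backing sysctl key, as listed in the commit message.
while read -r metric key; do
    printf '%-32s %s\n' "$metric" "$(read_limit "$key")"
done <<EOF
kernel.limits.maxproc kern.maxproc
kernel.limits.maxprocperuid kern.maxprocperuid
kernel.limits.maxfiles kern.maxfiles
kernel.limits.maxfilesperproc kern.maxfilesperproc
vfs.vnodes.recycled kern.num_recycledvnodes
EOF
```

On a Mac this prints one line per metric with the kernel's current value; elsewhere every line reads "unavailable".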
Adds visibility into macOS memory compression performance and health for diagnosing memory pressure. Implements Phase 2 step 2 with six new sysctl-based metrics: timing buckets (30s/60s/300s), thrashing detection, major compactions, and LZ4 compression counts.
Add proc.io.logical_writes and proc.memory.footprint to surface data already fetched via proc_pid_rusage(). Uses rusage_info_v4 with v3 fallback for older macOS versions.
Documents architecture for IOKit-based GPU monitoring on macOS. Covers utilization and memory metrics with TDD approach.
Enable visibility into GPU workloads on macOS via IOKit IOAccelerator. Exposes utilization and memory usage for both Apple Silicon and Intel GPUs.
Completes GPU monitoring by integrating metrics into build system, instance domains, fetch callbacks, and test suite.
Adds completion tracking showing 19/100 metrics (19%) implemented across Wave 1 and Wave 2. Documents completed work: GPU monitoring, memory compression deep dive, system limits, and process I/O metrics.
Document planned macstat views for GPU, power/battery, and thermal monitoring. Includes ready-to-implement macstat-gpu view plus updates to macstat-x (GPU util) and macstat-mem (compression timing).
Test was using incorrect darwin.* prefixes; actual PMNS defines metrics without the prefix
Route GPU count through CLUSTER_GPU to reach fetch handler
Renames README.md files to CLAUDE.md to align with hierarchical project documentation system. Fixes path typos and adds cross-references.
Three critical constraints now prominently documented:
1. PCP is NOT installed locally - read pmns file instead of running pminfo
2. Git commit required before VM tests - VM clones repo, can't see uncommitted changes
3. Unit tests local, integration tests VM-only
Updated files:
- src/pmdas/darwin/Claude.md: Add constraint warning box at top
- .claude/skills/macos-qa-test/SKILL.md: Fix "When to Use", require commit first
- .claude/agents/macos-darwin-pmda-qa.md: Add git status check for uncommitted changes
- build/mac/CLAUDE.md: Add constraint box, clarify VM-only integration tests
- CLAUDE.md: Add Available Agents section, macOS constraints
Agent now refuses to run if uncommitted changes detected in darwin or build/mac dirs.
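A pre-flight check of this kind can be sketched with git status --porcelain restricted to the two directories named above. This is an illustrative sketch only: the function name check_clean and its messages are hypothetical, and the agent's actual check is not shown in this PR text.

```shell
# Hypothetical pre-flight check. The two pathspecs are the directories named
# in the commit message; everything else here is an assumption.
check_clean() {
    repo=$1
    # --porcelain prints one line per changed or untracked path, nothing if clean.
    dirty=$(git -C "$repo" status --porcelain -- src/pmdas/darwin build/mac) || return 2
    if [ -n "$dirty" ]; then
        echo "Refusing to run VM tests; uncommitted changes:"
        echo "$dirty"
        return 1
    fi
    echo "Working tree clean for darwin and build/mac"
}
```

Calling check_clean . from the repository root returns non-zero when either directory has modified or untracked files, so a wrapper script can stop before starting the VM.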
Root cause: hinv.ngpu registered at wrong cluster (4 vs 19) causing "Unknown metric" errors.
Changes:
- Fix PMNS cluster for hinv.ngpu (DARWIN:4:99 -> DARWIN:19:99)
- Add debug logging to gpu_iokit.c for IOKit enumeration failures
- Add debug logging to gpu.c for initialization/refresh tracking
- Update integration test to accept 0 GPUs as valid (VM environment)
- Improve value extraction regex in test for reliability
The VM environment (Tart/GitHub Actions) has no GPU hardware, so 0 GPUs is expected. Debug logs now surface in pmcd.log/darwin.log automatically on test failures.
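The "accept 0 GPUs" rule and a stricter value extraction could look roughly like the sketch below. It works on hard-coded sample text in the shape pminfo -f prints for a singular metric; the helper names and the sed expression are assumptions, not the regex used in the PR's test.

```shell
# Sample text in the shape "pminfo -f hinv.ngpu" prints for a singular metric.
# Hard-coded here so the sketch runs without PCP installed.
sample_fetch() {
    printf '%s\n' '' 'hinv.ngpu' '    value 0'
}

# Print the first integer that follows the word "value", or nothing at all.
# extract_value is a hypothetical helper name.
extract_value() {
    sed -n 's/^[[:space:]]*value[[:space:]]\{1,\}\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

ngpu=$(sample_fetch | extract_value)
case $ngpu in
    ''|*[!0-9]*) echo "FAIL: no numeric value for hinv.ngpu" ;;
    *)           echo "PASS: hinv.ngpu=$ngpu (0 is valid on a GPU-less VM)" ;;
esac
```

Anchoring the pattern on the whole line keeps error text from being mistaken for a value, and treating any non-negative integer as a pass is what lets the VM result of 0 through.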
VM environment has virtual GPU driver (IOAccelerator service) but no actual performance statistics. Test now validates that metrics exist and have correct structure, but accepts missing values as valid in VM context. Fixes bash arithmetic error when util_value is empty.
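One way to avoid that class of error is to test for an empty or non-numeric string before any integer comparison, since a comparison such as [ "$util_value" -ge 0 ] fails in bash with "integer expression expected" when the variable is empty. The sketch below is illustrative: validate_util is a hypothetical name, and the 0 to 100 range assumes utilization is reported as a percentage, which this PR text does not state.

```shell
# Hypothetical helper: returns 0 when the value is acceptable.
# Assumes GPU utilization is an integer percentage (not confirmed by the PR).
validate_util() {
    util_value=$1
    # Empty means the virtual GPU driver reported no statistics; accept it.
    if [ -z "$util_value" ]; then
        echo "no value (accepted in VM)"
        return 0
    fi
    # Reject anything that is not a plain non-negative integer
    # before it reaches an integer comparison.
    case $util_value in
        *[!0-9]*) echo "not an integer: $util_value"; return 1 ;;
    esac
    if [ "$util_value" -ge 0 ] && [ "$util_value" -le 100 ]; then
        echo "ok: $util_value"
        return 0
    fi
    echo "out of range: $util_value"
    return 1
}
```

With this shape, validate_util "" succeeds with a note, validate_util 42 succeeds, and validate_util 150 or validate_util abc fail without triggering a shell error.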
…purposes). Under a VM like these there's no GPU, so it'll be 0, but a valid value is still good.
Implements the final Wave 1 metrics via sysctl reads: mbuf clusters, max socket buffer size, socket listen backlog, and defunct socket calls. Follows established VFS pattern with dedicated ipc.c/h module wired into pmda.c refresh/fetch cycles.
The ipc metrics were defined but not linked to the root namespace, causing PMNS parsing failures during build.
Mark Category 7.2 complete. Wave 1 now fully implemented: 21 metrics across 14 clusters (202 total Darwin PMDA metrics).
Renamed Claude.md to CLAUDE.md for consistent capitalization. Added critical documentation on PMNS root namespace requirement and VFS-pattern template for adding new metric clusters to prevent "Disconnected subtree" build errors.
Expose 21 new Darwin PMDA metrics through operator-friendly pmrep views. Created macstat-gpu for GPU monitoring, added compression timing to macstat-mem, and quick GPU utilization to macstat-x. Why: Wave 1 metrics exist but lack discoverability - operators need views to use them effectively for performance troubleshooting.
Corrects copy-paste error where disk.apfs.container.bytes_written used PMID 93 instead of 91, causing metric descriptor mismatch.
Also updates research doc with Wave 3a completion status:
- 77 total Phase 2 metrics now complete (77% of target)
- Documents what could not be implemented (queue_depth, inflight, etc.)
- Notes IOKit API limitations for certain disk metrics
Documents need for automated validation to prevent PMID mismatches like the Wave 3a bug (bytes_written using PMID 93 instead of 91).
Task includes:
- Full specification of what to validate
- Implementation approach
- CI integration points
- Success criteria
Priority: HIGH - should be done before continuing Wave 3b/4.
Prevent PMID mismatch bugs by validating pmns ↔ metrics.c consistency.
Validates:
- Every PMID in pmns exists in metrics.c
- No duplicate PMIDs within clusters
- Cluster enum definitions resolve correctly
Catches bugs like Wave 3a disk.apfs.container.bytes_written mismatch (pmns:23:91 vs metrics.c:23:93) at build time instead of runtime.
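The first two checks can be sketched as a set comparison over "cluster:item" pairs. This is a simplified illustration, not the PR's validator: the two input functions are hypothetical stand-ins for parsing pmns and metrics.c, cluster enum resolution is assumed to have happened already, and the duplicate check is applied to the metrics.c list only.

```shell
# Hypothetical inputs: PMIDs already reduced to "cluster:item" pairs, one per
# line. The mismatch mirrors the Wave 3a bug (pmns 23:91 vs metrics.c 23:93).
pmns_pmids()    { printf '%s\n' 23:90 23:91 23:92; }
metrics_pmids() { printf '%s\n' 23:90 23:93 23:92; }

check_pmids() {
    errors=0
    defined=$(metrics_pmids)

    # The for loop runs in the current shell, so the counter survives.
    for pmid in $(pmns_pmids); do
        if ! printf '%s\n' "$defined" | grep -Fxq -- "$pmid"; then
            echo "ERROR: pmns PMID $pmid not found in metrics.c"
            errors=$((errors + 1))
        fi
    done

    # A PMID listed twice within a cluster shows up as a repeated line.
    dups=$(printf '%s\n' "$defined" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "ERROR: duplicate PMIDs: $dups"
        errors=$((errors + 1))
    fi

    [ "$errors" -eq 0 ]
}

check_pmids || echo "validator: FAILED"
```

With the sample data the script reports that 23:91 is missing from metrics.c and then prints "validator: FAILED", which is the build-time signal the commit describes.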
Run validator after build but before unit tests to catch PMID mismatches early in the build process.
Track per-process network connection counts by inspecting file descriptors and socket info via libproc. Extends FD enumeration to identify IPv4/IPv6 TCP and UDP sockets, enabling network activity monitoring at process granularity.
…vior
Desktop/laptop users need thermal visibility for performance diagnosis, especially on Apple Silicon where thermal management is opaque.
Implements 13 metrics via SMC and thermal pressure API:
- Temperature sensors (CPU/GPU die, package, ambient)
- Fan metrics (RPM, target, mode, min/max per-fan)
- Thermal pressure level/state (always available, no SMC required)
SMC access is community reverse-engineered (not Apple-supported). Code degrades gracefully when unavailable.
Raises Phase 2 high-value metric completion from 79% to ~85%.
The Darwin PMDA was failing to load with 'Undefined instance domain serial (9)' because FAN_INDOM was declared in the enum but not added to the indomtab array. This caused pmdaInit() to reject the entire PMDA, breaking all Darwin metrics. Add FAN_INDOM entry to indomtab with initial NULL instances (populated dynamically by thermal subsystem).
Critical learnings from thermal implementation:
- Instance domains must be added to both darwin.h enum AND pmda.c indomtab array
- Missing indomtab entry causes 'Undefined instance domain serial' error
- This is the standard PMDA pattern (not Darwin-specific)
- Document direct manipulation of it_set/it_numinst (no helper function exists)
Wave 3b thermal monitoring was implemented in commits a89dace and b15b4a3 but the research document was never updated, creating confusion about project status.
Updates:
- Mark Wave 3b complete with 13 thermal metrics
- Update total metrics from 79 to 92
- Update Wave 3 total from 32 to 45 metrics
- Update Category 1 (Thermal) to Complete status (13/15)
- Update Category 6 (Disk) to Complete status (30/30)
- Update overall completion to ~85% (92/99 metrics)
- Mark all pmrep views as Ready (thermal, power, gpu all unblocked)
Completes Category 11 (pmrep views) for Darwin PMDA Phase 2. New views expose the power/battery and thermal metrics added in earlier waves to end users via simple pmrep commands.
Merge conflict left CLUSTER_LIMITS with duplicate comment "18", causing all subsequent cluster comments to be off by one. Actual enum values were correct (C auto-increments), but documentation was misleading for anyone reading the code.
Merge conflict fallout caused systematic cluster numbering errors where kernel.limits metrics occupied cluster 18 (should be 19), pushing all subsequent clusters off by one. Runtime metric fetches failed with "Requested metric not defined" errors.
Additionally, the PMID consistency validator had a critical bug where error counting happened in a subshell (pipe to while loop), causing it to always report PASSED even when detecting errors.
Fixes:
- pmns: Update cluster numbers 18-24 to 19-25 for limits/gpu/ipc/power/ipv6/apfs/thermal
- pmns: Keep LOGIN metrics (nusers/nroots/nsessions) at cluster 18
- test-pmid-consistency.sh: Replace pipe-to-while with heredoc to fix error counting
Validator now correctly fails when PMIDs mismatch, catching these issues early.
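The subshell pitfall is easy to reproduce in isolation. The sketch below uses made-up validator output and hypothetical function names; it is not the code from test-pmid-consistency.sh, only a small demonstration of why the pipe form loses the counter and the heredoc form keeps it.

```shell
# Made-up validator output: one line per check, "ERROR" marks a failure.
sample_results() {
    printf '%s\n' 'OK 23:90' 'ERROR 23:91 mismatch' 'ERROR 4:99 unknown cluster'
}

# Buggy shape: the while loop is the right-hand side of a pipe, so bash
# (default settings) and dash run it in a subshell and the increments are lost.
count_errors_pipe() {
    errors=0
    sample_results | while read -r status _rest; do
        if [ "$status" = ERROR ]; then errors=$((errors + 1)); fi
    done
    echo "$errors"
}

# Fixed shape: the loop reads from a heredoc, so it runs in the current
# shell and the counter is still set after the loop ends.
count_errors_heredoc() {
    errors=0
    while read -r status _rest; do
        if [ "$status" = ERROR ]; then errors=$((errors + 1)); fi
    done <<EOF
$(sample_results)
EOF
    echo "$errors"
}

echo "pipe:    $(count_errors_pipe)"
echo "heredoc: $(count_errors_heredoc)"
```

Under bash with default settings the pipe variant reports 0 errors while the heredoc variant reports 2, which is how a validator written the first way can print PASSED regardless of what it found.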
PCP doesn't recognize mAh as a unit, causing PM_ERR_CONV errors. Metrics display raw values in milliamp-hours.
Prevent PM_ERR_CONV errors by documenting which units PCP recognizes.
Wave 4 Deferred
Wave 4 optional metrics have been split out for future consideration in Issue #2484. This PR delivers Waves 1-3 (92 metrics) - the core Phase 2 value.
natoscott left a comment
Generally Claude's doing a good job - quite a lot of new code though, I'm hoping you are reviewing it too as you go. A few small nits in the comments.
Yes, I've been reviewing, but I'm no C or PCP expert so I'm not sure I'm really any better than Claude anyway. I've set up tight rules/expectations but it's certainly not perfect sometimes. I do review changes and I try to pre-review the PRs as a whole before submission too. But any guidance you can give Claude and me will be useful!
Great - just want to make sure I'm not the only one reviewing the code torrent. :)
Strip darwin. prefix from metric name comments (they're relative to the PMDA namespace, not the full PCP path). Remove personal dev setup notes from public CLAUDE.md; clarify git add is sufficient before VM tests. Purge interim research/planning docs from the branch.
Summary
Phased expansion of Darwin PMDA with ~100 additional metrics for Apple Silicon Macs, focused on system observability, thermal monitoring, and storage analytics.
Current Progress: 92/100 metrics implemented (~92% complete) 🎯
✅ Wave 1: Quick Wins (COMPLETE)
15 metrics - Low complexity, high value additions
- kernel.limits.* for maxproc, maxfiles, vnodes
- proc.io.logical_writes, proc.memory.footprint
Commits: 42f2708, fde9fed, 91c1cb3, abfab40, 20c9d64
✅ Wave 2: Medium Effort (COMPLETE)
32 metrics - Medium complexity, production-grade monitoring
Commits: 0283412, 11d49b8, f0ce125, 68d095a
✅ Wave 3: Higher Effort (COMPLETE - 45/~45 metrics)
Wave 3a: Disk & APFS Statistics (30 metrics)
Extended Disk I/O Metrics (16 metrics)
Per-device and aggregate metrics from IOBlockStorageDriver:
- disk.{dev,all}.{read,write}_errors
- disk.{dev,all}.{read,write}_retries
- disk.{dev,all}.total_{read,write}_time (nanoseconds)
- disk.{dev,all}.avgrq_sz (avg request size), disk.{dev,all}.await (avg wait time)
APFS Statistics (14 metrics)
Container and volume metrics via IOKit:
- disk.apfs.{ncontainer,nvolume}
Implementation Notes:
- queue_depth, inflight, util NOT implemented (IOKit doesn't expose these)
Commits: afdf044, 8f2c3a4
Wave 3b: Thermal Monitoring (13 metrics)
SMC-based thermal and fan monitoring with graceful degradation:
- thermal.cpu.die, thermal.cpu.proximity, thermal.gpu.die, thermal.package, thermal.ambient
- hinv.nfan, thermal.fan.{speed,target,mode,min,max} (per-fan instance domain)
- thermal.pressure.level, thermal.pressure.state
Platform behavior:
- hinv.nfan=0 (MacBook Air M1/M2, Mac mini M1/M2, Mac Studio base)
Commits: a89dace, b15b4a3
Wave 3c: Process Network Connections (2 metrics)
Per-process TCP/UDP socket counts:
- proc.net.tcp_count, proc.net.udp_count via PROC_PIDFDSOCKETINFO enumeration
Commit: 55c1740
🔲 Wave 4: Optional/Specialized (DEFERRED)
Wave 4 has been deferred to Issue #2484 for future consideration.
Wave 4 scope includes ~22 metrics across Device Enumeration, Power Consumption (requires root), Scheduler Counters, and Advanced Network statistics. These were deprioritized as they represent specialized/low-value use cases or are blocked by entitlement requirements.
See: Issue #2484 - Darwin PMDA Phase 2 Wave 4
Technical Highlights
New Clusters Added
- CLUSTER_GPU (19): GPU device statistics
- CLUSTER_IPC (20): IPC resource limits
- CLUSTER_POWER (21): Battery & power management
- CLUSTER_APFS (23): APFS filesystem statistics
- CLUSTER_THERMAL (24): SMC thermal & fan monitoring
New Instance Domains
- GPU_INDOM: Per-GPU device metrics
- APFS_CONTAINER_INDOM: Per-APFS-container metrics
- APFS_VOLUME_INDOM: Per-APFS-volume metrics
- FAN_INDOM: Per-fan thermal metrics
Integration Test Coverage
Architectural Patterns
Current Metrics Totals
Documentation
See darwin-pmda-phase2-research.md for:
Tracking: Issue #2465