feat(analyzer): integrate dftracer-utils C++ aggregation pipeline with transparent indexing by rayandrew · Pull Request #58 · llnl/dfanalyzer

rayandrew · 2026-05-20T13:49:57Z

Summary

Integrates the dftracer-utils C++ aggregation pipeline into dfanalyzer as the
default trace-reading path, behind a single transparent analyze_trace call.
Indexing now happens automatically, so callers no longer run a separate index
build step. Verified on synthetic traces to produce byte-identical aggregated
output to the pre-existing (Dask/pandas) analyzer.

Depends on dftracer-utils@feat/dfanalyzer-integration.

Motivation

Previously the artifact pipeline had to call an index-build step outside the
normal analyzer flow, which was error-prone (easy to forget) and leaked
indexing concerns into every caller. Separately, the new C++ aggregator path
had correctness gaps versus the old analyzer — event/profile/system counts and
per-row metrics did not line up 1-to-1.

Changes

Transparent single-call API

- DFTracerAnalyzer.analyze_trace(trace_path) now ensures the index exists
  (building it via Dask if missing) before analysis, so no separate call is
  needed.
- New optional kwarg on DFTracerAnalyzer only: trace_groups (selective
  loading from a dftracer_organize output dir). The base Analyzer and
  AnalyzerConfig are untouched, so there is no API change for Darshan/Recorder.
- build_index_distributed is trimmed to 6 essential params; Dask plugin
  registration is now automatic and idempotent.
- _index_path_for resolves the index location next to the traces
  (<dir>/.dftindex), handling directory / file / glob inputs.
- Node-local SST staging is derived from each Dask worker's own
  local_directory, so there is no separate knob to configure.
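
The index-resolution and build-if-missing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `index_path_for`, `ensure_index`, and the `build` callback are hypothetical stand-ins, and only the `.dftindex` name comes from the description.

```python
from pathlib import Path
from typing import Callable

INDEX_DIRNAME = ".dftindex"  # location named in the PR description
_GLOB_CHARS = "*?["


def index_path_for(trace_path: str) -> Path:
    """Hypothetical stand-in for _index_path_for: keep the index beside the traces."""
    path = Path(trace_path)
    if any(ch in trace_path for ch in _GLOB_CHARS):
        # Glob input: climb until no wildcard remains in the path.
        base = path.parent
        while any(ch in str(base) for ch in _GLOB_CHARS):
            base = base.parent
        return base / INDEX_DIRNAME
    if path.is_dir():
        # Directory input: the index lives inside the directory.
        return path / INDEX_DIRNAME
    # Single-file input: the index lives next to the file.
    return path.parent / INDEX_DIRNAME


def ensure_index(trace_path: str, build: Callable[[str, Path], None]) -> Path:
    """Call the (assumed) builder only when the index is missing."""
    index_path = index_path_for(trace_path)
    if not index_path.exists():
        build(trace_path, index_path)
    return index_path
```

Passing the builder in as a callback keeps the sketch independent of Dask; in the PR the build step is distributed.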

Correctness

- System metrics: surfaced through the C++ path
  (scan_system_metrics_buffer), standardized via _standardize_system.
- Profiles: counter (ph="C") events bucket-align to the period they
  summarize; time_start/time_end follow profile_time_granularity.
- dtypes: _coerce_profile_dtypes normalizes profile output to
  PROFILE_OUTPUT_COLUMNS (nullable <NA> instead of float NaN /
  empty strings); missing columns are filled.
- time_range is now relative to the trace origin; previously, real traces
  produced absolute epoch-bucket indices.
- cat is lowercased on read, matching the legacy analyzer.
- Stale index: an index built at one aggregation interval is dropped and
  rebuilt when a different time_granularity is requested.
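
To make the time_range and cat fixes concrete, here is a small sketch of origin-relative bucketing. It assumes timestamps are in microseconds and that the trace origin is the earliest `ts` in the trace; both are assumptions for illustration, and `normalize_events` is a hypothetical helper, not a function from this PR.

```python
from typing import Dict, Iterable, List


def normalize_events(events: Iterable[dict], granularity_us: int) -> List[Dict]:
    """Lowercase `cat` and compute a time_range bucket relative to the trace origin."""
    events = list(events)
    if not events:
        return []
    # Assumption: the trace origin is the smallest timestamp seen.
    origin_us = min(ev["ts"] for ev in events)
    rows = []
    for ev in events:
        row = dict(ev)
        row["cat"] = ev["cat"].lower()
        # Relative bucket index; an absolute index would be ev["ts"] // granularity_us.
        row["time_range"] = (ev["ts"] - origin_us) // granularity_us
        rows.append(row)
    return rows


sample = [
    {"cat": "POSIX", "ts": 1_700_000_000_000_000},
    {"cat": "Stdio", "ts": 1_700_000_002_500_000},
]
rows = normalize_events(sample, granularity_us=1_000_000)
```

With a one-second granularity the two sample events land in buckets 0 and 2, rather than in epoch-sized indices such as 1700000002.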

Verification

Old vs new compared in isolated venvs on synthetic traces: all 30 aggregated
groups match on every metric (count, time, size, time_min/max,
size_min/max, offset_min/max) with zero normalization — the new path is
byte-for-byte equal to the old analyzer's aggregated output.
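
A comparison of this kind could look roughly like the sketch below. The metric names are the ones listed above; the dict-of-dicts layout, the group keys, and the `diff_groups` helper are assumptions made for illustration and are not the harness used for this PR.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

METRICS = (
    "count", "time", "size",
    "time_min", "time_max",
    "size_min", "size_max",
    "offset_min", "offset_max",
)

Groups = Dict[Hashable, Dict[str, object]]


def diff_groups(old: Groups, new: Groups,
                metrics: Sequence[str] = METRICS) -> List[Tuple]:
    """Return (group, metric, old_value, new_value) for every exact mismatch."""
    mismatches = []
    for key in sorted(set(old) | set(new)):
        if key not in old or key not in new:
            mismatches.append((key, "<missing group>", old.get(key), new.get(key)))
            continue
        for metric in metrics:
            old_value = old[key].get(metric)
            new_value = new[key].get(metric)
            # Exact equality on purpose: no rounding or tolerance is applied.
            if old_value != new_value:
                mismatches.append((key, metric, old_value, new_value))
    return mismatches
```

An empty result means every group matches on every metric with no normalization step in between.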

Tests

- --forked is now the default (pytest-forked): each test runs in its own
  process. Required because the C++ aggregator's string-intern dictionary is a
  process-global static; without per-process isolation, indexes built across
  tests corrupt each other.
- Test fixtures converted to gzipped .pfw.gz (the C++ indexer requires
  gzipped input).
- New tests/test_distributed_index.py; updated profile/system expectations.
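
For reference, a gzipped JSONL fixture of the kind described above can be written with the standard library alone. This is a sketch, not the fixture code from the test suite: the helper names are hypothetical and the event fields follow the general Chrome trace event layout purely as an illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterable, List

# Illustrative events only; real fixtures may carry more fields.
EVENTS = [
    {"name": "read", "cat": "POSIX", "ph": "X", "ts": 0, "dur": 12, "pid": 1, "tid": 1},
    {"name": "write", "cat": "POSIX", "ph": "X", "ts": 20, "dur": 7, "pid": 1, "tid": 1},
]


def write_pfw_gz(path: Path, events: Iterable[dict]) -> Path:
    """Write one JSON object per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
    return path


def read_pfw_gz(path: Path) -> List[dict]:
    """Read the fixture back, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Round trip in a throwaway directory to show the fixture is readable.
with tempfile.TemporaryDirectory() as tmp:
    fixture = write_pfw_gz(Path(tmp) / "trace.pfw.gz", EVENTS)
    round_tripped = read_pfw_gz(fixture)
```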

Copilot

Pull request overview

Integrates the dftracer-utils C++ indexing/aggregation path into DFTracerAnalyzer so callers can use a single analyze_trace() entrypoint with automatic index creation/refresh, and updates tests/fixtures accordingly (gzipped JSONL .pfw.gz, new distributed index test, updated profile/system expectations).

Changes:

Added transparent index ensuring (and stale-index handling) to the DFTracer analyzer, plus a Dask-distributed scan path that keeps aggregated data on workers.
Updated hybrid profile and system-metrics tests to write gzipped JSONL traces and adjusted time-bucketing expectations.
Updated packaging/testing configuration (Python version floor, dftracer-utils source, pytest --forked default) and added an end-to-end distributed index test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `python/dftracer/analyzer/dftracer.py` | Reworks DFTracer read/analyze path around dftracer-utils C++ index + Arrow IPC + distributed worker scanning and partial aggregation. |
| `python/dftracer/analyzer/config.py` | Adds `trace_groups` config to selectively load subsets from `dftracer_organize` outputs. |
| `tests/test_system_metrics.py` | Switches synthetic trace fixture writing to gzipped JSONL `.pfw.gz`. |
| `tests/test_hybrid_profiles.py` | Migrates fixtures to `.pfw.gz` JSONL and updates time_range assertions to be origin-relative. |
| `tests/test_distributed_index.py` | Adds LocalCluster end-to-end tests for distributed index building and subsequent analyze flow. |
| `pyproject.toml` | Updates Python requirement, dftracer-utils dependency source, adds pytest `--forked` default and dev extras. |

- Add trace_groups config to filter trace files by group when using dftracer_organize output
- Implement distributed index building via Dask with automatic worker plugin registration
- Replace legacy read_trace with C++ aggregation pipeline for better performance
- Add Arrow-backed data processing with native pandas dtype coercion
- Support per-worker IPC data exchange for distributed HLM computation
- Enable transparent indexing on first analyze_trace call

codecov · 2026-05-20T16:59:15Z

Codecov Report

❌ Patch coverage is 7.04607% with 686 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.37%. Comparing base (214f8ac) to head (97d5c7f).
⚠️ Report is 16 commits behind head on develop.

Files with missing lines	Patch %	Lines
python/dftracer/analyzer/dftracer.py	6.91%	686 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           develop      #58       +/-   ##
============================================
- Coverage    58.45%   26.37%   -32.09%     
============================================
  Files           27       27               
  Lines         2903     3667      +764     
============================================
- Hits          1697      967      -730     
- Misses        1206     2700     +1494

rayandrew · 2026-05-21T00:04:18Z

This is done, @hariharan-devarajan and @izzet.
Don't merge yet, since I need to make a tagged release in dftracer-utils first.

Please review it if you have time.

hariharan-devarajan

Let's release utils and change to that tag before merging.

hariharan-devarajan · 2026-05-21T00:05:38Z

    "strenum>=0.4",
    "structlog>=25",
-    "dftracer-utils>=0.0.5",
+    "dftracer-utils @ git+https://github.com/rayandrew/dftracer-utils.git@feat/dfanalyzer-integration",


Let's release utils before we merge

rayandrew requested a review from Copilot May 20, 2026 13:50

Copilot started reviewing on behalf of rayandrew May 20, 2026 13:50

Copilot AI reviewed May 20, 2026

rayandrew force-pushed the perf/dftracer-utils-integration branch from 4149e53 to 3f64ff0 Compare May 20, 2026 15:00

rayandrew added 2 commits May 20, 2026 10:00

test(data): update dftracer-system tarball with new test data

97d5c7f

rayandrew requested review from hariharan-devarajan and izzet May 21, 2026 00:03

hariharan-devarajan requested changes May 21, 2026

refactor(analyzer): move utility functions to dfanalyzer module

ce7b4b0

rayandrew wants to merge 3 commits into llnl:develop from rayandrew:perf/dftracer-utils-integration

3 participants
