feat(analyzer): integrate dftracer-utils C++ aggregation pipeline with transparent indexing #58

Open
rayandrew wants to merge 3 commits into llnl:develop from rayandrew:perf/dftracer-utils-integration

@rayandrew
Collaborator

Summary

Integrates the dftracer-utils C++ aggregation pipeline into dfanalyzer as the
default trace-reading path, behind a single transparent analyze_trace call.
Indexing now happens automatically — callers no longer run a separate index
build step. Verified to produce byte-identical aggregated output to the
pre-existing (Dask/pandas) analyzer.

Depends on dftracer-utils@feat/dfanalyzer-integration.

Motivation

Previously the artifact pipeline had to call an index-build step outside the
normal analyzer flow, which was error-prone (easy to forget) and leaked
indexing concerns into every caller. Separately, the new C++ aggregator path
had correctness gaps versus the old analyzer — event/profile/system counts and
per-row metrics did not line up 1-to-1.

Changes

Transparent single-call API

  • DFTracerAnalyzer.analyze_trace(trace_path) now ensures the index exists
    (building it via Dask if missing) before analysis — no separate call needed.
  • New optional kwarg on DFTracerAnalyzer only: trace_groups (selective
    loading from a dftracer_organize output dir). The base Analyzer and
    AnalyzerConfig are untouched, so there is no API change for Darshan/Recorder.
  • build_index_distributed trimmed to 6 essential params; Dask plugin
    registration is now automatic and idempotent.
  • _index_path_for resolves the index location next to the traces
    (<dir>/.dftindex), handling directory / file / glob inputs.
  • Node-local SST staging is derived from each Dask worker's own
    local_directory — no separate knob to configure.
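
For reviewers, the index-location rule above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `index_path_for` is a hypothetical stand-in for `_index_path_for`, and the glob handling (walking up to the first wildcard-free parent) is an assumption about how "next to the traces" is interpreted.

```python
from pathlib import Path

INDEX_DIRNAME = ".dftindex"  # name taken from the description above


def _has_wildcard(text: str) -> bool:
    """True if the string contains a shell-style glob character."""
    return any(ch in text for ch in "*?[")


def index_path_for(trace_path: str) -> Path:
    """Hypothetical stand-in for _index_path_for (illustration only)."""
    base = Path(trace_path)
    if _has_wildcard(trace_path):
        # Glob input: walk up until no path component contains a wildcard.
        while _has_wildcard(str(base)):
            base = base.parent
    elif not base.is_dir():
        # File input (or a path that does not exist yet): use its parent.
        base = base.parent
    # Directory input falls through unchanged: the index lives inside it.
    return base / INDEX_DIRNAME
```

Under these assumptions, `index_path_for("traces/*.pfw.gz")` and `index_path_for("traces/a.pfw.gz")` both resolve to `traces/.dftindex`.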

Correctness

  • System metrics: surfaced through the C++ path
    (scan_system_metrics_buffer), standardized via _standardize_system.
  • Profiles: counter (ph="C") events bucket-align to the period they
    summarize; time_start/time_end follow profile_time_granularity.
  • dtypes: _coerce_profile_dtypes normalizes profile output to
    PROFILE_OUTPUT_COLUMNS (nullable <NA> instead of float NaN /
    empty strings); missing columns are filled.
  • time_range is relative to the trace origin (fixes absolute
    epoch-bucket indices on real traces).
  • cat is lowercased on read, matching the legacy analyzer.
  • Stale index: an index built at one aggregation interval is dropped and
    rebuilt when a different time_granularity is requested.
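
To make the time_range and cat items concrete, here is a minimal sketch of origin-relative bucketing. It assumes timestamps are integer microseconds and that the trace origin is the smallest timestamp seen; both are assumptions for illustration, and `time_range_bucket` is a hypothetical helper rather than a function from this PR.

```python
def time_range_bucket(ts_us: int, origin_us: int, granularity_us: int) -> int:
    """Bucket index of a timestamp relative to the trace origin."""
    return (ts_us - origin_us) // granularity_us


# Illustrative events with absolute epoch timestamps (assumed microseconds).
events = [
    {"cat": "POSIX", "ts": 1_700_000_000_250_000},
    {"cat": "posix", "ts": 1_700_000_001_750_000},
    {"cat": "STDIO", "ts": 1_700_000_004_000_000},
]

origin = min(e["ts"] for e in events)  # assumed definition of the origin
granularity = 1_000_000  # 1 s buckets, chosen only for this example

rows = [
    {
        "cat": e["cat"].lower(),  # lowercase on read, as described above
        "time_range": time_range_bucket(e["ts"], origin, granularity),
    }
    for e in events
]

for row in rows:
    print(row)
```

With these inputs the buckets come out as 0, 1, and 3, whereas dividing the absolute timestamp by the granularity would give indices around 1,700,000,000, which is the kind of epoch-bucket index the fix avoids.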

Verification

Old vs new compared in isolated venvs on synthetic traces: all 30 aggregated
groups match on every metric (count, time, size, time_min/max,
size_min/max, offset_min/max) with zero normalization — the new path is
byte-for-byte equal to the old analyzer's aggregated output.
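
The comparison itself is conceptually simple. A minimal sketch of an exact, per-metric diff is shown below; the metric names come from the list above, while `diff_groups`, the dict-of-dicts layout, and the sample group key are hypothetical and not the actual verification script.

```python
METRICS = (
    "count", "time", "size",
    "time_min", "time_max",
    "size_min", "size_max",
    "offset_min", "offset_max",
)


def diff_groups(old, new):
    """Return (group_key, metric, old_value, new_value) for every mismatch."""
    mismatches = []
    for key in sorted(set(old) | set(new)):
        for metric in METRICS:
            a = old.get(key, {}).get(metric)
            b = new.get(key, {}).get(metric)
            if a != b:  # exact comparison: no tolerance, no normalization
                mismatches.append((key, metric, a, b))
    return mismatches


# Hypothetical aggregated output keyed by (cat, name, time_range).
old = {("posix", "read", 0): dict.fromkeys(METRICS, 1)}
new = {("posix", "read", 0): dict.fromkeys(METRICS, 1)}

mismatches = diff_groups(old, new)
print("mismatches:", mismatches)
```

A group that exists on only one side shows up as a mismatch on every metric, so missing groups are not silently ignored.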

Tests

  • --forked is now the default (pytest-forked): each test runs in its own
    process. Required because the C++ aggregator's string-intern dictionary is a
    process-global static; without per-process isolation, indexes built across
    tests corrupt each other.
  • Test fixtures converted to gzipped .pfw.gz (the C++ indexer requires
    gzipped input).
  • New tests/test_distributed_index.py; updated profile/system expectations.
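
For anyone adding fixtures, writing a gzipped JSON Lines trace looks roughly like the sketch below. The helper names and the event fields are illustrative assumptions; the real fixtures in tests/ may include additional fields or framing that this example omits.

```python
import gzip
import json
from pathlib import Path


def write_pfw_gz(path, events):
    """Write events as gzipped JSON Lines, one event per line."""
    path = Path(path)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
    return path


def read_pfw_gz(path):
    """Read the events back; handy for asserting on fixture contents."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Illustrative events only; the field names are assumptions for this example.
SAMPLE_EVENTS = [
    {"name": "read", "cat": "POSIX", "ph": "X", "ts": 100, "dur": 5, "pid": 1, "tid": 1},
    {"name": "write", "cat": "POSIX", "ph": "X", "ts": 200, "dur": 7, "pid": 1, "tid": 1},
]
```

In a test this would typically be called with pytest's `tmp_path`, for example `write_pfw_gz(tmp_path / "trace.pfw.gz", SAMPLE_EVENTS)`.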

Contributor

Copilot AI left a comment

Pull request overview

Integrates the dftracer-utils C++ indexing/aggregation path into DFTracerAnalyzer so callers can use a single analyze_trace() entrypoint with automatic index creation/refresh, and updates tests/fixtures accordingly (gzipped JSONL .pfw.gz, new distributed index test, updated profile/system expectations).

Changes:

  • Added transparent index ensuring (and stale-index handling) to the DFTracer analyzer, plus a Dask-distributed scan path that keeps aggregated data on workers.
  • Updated hybrid profile and system-metrics tests to write gzipped JSONL traces and adjusted time-bucketing expectations.
  • Updated packaging/testing configuration (Python version floor, dftracer-utils source, pytest --forked default) and added an end-to-end distributed index test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| python/dftracer/analyzer/dftracer.py | Reworks DFTracer read/analyze path around dftracer-utils C++ index + Arrow IPC + distributed worker scanning and partial aggregation. |
| python/dftracer/analyzer/config.py | Adds trace_groups config to selectively load subsets from dftracer_organize outputs. |
| tests/test_system_metrics.py | Switches synthetic trace fixture writing to gzipped JSONL .pfw.gz. |
| tests/test_hybrid_profiles.py | Migrates fixtures to .pfw.gz JSONL and updates time_range assertions to be origin-relative. |
| tests/test_distributed_index.py | Adds LocalCluster end-to-end tests for distributed index building and subsequent analyze flow. |
| pyproject.toml | Updates Python requirement, dftracer-utils dependency source, adds pytest --forked default and dev extras. |

Comment thread python/dftracer/analyzer/dftracer.py
Comment thread python/dftracer/analyzer/dftracer.py Outdated
Comment thread python/dftracer/analyzer/dftracer.py Outdated
Comment thread python/dftracer/analyzer/dftracer.py Outdated
Comment thread python/dftracer/analyzer/dftracer.py Outdated
Comment thread python/dftracer/analyzer/dftracer.py Outdated
Comment thread python/dftracer/analyzer/dftracer.py
Comment thread pyproject.toml
Comment thread pyproject.toml
@rayandrew force-pushed the perf/dftracer-utils-integration branch from 4149e53 to 3f64ff0 on May 20, 2026 15:00
rayandrew added 2 commits May 20, 2026 10:00
- Add trace_groups config to filter trace files by group when using dftracer_organize output
- Implement distributed index building via Dask with automatic worker plugin registration
- Replace legacy read_trace with C++ aggregation pipeline for better performance
- Add Arrow-backed data processing with native pandas dtype coercion
- Support per-worker IPC data exchange for distributed HLM computation
- Enable transparent indexing on first analyze_trace call
@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 7.04607% with 686 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.37%. Comparing base (214f8ac) to head (97d5c7f).
⚠️ Report is 16 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| python/dftracer/analyzer/dftracer.py | 6.91% | 686 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           develop      #58       +/-   ##
============================================
- Coverage    58.45%   26.37%   -32.09%     
============================================
  Files           27       27               
  Lines         2903     3667      +764     
============================================
- Hits          1697      967      -730     
- Misses        1206     2700     +1494     

@rayandrew
Collaborator Author

This is done, @hariharan-devarajan and @izzet.
Please don't merge yet, since I need to make a tagged release in dftracer-utils first.

Please review it if you have time.

Member

@hariharan-devarajan hariharan-devarajan left a comment

Let's release utils and change to that tag before merging.

Comment thread pyproject.toml
"strenum>=0.4",
"structlog>=25",
"dftracer-utils>=0.0.5",
"dftracer-utils @ git+https://github.com/rayandrew/dftracer-utils.git@feat/dfanalyzer-integration",
Member

Let's release utils before we merge
