feat(analyzer): integrate dftracer-utils C++ aggregation pipeline with transparent indexing #58
Pull request overview
Integrates the dftracer-utils C++ indexing/aggregation path into DFTracerAnalyzer so callers can use a single analyze_trace() entrypoint with automatic index creation/refresh, and updates tests/fixtures accordingly (gzipped JSONL .pfw.gz, new distributed index test, updated profile/system expectations).
Changes:
- Added transparent index creation (with stale-index handling) to the DFTracer analyzer, plus a Dask-distributed scan path that keeps aggregated data on workers.
- Updated hybrid profile and system-metrics tests to write gzipped JSONL traces and adjusted time-bucketing expectations.
- Updated packaging/testing configuration (Python version floor, dftracer-utils source, pytest `--forked` default) and added an end-to-end distributed index test.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| python/dftracer/analyzer/dftracer.py | Reworks DFTracer read/analyze path around dftracer-utils C++ index + Arrow IPC + distributed worker scanning and partial aggregation. |
| python/dftracer/analyzer/config.py | Adds trace_groups config to selectively load subsets from dftracer_organize outputs. |
| tests/test_system_metrics.py | Switches synthetic trace fixture writing to gzipped JSONL .pfw.gz. |
| tests/test_hybrid_profiles.py | Migrates fixtures to .pfw.gz JSONL and updates time_range assertions to be origin-relative. |
| tests/test_distributed_index.py | Adds LocalCluster end-to-end tests for distributed index building and subsequent analyze flow. |
| pyproject.toml | Updates Python requirement, dftracer-utils dependency source, adds pytest --forked default and dev extras. |
4149e53 to 3f64ff0
- Add trace_groups config to filter trace files by group when using dftracer_organize output
- Implement distributed index building via Dask with automatic worker plugin registration
- Replace legacy read_trace with C++ aggregation pipeline for better performance
- Add Arrow-backed data processing with native pandas dtype coercion
- Support per-worker IPC data exchange for distributed HLM computation
- Enable transparent indexing on first analyze_trace call
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop      #58       +/-   ##
============================================
- Coverage    58.45%   26.37%   -32.09%
============================================
  Files           27       27
  Lines         2903     3667     +764
============================================
- Hits          1697      967     -730
- Misses        1206     2700    +1494
This is done. @hariharan-devarajan and @izzet, please review it if you guys have time.
hariharan-devarajan left a comment
Let's release utils and change to that tag before merging.
      "strenum>=0.4",
      "structlog>=25",
-     "dftracer-utils>=0.0.5",
+     "dftracer-utils @ git+https://github.com/rayandrew/dftracer-utils.git@feat/dfanalyzer-integration",
Let's release utils before we merge
Summary
Integrates the dftracer-utils C++ aggregation pipeline into dfanalyzer as the
default trace-reading path, behind a single transparent `analyze_trace` call.
Indexing now happens automatically: callers no longer run a separate index
build step. Verified to produce byte-identical aggregated output to the
pre-existing (Dask/pandas) analyzer.
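The automatic indexing described here can be pictured as an ensure-then-analyze wrapper. The sketch below is illustrative only: `index_path_for`, `ensure_index`, and `fake_build_index` are hypothetical names, not the functions in this PR, and it assumes the index is a `.dftindex` entry placed next to the traces, as the description states further down. Whether the real index is a file or a directory is not specified, so a directory is used purely as a marker.

```python
import os
import tempfile
from typing import Callable, Tuple


def index_path_for(trace_path: str) -> str:
    """Hypothetical helper: put the index next to the traces as <dir>/.dftindex."""
    if os.path.isdir(trace_path):
        base_dir = trace_path
    else:
        # For a single file or a glob pattern, use the containing directory.
        base_dir = os.path.dirname(trace_path) or "."
    return os.path.join(base_dir, ".dftindex")


def ensure_index(
    trace_path: str, build_index: Callable[[str, str], None]
) -> Tuple[str, bool]:
    """Build the index only when it is missing; return (index_path, was_built)."""
    index_path = index_path_for(trace_path)
    if os.path.exists(index_path):
        return index_path, False
    build_index(trace_path, index_path)
    return index_path, True


def fake_build_index(trace_path: str, index_path: str) -> None:
    """Stand-in for the real Dask-driven build; it only creates a marker."""
    os.makedirs(index_path, exist_ok=True)


with tempfile.TemporaryDirectory() as trace_dir:
    pattern = os.path.join(trace_dir, "*.pfw.gz")
    first = ensure_index(pattern, fake_build_index)     # glob input, builds
    second = ensure_index(trace_dir, fake_build_index)  # directory input, reuses
    results = (first[1], second[1], first[0] == second[0])
```

Because the check lives inside the entrypoint, a caller cannot forget the build step, and a second call with a different input form for the same directory reuses the existing index.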
Depends on `dftracer-utils@feat/dfanalyzer-integration`.
Motivation
Previously the artifact pipeline had to call an index-build step outside the
normal analyzer flow, which was error-prone (easy to forget) and leaked
indexing concerns into every caller. Separately, the new C++ aggregator path
had correctness gaps versus the old analyzer — event/profile/system counts and
per-row metrics did not line up 1-to-1.
Changes
Transparent single-call API
- `DFTracerAnalyzer.analyze_trace(trace_path)` now ensures the index exists (building it via Dask if missing) before analysis; no separate call needed.
- [...] `DFTracerAnalyzer` only: `trace_groups` (selective loading from a `dftracer_organize` output dir). The base `Analyzer` and `AnalyzerConfig` are untouched, so there is no API change for `Darshan`/`Recorder`.
- `build_index_distributed` trimmed to 6 essential params; Dask plugin registration is now automatic and idempotent.
- `_index_path_for` resolves the index location next to the traces (`<dir>/.dftindex`), handling directory / file / glob inputs.
- [...] `local_directory`; no separate knob to configure.
Correctness
- [...] (`scan_system_metrics_buffer`), standardized via `_standardize_system`.
- [...] (`ph="C"`) events bucket-align to the period they summarize; `time_start`/`time_end` follow `profile_time_granularity`.
- `_coerce_profile_dtypes` normalizes profile output to `PROFILE_OUTPUT_COLUMNS` (nullable `<NA>` instead of float `NaN`/empty strings); missing columns are filled.
- `time_range` is relative to the trace origin (fixes absolute epoch-bucket indices on real traces).
- `cat` is lowercased on read, matching the legacy analyzer.
- [...] rebuilt when a different `time_granularity` is requested.
Verification
Old vs new compared in isolated venvs on synthetic traces: all 30 aggregated
groups match on every metric (`count`, `time`, `size`, `time_min/max`,
`size_min/max`, `offset_min/max`) with zero normalization. The new path is
byte-for-byte equal to the old analyzer's aggregated output.
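A comparison of this kind can be sketched as an exact, tolerance-free diff over grouped rows. The snippet below is an assumption-laden illustration, not the script used for this PR: `diff_aggregates` and `make_row` are hypothetical helpers, the grouping keys `cat` and `func_name` are placeholders, and the rows are toy data rather than real analyzer output. Only the metric names come from the list above.

```python
METRICS = (
    "count", "time", "size",
    "time_min", "time_max",
    "size_min", "size_max",
    "offset_min", "offset_max",
)


def diff_aggregates(old_rows, new_rows, keys):
    """Return (group, metric, old, new) for every exact mismatch."""
    def index(rows):
        return {tuple(row[k] for k in keys): row for row in rows}

    old_idx, new_idx = index(old_rows), index(new_rows)
    mismatches = []
    for group in sorted(old_idx.keys() | new_idx.keys()):
        old, new = old_idx.get(group), new_idx.get(group)
        if old is None or new is None:
            # A group present on only one side is itself a mismatch.
            mismatches.append((group, "<missing>", old is not None, new is not None))
            continue
        for metric in METRICS:
            # Exact comparison: no rounding, no tolerance, no normalization.
            if old[metric] != new[metric]:
                mismatches.append((group, metric, old[metric], new[metric]))
    return mismatches


def make_row(cat, func_name, **metrics):
    """Toy row builder; unspecified metrics default to 0."""
    row = {"cat": cat, "func_name": func_name}
    row.update({m: metrics.get(m, 0) for m in METRICS})
    return row


old_rows = [
    make_row("posix", "read", count=3, time=0.75, size=4096),
    make_row("posix", "write", count=1, time=0.25, size=512),
]
new_rows = list(reversed(old_rows))  # same data, different row order
clean = diff_aggregates(old_rows, new_rows, keys=("cat", "func_name"))

drifted = [dict(row) for row in old_rows]
drifted[1]["size"] = 513  # simulate a one-byte discrepancy
dirty = diff_aggregates(old_rows, drifted, keys=("cat", "func_name"))
```

Indexing by group key makes the check insensitive to row order while still flagging any single-value drift or missing group.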
Tests
- `--forked` is now the default (`pytest-forked`): each test runs in its own process. Required because the C++ aggregator's string-intern dictionary is a process-global static; without per-process isolation, indexes built across tests corrupt each other.
- [...] `.pfw.gz` (the C++ indexer requires gzipped input).
- [...] `tests/test_distributed_index.py`; updated profile/system expectations.
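Why per-process isolation matters for a process-global table can be shown with a pure-Python stand-in. This is only an analogy under stated assumptions: `_INTERN`, `intern_id`, and `run_isolated` are hypothetical, and the real dictionary lives in the C++ aggregator, whose exact failure mode is not reproduced here. The sketch shows that IDs handed out by shared state depend on what earlier tests did, while a fresh process always starts clean.

```python
import subprocess
import sys

# Stand-in for a process-global intern table (the real one is a C++ static).
_INTERN = {}


def intern_id(name):
    """Return a stable ID for name within this process."""
    return _INTERN.setdefault(name, len(_INTERN))


def run_isolated(names):
    """Intern names in a fresh interpreter, mimicking one forked test."""
    code = (
        "import sys\n"
        "table = {}\n"
        "ids = [table.setdefault(n, len(table)) for n in sys.argv[1:]]\n"
        "print(','.join(map(str, ids)))\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", code, *names],
        capture_output=True, text=True, check=True,
    )
    return [int(x) for x in out.stdout.strip().split(",")]


# "Test A" and "test B" sharing one process: B inherits A's table.
test_a_ids = [intern_id("read"), intern_id("write")]
test_b_shared = [intern_id("write")]

# "Test B" in its own process: the table starts empty.
test_b_isolated = run_isolated(["write"])
```

In the shared case, test B sees ID 1 for `write` because test A ran first; in the isolated case it sees ID 0. Anything persisted with IDs from one table and read back against another would disagree, which is the kind of cross-test interference that running each test in its own process avoids.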