Skip to content

fix: improve tracing feature#3688

Merged
andygrove merged 32 commits intoapache:mainfrom
andygrove:improve-tracing
Apr 10, 2026
Merged

fix: improve tracing feature#3688
andygrove merged 32 commits intoapache:mainfrom
andygrove:improve-tracing

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Mar 13, 2026

Which issue does this PR close?

N/A

Rationale for this change

Improve tracing feature so that we can compare memory accounting with actual memory usage.

tracing

This PR also fixes a bug that was suppressing event tracing for shuffle spill/write. It is also now possible to filter on thread ID in perfetto to see memory aligned with events for a single thread.

Screenshot 2026-04-10 at 7 58 34 AM

An analysis tool is added to parse an event log and report on any memory allocations above the pool size. See docs in this PR for usage.

cargo run --bin analyze_trace -- /opt/spark-3.5.6-bin-hadoop3/work/app-20260410074249-0000/0/comet-event-trace.json
=== Comet Trace Memory Analysis ===

Counter events parsed: 213566
Threads with memory pools: 8
Peak jemalloc allocated:   2205.9 MB
Peak pool total:           2021.6 MB
Peak excess (jemalloc - pool): 363.6 MB

WARNING: jemalloc exceeded pool reservation at 161 sampled points:

     Time (us)        jemalloc      pool_total          excess
--------------------------------------------------------------
        177734        213.1 MB          0.1 MB        213.0 MB
...
       1092197       1998.5 MB       1675.1 MB        323.4 MB
       1158467       2159.8 MB       1836.3 MB        323.5 MB

What changes are included in this PR?

  • Many improvements to trace logging
  • Analysis tool

How are these changes tested?

Manually.

- Write memory counter values in bytes instead of MB, with `_bytes`
  suffix on the args key, so trace viewers display correct values
- Include the process ID in the trace filename
  (`comet-event-trace-{pid}.json`) so each executor writes its own
  file instead of corrupting a shared one
- Use the actual process ID in trace events instead of hardcoded `1`
- Replace `.expect()` panics with graceful error handling: the writer
  becomes `None` on failure so tracing silently disables itself
- Add `Drop` impl to flush the `BufWriter` on shutdown so buffered
  events are not lost at process exit
- Replace fragile `ThreadId` debug-string parsing with
  `ThreadId::as_u64().get()` (stable since Rust 1.74)
- Rename `CometUnsafeShuffleWriter` trace event to snake_case
  `comet_unsafe_shuffle_writer` for consistency
- Fix `"comet_shuffle_"` metric name (trailing underscore, no thread
  ID) in `CometBypassMergeSortShuffleWriter` to match the pattern
  used in `CometUnsafeShuffleWriter`
- Rename `jvm_heapUsed` metric to `jvm_heap_used` (snake_case)
- Emit `shuffle_spilled_bytes` counter event after each shuffle spill
@andygrove andygrove marked this pull request as ready for review March 18, 2026 12:47
andygrove and others added 7 commits April 9, 2026 12:49
ThreadId::as_u64() is an unstable feature (thread_id_value), which
causes compilation failures on stable Rust in CI. Parse the thread
ID from ThreadId's Debug format instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	native/core/src/execution/tracing.rs
#	native/shuffle/src/partitioners/multi_partition.rs
…tual usage

Add memory_pool_reserved metric to tracing output alongside jemalloc_allocated,
enabling comparison of DataFusion pool accounting against actual native heap.
Remove shuffle_spilled_bytes tracing as it is not useful for memory analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…c and logging memory after execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use thread ID instead of task_attempt_id for memory_pool_reserved metric
to get stable IDs (8 threads) instead of globally unique IDs (hundreds).
Rename metrics to task_NNN_* pattern so related metrics sort together.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andygrove andygrove changed the title fix: improve Chrome trace event format correctness and metric naming fix: improve memory tracing Apr 9, 2026

Traces can be viewed with [Trace Viewer].
Traces can be viewed with [Perfetto UI].

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

andygrove and others added 10 commits April 9, 2026 14:44
…c ordering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ory drop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… same thread aggregate into one series

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a global registry of memory pools per Java thread ID. When emitting
the comet_memory_reserved tracing metric, sum reserved() across all
active pools for that thread instead of reporting a single pool's value.
On releasePlan, unregister the pool and emit the remaining total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…or of summed comet_memory_reserved

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arking_lot::Mutex

- Pre-compute tracing_memory_metric_name on ExecutionContext to avoid repeated format! allocations
- Unify tracing and metrics poll counters into one (poll_count_since_metrics_check)
- Combine unregister + total_reserved into single lock acquisition (unregister_and_total)
- Switch THREAD_MEMORY_POOLS to parking_lot::Mutex (no poisoning, consistent with codebase)
- Remove pointless Recorder::get_thread_id wrapper, call get_thread_id() directly
- Extract log_jemalloc_usage helper function

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix filename (comet-event-trace.json, no PID suffix)
- Fix pid in example JSON (hardcoded 1, not real PID)
- Fix event names: executePlan (camelCase), getNextBatch[JVM] stage=N
- Remove non-existent shuffle_spilled_bytes metric
- Note jemalloc feature requirement for jemalloc_allocated
- Update screenshot with current trace visualization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andygrove andygrove changed the title fix: improve memory tracing fix: improve tracing feature Apr 10, 2026
andygrove and others added 2 commits April 10, 2026 06:45
Replace Java thread IDs with Rust thread IDs throughout the tracing
system for consistency. The Chrome trace tid field and metric names
now all use the same Rust thread ID. Rename comet_shuffle_NNN to
thread_NNN_comet_jvm_shuffle for consistent naming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
andygrove and others added 8 commits April 10, 2026 07:25
Parses comet-event-trace.json chrome trace logs and reports where
jemalloc allocated memory exceeds the sum of per-thread Comet memory
pool reservations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
s"Partitioning $outputPartitioning is not supported.")
}

shuffleWriterBuilder.setTracingEnabled(CometConf.COMET_TRACING_ENABLED.get())
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug fix for enabling tracing for shuffle events. We just weren't passing the flag into protobuf before.

Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove I love tracing and looking fwd to test it

Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! One minor comment.

.expect("Error parsing thread id")
}
pub fn get_thread_id() -> u64 {
let thread_id = std::thread::current().id();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude suggests std::thread::current().id().as_u64() is better.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, they told me the same. Turns out it is an unstable feature.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor performance concerns, but otherwise looking good!

log_memory_usage("jemalloc_allocated", allocated.read().unwrap() as u64);
}
}
let tracing_event_name = exec_context.tracing_event_name.clone();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing a heap allocation and deep-copy of a String on a hot path (per-batch) unconditionally. Can we guard this?

);
}

let tracing_event_name = build_tracing_event_name(&spark_plan);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this if tracing isn't enabled? create_plan is less frequent than execute_plan, but maybe still worth guarding.

Avoid heap allocation and plan tree traversal when tracing is disabled.
The String clone in executePlan was on the hot per-batch path, and
build_tracing_event_name walked the plan tree unconditionally.
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the observability and debugging improvements, @andygrove!

@andygrove andygrove merged commit 204d7e4 into apache:main Apr 10, 2026
132 checks passed
@andygrove andygrove deleted the improve-tracing branch April 10, 2026 22:47
@andygrove
Copy link
Copy Markdown
Member Author

Merged. Thanks @comphead @parthchandra @mbutrovich !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants