Conversation
- Write memory counter values in bytes instead of MB, with `_bytes`
suffix on the args key, so trace viewers display correct values
- Include the process ID in the trace filename
(`comet-event-trace-{pid}.json`) so each executor writes its own
file instead of corrupting a shared one
- Use the actual process ID in trace events instead of hardcoded `1`
- Replace `.expect()` panics with graceful error handling: the writer
becomes `None` on failure so tracing silently disables itself
- Add `Drop` impl to flush the `BufWriter` on shutdown so buffered
events are not lost at process exit
- Replace fragile `ThreadId` debug-string parsing with
`ThreadId::as_u64().get()` (stable since Rust 1.74)
- Rename `CometUnsafeShuffleWriter` trace event to snake_case
`comet_unsafe_shuffle_writer` for consistency
- Fix `"comet_shuffle_"` metric name (trailing underscore, no thread
ID) in `CometBypassMergeSortShuffleWriter` to match the pattern
used in `CometUnsafeShuffleWriter`
- Rename `jvm_heapUsed` metric to `jvm_heap_used` (snake_case)
- Emit `shuffle_spilled_bytes` counter event after each shuffle spill
e9f4b1e to
4482790
Compare
ThreadId::as_u64() is an unstable feature (thread_id_value), which causes compilation failures on stable Rust in CI. Parse the thread ID from ThreadId's Debug format instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts: # native/core/src/execution/tracing.rs # native/shuffle/src/partitioners/multi_partition.rs
…tual usage Add memory_pool_reserved metric to tracing output alongside jemalloc_allocated, enabling comparison of DataFusion pool accounting against actual native heap. Remove shuffle_spilled_bytes tracing as it is not useful for memory analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…c and logging memory after execution Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use thread ID instead of task_attempt_id for memory_pool_reserved metric to get stable IDs (8 threads) instead of globally unique IDs (hundreds). Rename metrics to task_NNN_* pattern so related metrics sort together. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
||
| Traces can be viewed with [Trace Viewer]. | ||
| Traces can be viewed with [Perfetto UI]. | ||
|
|
…c ordering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ory drop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… same thread aggregate into one series Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a global registry of memory pools per Java thread ID. When emitting the comet_memory_reserved tracing metric, sum reserved() across all active pools for that thread instead of reporting a single pool's value. On releasePlan, unregister the pool and emit the remaining total. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…or of summed comet_memory_reserved Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arking_lot::Mutex - Pre-compute tracing_memory_metric_name on ExecutionContext to avoid repeated format! allocations - Unify tracing and metrics poll counters into one (poll_count_since_metrics_check) - Combine unregister + total_reserved into single lock acquisition (unregister_and_total) - Switch THREAD_MEMORY_POOLS to parking_lot::Mutex (no poisoning, consistent with codebase) - Remove pointless Recorder::get_thread_id wrapper, call get_thread_id() directly - Extract log_jemalloc_usage helper function Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix filename (comet-event-trace.json, no PID suffix) - Fix pid in example JSON (hardcoded 1, not real PID) - Fix event names: executePlan (camelCase), getNextBatch[JVM] stage=N - Remove non-existent shuffle_spilled_bytes metric - Note jemalloc feature requirement for jemalloc_allocated - Update screenshot with current trace visualization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Java thread IDs with Rust thread IDs throughout the tracing system for consistency. The Chrome trace tid field and metric names now all use the same Rust thread ID. Rename comet_shuffle_NNN to thread_NNN_comet_jvm_shuffle for consistent naming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parses comet-event-trace.json chrome trace logs and reports where jemalloc allocated memory exceeds the sum of per-thread Comet memory pool reservations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…et into improve-tracing
| s"Partitioning $outputPartitioning is not supported.") | ||
| } | ||
|
|
||
| shuffleWriterBuilder.setTracingEnabled(CometConf.COMET_TRACING_ENABLED.get()) |
There was a problem hiding this comment.
This is the bug fix for enabling tracing for shuffle events. We just weren't passing the flag into protobuf before.
comphead
left a comment
There was a problem hiding this comment.
Thanks @andygrove I love tracing and looking fwd to test it
parthchandra
left a comment
There was a problem hiding this comment.
Very nice! One minor comment.
| .expect("Error parsing thread id") | ||
| } | ||
| pub fn get_thread_id() -> u64 { | ||
| let thread_id = std::thread::current().id(); |
There was a problem hiding this comment.
Claude suggests std::thread::current().id().as_u64() is better.
There was a problem hiding this comment.
Yup, they told me the same. Turns out it is an unstable feature.
mbutrovich
left a comment
There was a problem hiding this comment.
Minor performance concerns, but otherwise looking good!
native/core/src/execution/jni_api.rs
Outdated
| log_memory_usage("jemalloc_allocated", allocated.read().unwrap() as u64); | ||
| } | ||
| } | ||
| let tracing_event_name = exec_context.tracing_event_name.clone(); |
There was a problem hiding this comment.
This is doing a heap allocation and deep-copy of a String on a hot path (per-batch) unconditionally. Can we guard this?
native/core/src/execution/jni_api.rs
Outdated
| ); | ||
| } | ||
|
|
||
| let tracing_event_name = build_tracing_event_name(&spark_plan); |
There was a problem hiding this comment.
Do we need this if tracing isn't enabled? create_plan is less frequent than execute_plan, but maybe still worth guarding.
Avoid heap allocation and plan tree traversal when tracing is disabled. The String clone in executePlan was on the hot per-batch path, and build_tracing_event_name walked the plan tree unconditionally.
mbutrovich
left a comment
There was a problem hiding this comment.
Thanks for the observability and debugging improvements, @andygrove!
|
Merged. Thanks @comphead @parthchandra @mbutrovich ! |
Which issue does this PR close?
N/A
Rationale for this change
Improve tracing feature so that we can compare memory accounting with actual memory usage.
This PR also fixes a bug that was suppressing event tracing for shuffle spill/write. It is also now possible to filter on thread ID in perfetto to see memory aligned with events for a single thread.
An analysis tool is added to parse an event log and report on any memory allocations above the pool size. See docs in this PR for usage.
What changes are included in this PR?
How are these changes tested?
Manually.