Add XCCL collective communication activity tracing to XPU plugin#1396
tsocha wants to merge 3 commits into
Enable PTI_VIEW_COMMUNICATION collection in the XPU PTI plugin so oneCCL host-side collective operations show up in Kineto traces. Events are emitted as COLLECTIVE_COMM activities named with an "xccl::" prefix and carry the PTI communicator id as metadata.
- Gate new code paths on PTI_VERSION_AT_LEAST(0, 17)
- Wire enable/disable of PTI_VIEW_COMMUNICATION in XpuptiActivityApi
- Add handleCommunicationActivity for pti_view_record_comms records
- Add unit tests covering naming, field mapping, and out-of-range drop
- Document INTEL_LIBITTNOTIFY64 requirement in libkineto/README.md
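To make the naming, field mapping, and out-of-range drop listed above concrete, here is a minimal sketch of that conversion. It is not the code from this PR: `CommRecord`, `CollectiveActivity`, `CaptureWindow`, and `toActivity` are hypothetical stand-ins, and treating "out-of-range" as "outside the capture window" is an assumption.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for a PTI communication record. The real
// pti_view_record_comms layout is not shown in this PR excerpt.
struct CommRecord {
  std::string name;         // collective name, e.g. "allreduce"
  int64_t startNs;          // host-side start timestamp
  int64_t endNs;            // host-side end timestamp
  uint64_t communicatorId;  // PTI communicator id
};

// Hypothetical stand-in for the COLLECTIVE_COMM activity Kineto would log.
struct CollectiveActivity {
  std::string name;
  int64_t startNs;
  int64_t durationNs;
  uint64_t communicatorId;  // carried as metadata
};

// Assumed meaning of "out-of-range": outside the active capture window.
struct CaptureWindow {
  int64_t startNs;
  int64_t endNs;
};

// Maps a record to an activity, or returns nullopt when the record is
// malformed or falls outside the capture window.
std::optional<CollectiveActivity> toActivity(
    const CommRecord& rec, const CaptureWindow& window) {
  const bool malformed = rec.endNs < rec.startNs;
  const bool outOfRange =
      rec.startNs < window.startNs || rec.endNs > window.endNs;
  if (malformed || outOfRange) {
    return std::nullopt;
  }
  return CollectiveActivity{
      "xccl::" + rec.name,        // prefix described in the PR summary
      rec.startNs,
      rec.endNs - rec.startNs,
      rec.communicatorId};
}
```

A caller would skip logging whenever `toActivity` returns an empty optional, which is one plausible way to implement the "drop" behavior the unit tests cover.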
@gujinghui please review it.
#if PTI_VERSION_AT_LEAST(0, 17)
case ActivityType::COLLECTIVE_COMM: {
  auto rc = ptiViewEnable(PTI_VIEW_COMMUNICATION);
  if (rc != PTI_SUCCESS) {
Why don't we follow the existing code style and use the XPUPTI_CALL macro?
The XPUPTI_CALL macro throws an error.
I wanted to log a WARNING instead, because oneCCL is not supported on Windows.
I think the standard XPUPTI_CALL can be used. I will fix it.
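For readers following the thread, the two error-handling styles being compared can be sketched roughly as below. This is an illustration only: `Status`, `callOrThrow`, and `callOrWarn` are hypothetical names, and the sketch does not reproduce the actual XPUPTI_CALL definition or the PTI result type.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical status code standing in for a PTI result value.
enum class Status { Success, Error };

// Throwing style: any failure aborts the caller, which is the behavior
// the reply above attributes to XPUPTI_CALL.
void callOrThrow(Status rc, const std::string& what) {
  if (rc != Status::Success) {
    throw std::runtime_error(what + " failed");
  }
}

// Warning style: log and continue, so an unsupported feature only
// disables itself. Returns true when the call succeeded.
bool callOrWarn(Status rc, const std::string& what) {
  if (rc != Status::Success) {
    std::cerr << "WARNING: " << what << " failed; feature disabled\n";
    return false;
  }
  return true;
}
```

The trade-off is that the throwing style keeps one consistent convention across the plugin, while the warning style lets profiling continue on platforms where a single view cannot be enabled.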
The default trace output is a JSON file that can be visualized in Chrome Trace Viewer or Perfetto. The trace output is generated by the `ChromeTraceLogger` instance. The `ChromeTraceLogger` writes to a JSON file using `std::ofstream` in `output_json.cpp` to maximize performance during export. This instance is created by the `ActivityProfilerController` and is stored in the `ActivityLoggerFactory` alongside its protocol. Using this schema, Kineto supports multiple trace output formats.
- Intel XCCL: to enable collection of oneCCL host events, the `INTEL_LIBITTNOTIFY64` environment variable must be set to the path of `pti_view.so`.
Why does this need to be in the general instructions? Is this something that's covered in Intel's other docs? I'd like to keep this file short and general.
This env variable is required by ITT, which PTI uses to collect these events.
Without it, users won't see oneCCL events in their traces.
I wanted to expose this information to avoid confusing Kineto users.
We are working to remove this requirement, but because of the performance requirements of the PTI integration it's not ready yet.
I could create a new README file inside the xpupti plugin directory, but I'm afraid it would be hidden there.
What do you think?
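One alternative to documentation alone would be a runtime hint when the variable is missing. The sketch below shows the idea using only `std::getenv`; `readEnvPath` and `warnIfIttCollectorMissing` are hypothetical helpers and are not part of this PR or of Kineto.

```cpp
#include <cstdlib>
#include <iostream>
#include <optional>
#include <string>

// Returns the value of an environment variable, or nullopt when it is
// unset or empty.
std::optional<std::string> readEnvPath(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') {
    return std::nullopt;
  }
  return std::string(value);
}

// Hypothetical helper: warns once at profiler start when the ITT
// collector path is not configured. Returns true if a warning was printed.
bool warnIfIttCollectorMissing() {
  if (readEnvPath("INTEL_LIBITTNOTIFY64").has_value()) {
    return false;
  }
  std::cerr << "WARNING: INTEL_LIBITTNOTIFY64 is not set; "
               "oneCCL host events will not appear in the trace\n";
  return true;
}
```

A check like this would surface the requirement at the point of failure, so the README entry could stay short or move into the plugin directory.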
This is part 3/3 of #1335.