[UR][L0] Fix barrier event cleanup on urEventRelease#21340
[UR][L0] Fix barrier event cleanup on urEventRelease#21340winstonzhang-intel wants to merge 1 commit intointel:syclfrom
Conversation
Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling. Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately. This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads. Signed-off-by: Zhang, Winston <winston.zhang@intel.com>
1eac543 to
0eb6d24
Compare
| if (isEventsWaitCommand && !isEventDeleted && !isEventsWaitCompleted) { | ||
| ze_result_t QueryRes = ZE_RESULT_NOT_READY; | ||
| { | ||
| std::shared_lock<ur_shared_mutex> EventLock(Event->Mutex); | ||
| auto HostVisibleEvent = Event->HostVisibleEvent; | ||
| if (checkL0LoaderTeardown() && HostVisibleEvent) { | ||
| QueryRes = | ||
| ZE_CALL_NOCHECK(zeEventQueryStatus, (HostVisibleEvent->ZeEvent)); | ||
| } | ||
| } | ||
|
|
||
| if (QueryRes == ZE_RESULT_SUCCESS) { | ||
| isEventsWaitCompleted = true; | ||
| Event->Completed = true; | ||
| } | ||
| } |
There was a problem hiding this comment.
Does this fix a memory leak? What happens if the event is not complete by the time zeEventQueryStatus is called? The event leaks?
The way I understand it works, these events are in the command list's EventList, which is then scanned periodically, on subsequent operations. Does the event get stuck there?
I guess I don't understand why this is needed. It seems inherently race'y to me, and all it would do is deallocate the event a little bit earlier.
There was a problem hiding this comment.
This patch fixes a cleanup timing gap. Namely, before if urEventRelease ran before Event->Completed was set, it skipped the barrier/wait cleanup path, so queue-held refs could hang around much longer than intended.
With my change, now it does a quick zeEventQueryStatus; if the event is already done, it marks complete and cleans up immediately.
In the case that zeEventQueryStatus says not ready, it does not force-delete anything. The event stays owned by queue/command-list tracking and should be reclaimed later by normal recycle/sync paths.
So yes, it can stay in the command list EventList temporarily. The issue this commit addresses is that in barrier-heavy workloads, “temporary” was causing unbounded growth in practice.
In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.
Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling.
Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately.
This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.