Skip to content

[UR][L0] Fix barrier event cleanup on urEventRelease#21340

Open
winstonzhang-intel wants to merge 1 commit intointel:syclfrom
winstonzhang-intel:urlza-710
Open

[UR][L0] Fix barrier event cleanup on urEventRelease#21340
winstonzhang-intel wants to merge 1 commit intointel:syclfrom
winstonzhang-intel:urlza-710

Conversation

@winstonzhang-intel
Copy link
Contributor

Repeated barrier/event-wait submissions can accumulate Level Zero event pools when an event is released before its completion flag is set. In that case, the release path may skip the completed-event cleanup and keep queue-held references alive until later command-list recycling.

Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER explicitly in urEventRelease: if completion is not yet marked, query the host-visible event status and, when already signaled, mark completed and run the existing cleanup path immediately.

This makes completed barrier/wait events release resources promptly and avoids unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.

Repeated barrier/event-wait submissions can accumulate Level Zero event pools
when an event is released before its completion flag is set. In that case,
the release path may skip the completed-event cleanup and keep queue-held
references alive until later command-list recycling.

Handle UR_COMMAND_EVENTS_WAIT and UR_COMMAND_EVENTS_WAIT_WITH_BARRIER
explicitly in urEventRelease: if completion is not yet marked, query the
host-visible event status and, when already signaled, mark completed and run
the existing cleanup path immediately.

This makes completed barrier/wait events release resources promptly and avoids
unbounded zeEventPoolCreate growth in long-running barrier-heavy workloads.

Signed-off-by: Zhang, Winston <winston.zhang@intel.com>
Copy link

@aravindksg aravindksg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines +900 to +915
if (isEventsWaitCommand && !isEventDeleted && !isEventsWaitCompleted) {
ze_result_t QueryRes = ZE_RESULT_NOT_READY;
{
std::shared_lock<ur_shared_mutex> EventLock(Event->Mutex);
auto HostVisibleEvent = Event->HostVisibleEvent;
if (checkL0LoaderTeardown() && HostVisibleEvent) {
QueryRes =
ZE_CALL_NOCHECK(zeEventQueryStatus, (HostVisibleEvent->ZeEvent));
}
}

if (QueryRes == ZE_RESULT_SUCCESS) {
isEventsWaitCompleted = true;
Event->Completed = true;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this fix a memory leak? What happens if the event is not complete by the time zeEventQueryStatus is called? The event leaks?

The way I understand it works, these events are in the command list's EventList, which is then scanned periodically, on subsequent operations. Does the event get stuck there?

I guess I don't understand why this is needed. It seems inherently race'y to me, and all it would do is deallocate the event a little bit earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch fixes a cleanup timing gap. Namely, before if urEventRelease ran before Event->Completed was set, it skipped the barrier/wait cleanup path, so queue-held refs could hang around much longer than intended.
With my change, now it does a quick zeEventQueryStatus; if the event is already done, it marks complete and cleans up immediately.

In the case that zeEventQueryStatus says not ready, it does not force-delete anything. The event stays owned by queue/command-list tracking and should be reclaimed later by normal recycle/sync paths.

So yes, it can stay in the command list EventList temporarily. The issue this commit addresses is that in barrier-heavy workloads, “temporary” was causing unbounded growth in practice.

In terms of race conditions, this is mostly a best-effort fast path. Worst case it misses completion and cleanup is deferred; it shouldn’t double-free because refcount/cleanup guards still control destruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants