Skip to content

Fix: add drain_barrier_mask for full-stop barrier in sync_start drain protocol#501

Open
yanghaoran29 wants to merge 1 commit intohw-native-sys:mainfrom
yanghaoran29:fix_sync
Open

Fix: add drain_barrier_mask for full-stop barrier in sync_start drain protocol#501
yanghaoran29 wants to merge 1 commit intohw-native-sys:mainfrom
yanghaoran29:fix_sync

Conversation

@yanghaoran29
Copy link
Copy Markdown
Contributor

Add a second barrier (drain_barrier_mask) between the ack barrier and election in the sync_start drain protocol. This ensures all threads have fully stopped tracker writes before any thread proceeds to election, preventing the race where a non-elected thread may still write to core_trackers_ while the drain worker has exclusive access.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a four-phase drain protocol in aicpu_executor.cpp by adding a full-stop barrier to ensure all threads have finished tracker writes before the election phase. A critical logic error was identified in the implementation of the new barrier: the spin-wait loop checks drain_worker_elected to detect a reset, but since this variable is only set after the loop, the condition is always true, leading to premature returns and potential livelocks. It is recommended to use drain_ack_mask to monitor for resets instead.

… protocol

Add a second barrier (drain_barrier_mask) between the ack barrier and election
in the sync_start drain protocol. This ensures all threads have fully stopped
tracker writes before any thread proceeds to election, preventing the race where
a non-elected thread may still write to core_trackers_ while the drain worker
has exclusive access.

While spinning on drain_barrier_mask, detect reset using drain_ack_mask
(acquire loads; release store when clearing on insufficient resources), not
drain_worker_elected, which remains zero until after the barrier completes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant