[CRITICAL] State restoration fails in Prefetcher due to mutable state aliasing in _populate_queue#1547
[CRITICAL] State restoration fails in Prefetcher due to mutable state aliasing in _populate_queue#1547alexdremov wants to merge 3 commits into
Prefetcher due to mutable state aliasing in _populate_queue#1547Conversation
Use deepcopy for source state_dict to avoid mutation.
|
Hi @alexdremov! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes incorrect Prefetcher state restoration by preventing mutable state aliasing when capturing snapshots from source.state_dict() in the worker loop.
Changes:
- Deep-copies
source.state_dict()before storing snapshots to avoid later mutation by upstream nodes. - Adds
copyimport to support deep copy in_populate_queue.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Refactor snapshot handling to avoid unnecessary deepcopy.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
torchdata/nodes/_populate_queue.py:81
- This change addresses a concurrency/aliasing issue but the existing Prefetcher save/load tests use nodes whose
get_state()returns fresh dicts, so they won’t catch nested-mutable aliasing regressions. Adding a regression test with a node that returns a shared mutable structure fromget_state()(like the reproduction in the PR description) would help ensure snapshots remain stable even while the worker thread continues prefetching.
if snapshot_frequency > 0 and yielded % snapshot_frequency == 0:
snapshot = source.state_dict()
if snapshot is not None:
snapshot = copy.deepcopy(snapshot)
I encountered an issue where data pipeline restoration does not work correctly when using
Prefetcher. I identified a temporal aliasing bug (a code-order race condition) regardingstate_dict()manipulation.torchdatadoes not strictly enforce immutability onget_state(), meaning nodes can return nested mutable structures (lists, dicts, etc.). Because the snapshot is not deeply copied, it is mutated in place as the pipeline continues to yield items.Reproduction Steps
In
_populate_queue, a single worker thread executes the following loop sequentially:next(source).snapshot = source.state_dict(), which exposes references to the internal states of upstream nodes (lists, dicts, buffers, etc.).snapshotinto theSnapshotStore.next(source)again. [!] Internal buffers of the underlying nodes get updated, which silently mutates the referenced state already sitting in theSnapshotStore.SnapshotStore(e.g., by the main thread), it is inconsistent and reflects future states.This leads to the inability to resume any complex pipeline that relies on
Prefetcherand stateful upstream nodes.Proposed Solution
The safest and most robust fix is to enforce a deep copy when taking the snapshot before placing it in the store. Updating step 2 to:
This isolates the saved state and prevents the worker thread's subsequent
next(source)calls from corrupting the stored snapshot.Reproduction
This reproduces the bug on the latest release but not with the fix.