Fix broken WSLCorePort channel after receive timeout #14455
chemwolf6922 wants to merge 14 commits into microsoft:master
Pull request overview
This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until `wsl --shutdown` is run.
Changes:
- Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
- Add a magic number field to `MESSAGE_HEADER` for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
- Zero-initialize a `Reply` union in `binfmt.cpp` to ensure the new `MessageMagic` default initializer doesn't cause issues with raw `read()` calls.
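The echo-back idea in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `Header`, `Wire`, `MakeReply`, `Requester`, and `RecoversAfterTimeout` are hypothetical names, the socket is replaced by an in-memory queue, and an empty queue stands in for a receive timeout. Only `m_expected_reply_sequence` is borrowed from the PR, and its use here is a guess at the intent.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical, simplified header: only the field this sketch needs.
struct Header {
    std::uint32_t sequence;
};

// Toy in-memory "wire" standing in for the socket.
using Wire = std::deque<Header>;

// Responder side: echo the request's sequence number back in the reply.
inline Header MakeReply(const Header& request) {
    return Header{request.sequence};
}

// Requester side: remembers the sequence of the request currently in flight.
class Requester {
public:
    Header NextRequest() { return Header{++m_expected_reply_sequence}; }

    // Pops replies until one matches the in-flight request. Non-matching
    // replies (here: late replies to timed-out requests) are dropped.
    // Returns nullopt when the wire drains first, which models a timeout.
    std::optional<Header> Receive(Wire& wire) {
        while (!wire.empty()) {
            Header reply = wire.front();
            wire.pop_front();
            if (reply.sequence == m_expected_reply_sequence) {
                return reply;
            }
        }
        return std::nullopt;
    }

private:
    std::uint32_t m_expected_reply_sequence = 0;
};

// Timeout-then-recover: the first reply arrives late, the second on time.
inline bool RecoversAfterTimeout() {
    Requester requester;
    Wire wire;

    Header first = requester.NextRequest();
    bool timedOut = !requester.Receive(wire).has_value();  // nothing arrived yet
    assert(timedOut);
    wire.push_back(MakeReply(first));  // the late reply shows up afterwards

    Header second = requester.NextRequest();
    wire.push_back(MakeReply(second));
    std::optional<Header> reply = requester.Receive(wire);  // skips the stale one
    return reply.has_value() && reply->sequence == second.sequence;
}
```

Because the expected value is derived from the request that was actually sent, a timeout leaves nothing to drift: the late reply is simply discarded on the next receive.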
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/shared/inc/lxinitshared.h` | Added `Magic` constant and `MessageMagic` field to `MESSAGE_HEADER`; updated `static_assert` for new struct size. |
| `src/shared/inc/SocketChannel.h` | Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced `m_received_messages` with `m_expected_reply_sequence` / `m_pending_reply_sequence`. |
| `src/shared/inc/socketshared.h` | Added magic number validation in `RecvMessage` before processing header. |
| `src/linux/init/binfmt.cpp` | Zero-initialized `Reply` union to handle new `MessageMagic` default member initializer. |
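As a rough illustration of the magic number check described for `RecvMessage`, the sketch below validates the magic field before trusting anything else in the header. The constant `kMagic`, the `Header` layout, and the helpers `ParseHeader`, `Serialize`, and `RejectsShiftedFrame` are assumptions made for this example; the real `MESSAGE_HEADER` and its `Magic` value are not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical value and layout; the real constant and MESSAGE_HEADER differ.
constexpr std::uint32_t kMagic = 0x57534C31;

struct Header {
    std::uint32_t magic = kMagic;
    std::uint32_t type = 0;
    std::uint32_t size = 0;  // total message size, header included
    std::uint32_t sequence = 0;
};

// Copies a header out of raw bytes and rejects it if the magic is wrong,
// before any other field is used.
inline std::optional<Header> ParseHeader(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < sizeof(Header)) {
        return std::nullopt;
    }
    Header header;
    std::memcpy(&header, bytes.data(), sizeof(Header));
    if (header.magic != kMagic) {
        return std::nullopt;  // framing corruption: the stream is misaligned
    }
    if (header.size < sizeof(Header)) {
        return std::nullopt;
    }
    return header;
}

inline std::vector<std::uint8_t> Serialize(const Header& header) {
    std::vector<std::uint8_t> bytes(sizeof(Header));
    std::memcpy(bytes.data(), &header, sizeof(Header));
    return bytes;
}

// Drops one byte from a valid frame and checks that parsing now fails.
inline bool RejectsShiftedFrame() {
    Header header;
    header.size = static_cast<std::uint32_t>(sizeof(Header));
    header.sequence = 3;
    std::vector<std::uint8_t> bytes = Serialize(header);
    assert(ParseHeader(bytes).has_value());  // the intact frame parses
    bytes.erase(bytes.begin());              // lose one byte
    bytes.push_back(0);                      // keep the buffer long enough
    return !ParseHeader(bytes).has_value();
}
```

Checking the magic first means a misaligned stream fails fast instead of being interpreted as a header with an arbitrary size and type.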
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.
Changes:
- Added a magic number field to `MESSAGE_HEADER` and validated it in `RecvMessage` to detect framing corruption early.
- Replaced independent sequence counters with an echo-back sequence mechanism in `SocketChannel` using `m_expected_reply_sequence` and `m_pending_reply_sequence`, with a loop to skip stale replies.
- Zero-initialized a union in `binfmt.cpp` to ensure the new `MessageMagic` field is properly initialized when reading responses.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/shared/inc/lxinitshared.h` | Added `Magic` constant and `MessageMagic` field to `MESSAGE_HEADER`; updated `static_assert` for `LX_GNS_SET_PORT_LISTENER` size. |
| `src/shared/inc/socketshared.h` | Added magic number validation in `RecvMessage` after reading the header. |
| `src/shared/inc/SocketChannel.h` | Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from `ValidateMessageHeader`. |
| `src/linux/init/binfmt.cpp` | Zero-initialized `Reply` union to ensure `MessageMagic` defaults correctly. |
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Pull request overview
This PR aims to prevent SocketChannel protocol desynchronization after a transaction timeout (the “expected sequence” advancing while a late response with the previous sequence arrives), which can leave a channel in a permanently broken state.
Changes:
- Reworks `SocketChannel` sequencing to echo request sequence numbers in replies and discard stale replies.
- Adds new per-channel state (`m_expected_reply_sequence` / `m_pending_reply_sequence`) to track request/reply sequencing.
- Updates protocol error handling to validate type separately from sequencing.
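One way to read "validate type separately from sequencing" is as a two-step decision, sketched below. `ReplyAction` and `ClassifyReply` are hypothetical names, and the ordering is an assumption: the sequence is checked first so that a stale reply of a different type is skipped rather than reported as an error. Sequence wraparound is ignored for brevity.

```cpp
#include <cstdint>

enum class ReplyAction { Accept, SkipStale, ProtocolError };

// Hypothetical helper: decides what to do with an incoming reply.
// Sequence is examined first; the type is only validated for the reply
// that matches the request currently in flight.
constexpr ReplyAction ClassifyReply(std::uint32_t expectedType,
                                    std::uint32_t expectedSequence,
                                    std::uint32_t replyType,
                                    std::uint32_t replySequence) {
    if (replySequence < expectedSequence) {
        return ReplyAction::SkipStale;  // answer to an earlier, timed-out request
    }
    if (replySequence > expectedSequence) {
        return ReplyAction::ProtocolError;  // a reply to a request never sent
    }
    return replyType == expectedType ? ReplyAction::Accept
                                     : ReplyAction::ProtocolError;
}
```

For example, with an expected type of 2 and an expected sequence of 5, a reply carrying sequence 4 is skipped whatever its type, while a reply carrying sequence 5 with the wrong type is a protocol error.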
Is there a possibility to have a unit test that specifically covers the timeout-then-recover scenario, basically simulating a timed-out receive followed by a successful send/receive?
@ptrivedi The WSL project lacks test infrastructure for internal components like this. The Windows tests are e2e tests, and the Linux tests are syscall tests. Making this happen might require some major changes.

Summary of the Pull Request
This pattern shows up in multiple sleep -> wake -> wsl stuck reports:

In the current sequence number logic, the receive sequence is incremented even when no message is received. If a timeout is allowed on the channel and the channel is not destroyed afterwards, the next receive will always get the N-1 message due to:
This will lock the channel in an unusable state.
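The failure mode can be reproduced with a toy model of the old scheme. Everything here is a simplified assumption: `OldSender`, `OldReceiver`, and `CountMismatchesAfterTimeout` are hypothetical names, the wire is an in-memory queue of sequence numbers, and the loop keeps going after a mismatch only to show that the offset never heals (a real channel would report an error instead).

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Old scheme, simplified: each side keeps its own independent counter.
struct OldSender {
    std::uint32_t sent = 0;
    std::uint32_t Send() { return ++sent; }
};

struct OldReceiver {
    std::uint32_t received = 0;

    // Returns true when the next queued message carries the expected
    // sequence. The counter advances even when the wire is empty, which
    // models a receive timeout.
    bool Receive(std::deque<std::uint32_t>& wire) {
        std::uint32_t expected = ++received;
        if (wire.empty()) {
            return false;  // timed out, but the counter already moved on
        }
        std::uint32_t got = wire.front();
        wire.pop_front();
        return got == expected;
    }
};

// After a single timeout, every later receive sees sequence N-1.
inline int CountMismatchesAfterTimeout(int rounds) {
    assert(rounds >= 0);
    OldSender peer;
    OldReceiver receiver;
    std::deque<std::uint32_t> wire;

    receiver.Receive(wire);       // times out; receiver now expects 2 next
    wire.push_back(peer.Send());  // the late message still carries 1

    int mismatches = 0;
    for (int i = 0; i < rounds; ++i) {
        if (!receiver.Receive(wire)) {
            ++mismatches;  // expected N, got N-1
        }
        wire.push_back(peer.Send());
    }
    return mismatches;
}
```

In this model every round after the timeout mismatches, so `CountMismatchesAfterTimeout(5)` returns 5.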
This PR makes these changes to keep the WSLCorePort channels working after a receive timeout.
These changes are applied only to the WSLCorePort channels to reduce risk, though other channels may face the same problem.
PR Checklist
Closes: #14193 (WSL2 crashes on waking up from sleep), #14055 (WSL 2.6.3.0: Terminal crash after hibernation/sleep with [process exited with code 1]), #14014 (Error code: Wsl/Service/E_UNEXPECTED)
Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
Tests: Added/updated if needed and all pass
All but 8 unit tests pass. Six of the failures were caused by GPO settings or PowerShell issues. The other two (CGroupv1 and CaseSensitivity) seem unrelated to the changes. I have appended logs for the failed tests at the end.
I'm also dogfooding this build right now.
Localization: All end user facing strings can be localized
Dev docs: Added/updated if needed
Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx
Detailed Description of the Pull Request / Additional comments
Validation Steps Performed
Failed unit tests