A 10s OTA timeout is too short for slower connections#54
Open
pekenator1 wants to merge 1 commit into
Open
Conversation
To summarize what's wrong and what was changed: OTA would not terminate gracefully even if successful on a slow CAN channel. Why it broke: vllp_channel_send is non-blocking — it just enqueues blocks onto the TX queue. vllp_channel_read was called with a 10-second deadline immediately after the last channel_send, so the clock started before a single block had even been transmitted. Stop-and-wait over CAN at MTU=8 — one round-trip per 128-byte block — plus the flash erase and write easily exceeds 10 seconds. The deadline always fired before the MCU's fin=0 status byte arrived. Why infinite wait would have hung before this fix: vllp_disconnect (called by the 30-second VLLP inactivity timer after the device reboots) never signalled rxq_cond for non-reconnect established channels. The reconnect path already called channel_enq_rx_meta to unblock waiters; the non-reconnect path just called the eof callback and freed the channel. A vllp_channel_read(-1) would block forever. Two-part fix: 1. vllp.c — vllp_disconnect: added channel_enq_rx_meta(vc, error, VLLP_PKT_EOF) for non-reconnect ESTABLISHED channels, mirroring what the reconnect path already did. This unblocks any vllp_channel_read(-1) when the VLLP inactivity timer fires. 2. vllp_ota.c — vllp_do_ota: changed timeout from 10*1000*1000 to -1. The VLLP-level inactivity timeout (configurable, 30 s in the test) is now the real deadline. VLLP_ERR_TIMEOUT from disconnect and a clean EOF with no status byte are both treated as success — the device rebooted to apply the firmware, which is the expected outcome.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
To summarize what's wrong and what was changed:
OTA would not terminate gracefully even if successful on a slow CAN channel.
Why it broke: vllp_channel_send is non-blocking — it just enqueues blocks onto the TX queue. vllp_channel_read was called with a 10-second deadline immediately after the last channel_send, so the clock started before a single block had even been
transmitted. Stop-and-wait over CAN at MTU=8 — one round-trip per 128-byte block — plus the flash erase and write easily exceeds 10 seconds. The deadline always fired before the MCU's fin=0 status byte arrived.
Why infinite wait would have hung before this fix: vllp_disconnect (called by the 30-second VLLP inactivity timer after the device reboots) never signalled rxq_cond for non-reconnect established channels. The reconnect path already called channel_enq_rx_meta to unblock waiters; the non-reconnect path just called the eof callback and freed the channel. A vllp_channel_read(-1) would block forever.
Two-part fix:
vllp.c — vllp_disconnect: added channel_enq_rx_meta(vc, error, VLLP_PKT_EOF) for non-reconnect ESTABLISHED channels, mirroring what the reconnect path already did. This unblocks any vllp_channel_read(-1) when the VLLP inactivity timer fires.
vllp_ota.c — vllp_do_ota: changed timeout from 1010001000 to -1. The VLLP-level inactivity timeout (configurable, 30 s in the test) is now the real deadline. VLLP_ERR_TIMEOUT from disconnect and a clean EOF with no status byte are both treated as success — the device rebooted to apply the firmware, which is the expected outcome.