A 10s OTA timeout is too short for slower connections#54

Open: pekenator1 wants to merge 1 commit into master from pek/vllp_ota_fix

@pekenator1
Collaborator

To summarize what's wrong and what was changed:

On a slow CAN channel, OTA would not terminate gracefully even when the update itself succeeded.

Why it broke: vllp_channel_send is non-blocking; it just enqueues blocks onto the TX queue. vllp_channel_read was called with a 10-second deadline immediately after the last vllp_channel_send, so the clock started before a single block had been transmitted. Stop-and-wait over CAN at MTU=8 (one round-trip per 128-byte block), plus the flash erase and write, easily exceeds 10 seconds. The deadline always fired before the MCU's fin=0 status byte arrived.
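As a rough illustration of why 10 seconds is not enough, the sketch below estimates a lower bound on the transfer time for a stop-and-wait protocol. The function name and every number in the usage note (image size, per-block round-trip time, flash time) are hypothetical and not measured values from this PR; only the 128-byte block size comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough lower bound on OTA duration for a stop-and-wait transfer:
 * one round-trip per block, plus a fixed cost for flash erase/write.
 * Hypothetical helper for illustration, not part of VLLP.
 */
static int64_t
ota_estimate_us(int64_t image_bytes, int64_t block_bytes,
                int64_t block_rtt_us, int64_t flash_us)
{
  assert(block_bytes > 0);

  /* Round up so a trailing partial block still costs a round-trip. */
  int64_t blocks = (image_bytes + block_bytes - 1) / block_bytes;

  return blocks * block_rtt_us + flash_us;
}
```

With an assumed 256 KiB image, 128-byte blocks, 10 ms per block round-trip and 2 s of flash work, `ota_estimate_us(256 * 1024, 128, 10000, 2000000)` gives 22,480,000 µs, already more than twice the old deadline.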

Why infinite wait would have hung before this fix: vllp_disconnect (called by the 30-second VLLP inactivity timer after the device reboots) never signalled rxq_cond for non-reconnect established channels. The reconnect path already called channel_enq_rx_meta to unblock waiters; the non-reconnect path just called the eof callback and freed the channel. A vllp_channel_read(-1) would block forever.
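The hang described above is the usual condition-variable pitfall: a reader waiting with no deadline only wakes up if someone queues something and signals. The following is a minimal sketch of that pattern using POSIX threads. The `toy_` types and functions are hypothetical stand-ins, not the actual VLLP data structures; the real code uses rxq_cond and channel_enq_rx_meta as described in the text.

```c
#include <pthread.h>

#define TOY_PKT_NONE 0
#define TOY_PKT_EOF  1

/* Hypothetical one-slot RX "queue" guarded by a mutex and condvar. */
typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t rxq_cond;
  int pending_pkt;   /* TOY_PKT_NONE when nothing is queued */
  int error;         /* error code delivered with the EOF marker */
} toy_channel_t;

static void
toy_channel_init(toy_channel_t *tc)
{
  pthread_mutex_init(&tc->mutex, NULL);
  pthread_cond_init(&tc->rxq_cond, NULL);
  tc->pending_pkt = TOY_PKT_NONE;
  tc->error = 0;
}

/* Read with no deadline: blocks until something is queued. */
static int
toy_channel_read(toy_channel_t *tc, int *error)
{
  pthread_mutex_lock(&tc->mutex);
  while (tc->pending_pkt == TOY_PKT_NONE)
    pthread_cond_wait(&tc->rxq_cond, &tc->mutex);
  int pkt = tc->pending_pkt;
  *error = tc->error;
  tc->pending_pkt = TOY_PKT_NONE;
  pthread_mutex_unlock(&tc->mutex);
  return pkt;
}

/*
 * Disconnect path: queue an EOF marker and wake all waiters.
 * Without the broadcast, a blocked toy_channel_read() never returns.
 */
static void
toy_disconnect(toy_channel_t *tc, int error)
{
  pthread_mutex_lock(&tc->mutex);
  tc->pending_pkt = TOY_PKT_EOF;
  tc->error = error;
  pthread_cond_broadcast(&tc->rxq_cond);
  pthread_mutex_unlock(&tc->mutex);
}
```

The reader checks the predicate in a loop, so it returns the EOF marker whether the disconnect happens before or after it starts waiting.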

Two-part fix:

  1. vllp.c, vllp_disconnect: added channel_enq_rx_meta(vc, error, VLLP_PKT_EOF) for non-reconnect ESTABLISHED channels, mirroring what the reconnect path already did. This unblocks any vllp_channel_read(-1) when the VLLP inactivity timer fires.

  2. vllp_ota.c, vllp_do_ota: changed timeout from 10*1000*1000 to -1. The VLLP-level inactivity timeout (configurable, 30 s in the test) is now the real deadline. VLLP_ERR_TIMEOUT from disconnect and a clean EOF with no status byte are both treated as success: the device rebooted to apply the firmware, which is the expected outcome.
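The success rule in the second part of the fix can be sketched as a small decision function. This is an illustration only: the `toy_` names are hypothetical, and treating a status byte of zero as success is an assumption, since the description does not spell out the status encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error codes standing in for the VLLP ones. */
typedef enum {
  TOY_OK = 0,            /* clean EOF, no error */
  TOY_ERR_TIMEOUT = -1,  /* inactivity timer fired in disconnect */
  TOY_ERR_OTHER = -2     /* any other failure */
} toy_err_t;

/*
 * Decide whether the OTA outcome counts as success after the final read.
 * got_status: true if the device sent a status byte before the channel closed.
 */
static bool
toy_ota_outcome_ok(toy_err_t err, bool got_status, uint8_t status)
{
  if (got_status)
    return status == 0;  /* assumption: zero means success */

  /* No status byte: the device most likely rebooted into the new firmware. */
  if (err == TOY_ERR_TIMEOUT || err == TOY_OK)
    return true;

  return false;
}
```

Any other error without a status byte is still reported as a failure, so the relaxed rule only covers the two cases named above.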
