Fix: Re-queue packet on send failure instead of dropping it#2141
Fix: Re-queue packet on send failure instead of dropping it#2141OverkillFPV wants to merge 3 commits intomeshcore-dev:devfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes the Dispatcher TX failure behavior so that when startSendRaw() fails (e.g., radio busy after CAD timeout), the outbound packet is re-queued for a delayed retry instead of being released back to the pool (dropped).
Changes:
- Re-queue the outbound packet on
startSendRaw()failure rather than freeing it. - Add a retry delay before attempting further TX after a send-start failure.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/Dispatcher.cpp
Outdated
| _mgr->queueOutbound(outbound, 0, futureMillis(getCADFailRetryDelay())); | ||
| outbound = NULL; | ||
| next_tx_time = futureMillis(getCADFailRetryDelay()); |
There was a problem hiding this comment.
getCADFailRetryDelay() is called twice here (once for queueOutbound and once for next_tx_time). Since getCADFailRetryDelay() can be non-deterministic (e.g., Mesh::getCADFailRetryDelay() uses RNG), the packet’s scheduled retry time and the global next_tx_time throttle can diverge unpredictably. Compute the retry delay once and reuse it for both scheduling and next_tx_time.
| _mgr->queueOutbound(outbound, 0, futureMillis(getCADFailRetryDelay())); | |
| outbound = NULL; | |
| next_tx_time = futureMillis(getCADFailRetryDelay()); | |
| int retry_delay = getCADFailRetryDelay(); | |
| unsigned long retry_time = futureMillis(retry_delay); | |
| _mgr->queueOutbound(outbound, 0, retry_time); | |
| outbound = NULL; | |
| next_tx_time = retry_time; |
src/Dispatcher.cpp
Outdated
| releasePacket(outbound); // return to pool | ||
|
|
||
| // re-queue instead of dropping so the packet gets another chance | ||
| _mgr->queueOutbound(outbound, 0, futureMillis(getCADFailRetryDelay())); |
There was a problem hiding this comment.
Re-queuing with a hard-coded priority 0 can change ordering semantics: lower numeric values are treated as higher priority in PacketQueue::get(), so this elevates any packet that happened to fail startSendRaw() to highest priority on retry and may starve other queued traffic. Consider preserving the original outbound priority (e.g., by storing priority on the Packet or extending PacketManager::getNextOutbound to return both packet + priority) when re-queuing.
| @@ -330,9 +330,11 @@ void Dispatcher::checkSend() { | |||
| MESH_DEBUG_PRINTLN("%s Dispatcher::loop(): ERROR: send start failed!", getLogDateTime()); | |||
There was a problem hiding this comment.
The debug string says Dispatcher::loop() but this code is inside Dispatcher::checkSend(), which makes logs misleading when diagnosing TX failures. Update the message to the correct function name.
| MESH_DEBUG_PRINTLN("%s Dispatcher::loop(): ERROR: send start failed!", getLogDateTime()); | |
| MESH_DEBUG_PRINTLN("%s Dispatcher::checkSend(): ERROR: send start failed!", getLogDateTime()); |
3f9190d to
9675823
Compare
|
I was also wondering about increasing the timeout to >4 seconds as for some nodes with 50 neighbours it's not too unheard of to have 4 seconds of traffic |
When the CAD timeout's due to high traffic the packet is force sent after 4 seconds. And StartSendRaw fails due to the radio being busy the packet is dropped. This change changes it so that instead of dropping the packet entirely, it is requeued with a delay and retried. This is why some of my high traffic nodes with 20+ neighbours have become unreliable.