fix: add zombie radio recovery for TX fail reset and RX stuck reboot #2151
Open
KG7QIN wants to merge 1 commit into meshcore-dev:dev from
Dispatcher: re-queue packets on startSendRaw() failure instead of dropping them. After N consecutive TX failures, call onTxStuck() (default: resetAGC). After N failed RX recovery attempts (ERR_EVENT_STARTRX_TIMEOUT), call onRxUnrecoverable() (default: no-op, overridden in examples to soft-reboot). Both thresholds are configurable via terminal (set tx.fail.threshold / set rx.fail.threshold, 0=disabled, range 1-10, default 3) and persisted in NodePrefs. Companion radio uses its own NodePrefs/DataStore at fields 90-91; all other examples use CommonCLI fields 291-292.
This is a common problem for bot operators. Radios become zombies that fail on TX or RX, and the only fix is to power-cycle the radio by pressing the reset button or pulling power. This change attempts to recover without needing physical access to the radio.
Problem
Two silent failure modes can leave the radio non-functional with no
recovery path:
startSendRaw() fails and the packet is silently dropped. The node appears to be running but nothing is transmitted.
The radio gets stuck outside RX mode. ERR_EVENT_STARTRX_TIMEOUT was already detecting this after 8s, but no recovery action was taken.
Changes
Dispatcher
On startSendRaw() failure, re-queue the packet instead of dropping it.
Increment tx_fail_count on each failure and call onTxStuck() when the count reaches the configured threshold. Reset to 0 on any successful TX.
On STARTRX_TIMEOUT, increment rx_stuck_count and call onRxStuck() each 8s window. When the count reaches the threshold, call onRxUnrecoverable(). After each recovery attempt, reset radio_nonrx_start, prev_isrecv_mode, cad_busy_start, and next_agc_reset_time so the 8s window restarts cleanly. Reset rx_stuck_count to 0 when the radio returns to RX.
Virtual hook API
Defaults are conservative (threshold 3, AGC reset as first attempt,
no-op unrecoverable). Subclasses override to reboot or do a deeper reset.
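The counting rules and hooks described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name, the int packet type, the handle* methods, and the getter names getTxFailThreshold() / getRxFailThreshold() are assumptions made up for the example, and the hook bodies are empty stubs where the real defaults would touch the radio. Only onTxStuck(), onRxStuck(), onRxUnrecoverable(), tx_fail_count, and rx_stuck_count come from the description.

```cpp
#include <cstdint>
#include <deque>

// Illustrative sketch only: the class name, the int packet type, the handler
// names and the threshold getter names are assumptions, not MeshCore's API.
class DispatcherSketch {
public:
  virtual ~DispatcherSketch() = default;

  // Feed in the outcome of one startSendRaw() attempt for `packet`.
  void handleSendResult(int packet, bool sent) {
    if (sent) {
      tx_fail_count = 0;              // any successful TX clears the counter
      return;
    }
    outbound.push_front(packet);      // re-queue instead of dropping
    ++tx_fail_count;
    const uint8_t threshold = getTxFailThreshold();
    if (threshold != 0 && tx_fail_count >= threshold) {  // 0 = disabled
      tx_fail_count = 0;              // assumption: count afresh after a recovery attempt
      onTxStuck();
    }
  }

  // Call once per 8s window in which STARTRX_TIMEOUT is detected.
  void handleStartRxTimeout() {
    ++rx_stuck_count;
    onRxStuck();                      // first-line recovery attempt, every window
    const uint8_t threshold = getRxFailThreshold();
    if (threshold != 0 && rx_stuck_count >= threshold) {
      onRxUnrecoverable();            // escalate once the threshold is reached
    }
  }

  // Call when the radio is observed back in RX mode.
  void handleRadioBackInRx() { rx_stuck_count = 0; }

  std::deque<int> outbound;           // pending packets, front = next to send

protected:
  virtual uint8_t getTxFailThreshold() const { return 3; }
  virtual uint8_t getRxFailThreshold() const { return 3; }
  virtual void onTxStuck() {}         // described default: AGC reset (stubbed here)
  virtual void onRxStuck() {}         // stubbed here
  virtual void onRxUnrecoverable() {} // described default: no-op

private:
  uint32_t tx_fail_count = 0;
  uint32_t rx_stuck_count = 0;
};
```

Whether the TX counter is cleared after onTxStuck() fires is not stated in the description; the sketch clears it so the hook fires once per run of failures rather than on every failure past the threshold.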
NodePrefs / CLI
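Thresholds are persisted and adjustable at runtime via the terminal commands set tx.fail.threshold and set rx.fail.threshold (0 = disabled, range 1-10, default 3).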
CommonCLI stores these at fields 291/292 (next free: 293).
Companion radio stores at its own DataStore fields 90/91 (next free: 92).
Both default to 3 on first boot or upgrade from older firmware.
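As a rough illustration of the validation these settings imply, the sketch below parses the two set commands and accepts only 0 (disabled) or a value from 1 to 10, with both fields starting at 3. The struct and function names are hypothetical, and the real CommonCLI parsing and NodePrefs persistence may look quite different.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical holder for the two persisted thresholds; both default to 3.
struct ThresholdPrefs {
  uint8_t tx_fail_threshold = 3;
  uint8_t rx_fail_threshold = 3;
};

// Applies "set tx.fail.threshold N" or "set rx.fail.threshold N".
// Returns true only if the command is recognised and N is 0 (disabled) or 1-10.
bool applySetCommand(ThresholdPrefs& prefs, const char* cmd) {
  static const char kTx[] = "set tx.fail.threshold ";
  static const char kRx[] = "set rx.fail.threshold ";
  uint8_t* target = nullptr;
  const char* arg = nullptr;

  if (std::strncmp(cmd, kTx, sizeof(kTx) - 1) == 0) {
    target = &prefs.tx_fail_threshold;
    arg = cmd + sizeof(kTx) - 1;
  } else if (std::strncmp(cmd, kRx, sizeof(kRx) - 1) == 0) {
    target = &prefs.rx_fail_threshold;
    arg = cmd + sizeof(kRx) - 1;
  } else {
    return false;                                    // not one of our commands
  }

  char* end = nullptr;
  const long value = std::strtol(arg, &end, 10);
  if (end == arg || *end != '\0') return false;      // missing or non-numeric value
  if (value < 0 || value > 10) return false;         // outside 0..10
  *target = static_cast<uint8_t>(value);
  return true;
}
```

Rejecting out-of-range input, rather than clamping it, is a design choice of this sketch; the description does not say which behavior the firmware uses.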
Example overrides
All four examples override the threshold getters to return the persisted prefs value and implement onRxUnrecoverable() to call board.reboot().
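An override of this shape might look like the following sketch. The base class, prefs struct, and board stub here are stand-ins written for the example, and the getter names are assumed; only onRxUnrecoverable() and board.reboot() are taken from the description.

```cpp
#include <cstdint>

// Stand-in for the persisted preferences (field names are assumptions).
struct NodePrefsSketch {
  uint8_t tx_fail_threshold = 3;
  uint8_t rx_fail_threshold = 3;
};

// Stand-in for the board object; the stub just counts reboot requests.
struct BoardSketch {
  int reboot_calls = 0;
  void reboot() { ++reboot_calls; }
};

// Stand-in for the hook surface exposed by the dispatcher.
class RecoveryHooks {
public:
  virtual ~RecoveryHooks() = default;
  virtual uint8_t getTxFailThreshold() const { return 3; }
  virtual uint8_t getRxFailThreshold() const { return 3; }
  virtual void onRxUnrecoverable() {}  // conservative default: do nothing
};

// Example-style subclass: thresholds come from prefs, unrecoverable RX reboots.
class ExampleMesh : public RecoveryHooks {
public:
  ExampleMesh(NodePrefsSketch& prefs, BoardSketch& board)
      : prefs_(prefs), board_(board) {}

  uint8_t getTxFailThreshold() const override { return prefs_.tx_fail_threshold; }
  uint8_t getRxFailThreshold() const override { return prefs_.rx_fail_threshold; }
  void onRxUnrecoverable() override { board_.reboot(); }

private:
  NodePrefsSketch& prefs_;
  BoardSketch& board_;
};
```

Reading the thresholds from prefs on every call, instead of caching them, means a value changed from the terminal takes effect without a restart in this sketch.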