fix: add zombie radio recovery for TX fail reset and RX stuck reboot #2151

Open
KG7QIN wants to merge 1 commit into meshcore-dev:dev from KG7QIN:fix/zombie-radio-recovery


KG7QIN commented Mar 25, 2026

Problem

Two silent failure modes can leave the radio non-functional with no
recovery path:

  1. TX zombie: startSendRaw() fails and the packet is silently
    dropped. The node appears to be running but nothing is transmitted.
  2. RX zombie: The radio leaves RX mode and never returns.
    ERR_EVENT_STARTRX_TIMEOUT already detected this after 8 s, but
    no recovery action was taken.

Changes

Dispatcher

On startSendRaw() failure, re-queue the packet instead of dropping it.
Increment tx_fail_count on each failure and call onTxStuck() when the
count reaches the configured threshold. Reset to 0 on any successful TX.
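
The TX path above can be sketched roughly as follows. This is a minimal
model, not the PR's code: FakeRadio, TxRecoverySketch, the int packet type
and the std::deque queue are hypothetical stand-ins, and resetting the
counter after onTxStuck() fires is an assumption (the description only says
it resets on a successful TX).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the radio driver; only the send result matters.
struct FakeRadio {
    bool send_ok = false;
    int agc_resets = 0;
    bool startSendRaw(int /*packet*/) { return send_ok; }
    void resetAGC() { ++agc_resets; }
};

// Simplified model of the TX failure handling described above.
class TxRecoverySketch {
public:
    explicit TxRecoverySketch(FakeRadio& radio) : _radio(radio) {}
    virtual ~TxRecoverySketch() = default;

    // Hook names and defaults are taken from the PR description.
    virtual uint8_t getTxFailResetThreshold() const { return 3; }
    virtual void onTxStuck() { _radio.resetAGC(); }

    void enqueue(int packet) { _queue.push_back(packet); }
    std::size_t pending() const { return _queue.size(); }
    uint8_t txFailCount() const { return _tx_fail_count; }

    // One send attempt for the packet at the head of the queue.
    void trySendNext() {
        if (_queue.empty()) return;
        int packet = _queue.front();
        _queue.pop_front();

        if (_radio.startSendRaw(packet)) {
            _tx_fail_count = 0;          // any successful TX clears the count
            return;
        }

        _queue.push_front(packet);       // re-queue instead of dropping
        if (_tx_fail_count < 255) ++_tx_fail_count;

        uint8_t threshold = getTxFailResetThreshold();
        if (threshold > 0 && _tx_fail_count >= threshold) {  // 0 = disabled
            onTxStuck();
            _tx_fail_count = 0;          // assumption: start counting afresh
        }
    }

private:
    FakeRadio& _radio;
    std::deque<int> _queue;
    uint8_t _tx_fail_count = 0;
};
```

With the default threshold of 3, two failed attempts leave the packet queued
and the counter at 2; the third failure triggers onTxStuck().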

On ERR_EVENT_STARTRX_TIMEOUT, increment rx_stuck_count and call
onRxStuck() once per 8 s window. When the count reaches the threshold,
call onRxUnrecoverable(). After each recovery attempt, reset
radio_nonrx_start, prev_isrecv_mode, cad_busy_start, and
next_agc_reset_time so the 8 s window restarts cleanly. Reset
rx_stuck_count to 0 when the radio returns to RX.
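
A rough sketch of the RX escalation, under stated assumptions:
RxRecoverySketch, poll() and the _outside_rx flag are hypothetical, only one
of the four timers listed above is modeled, and calling onRxStuck() on the
same window that triggers onRxUnrecoverable() is one possible reading of the
description rather than confirmed behavior.

```cpp
#include <cstdint>

// Simplified model of the RX stuck detection and escalation described above.
class RxRecoverySketch {
public:
    virtual ~RxRecoverySketch() = default;

    // Hook names and the default threshold are taken from the PR description.
    // The hook bodies here only count calls so the flow can be observed.
    virtual uint8_t getRxFailRebootThreshold() const { return 3; }
    virtual void onRxStuck() { ++stuck_calls; }
    virtual void onRxUnrecoverable() { ++unrecoverable_calls; }

    // Called periodically with the current time and the radio's RX state.
    void poll(uint32_t now_ms, bool radio_in_rx) {
        if (radio_in_rx) {
            _outside_rx = false;
            _rx_stuck_count = 0;         // radio recovered, clear the count
            return;
        }
        if (!_outside_rx) {
            _outside_rx = true;
            _nonrx_start = now_ms;       // start timing the non-RX period
            return;
        }
        if (now_ms - _nonrx_start < kStartRxTimeoutMs) return;

        // 8 s outside RX: one recovery attempt per window.
        if (_rx_stuck_count < 255) ++_rx_stuck_count;
        onRxStuck();

        uint8_t threshold = getRxFailRebootThreshold();
        if (threshold > 0 && _rx_stuck_count >= threshold) {  // 0 = disabled
            onRxUnrecoverable();
        }
        _nonrx_start = now_ms;           // restart the 8 s window
    }

    uint8_t rxStuckCount() const { return _rx_stuck_count; }

    int stuck_calls = 0;
    int unrecoverable_calls = 0;

private:
    static constexpr uint32_t kStartRxTimeoutMs = 8000;
    bool _outside_rx = false;
    uint32_t _nonrx_start = 0;
    uint8_t _rx_stuck_count = 0;
};
```

Restarting the window after every attempt is what keeps the escalation at one
step per 8 s instead of firing on every loop iteration.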

Virtual hook API

virtual uint8_t getTxFailResetThreshold() const { return 3; }
virtual void    onTxStuck()                     { _radio->resetAGC(); }
virtual uint8_t getRxFailRebootThreshold() const { return 3; }
virtual void    onRxStuck()                     { _radio->resetAGC(); }
virtual void    onRxUnrecoverable()             { }

Defaults are conservative (threshold 3, AGC reset as first attempt,
no-op unrecoverable). Subclasses override to reboot or do a deeper reset.

NodePrefs / CLI

Thresholds are persisted and adjustable at runtime:

get tx.fail.threshold       -> current value (default 3)
set tx.fail.threshold 5     -> "OK - tx fail reset after 5 failures"
set tx.fail.threshold 0     -> "OK - tx fail reset disabled"

get rx.fail.threshold       -> current value (default 3)
set rx.fail.threshold 5     -> "OK - reboot after 5 rx recovery failures"
set rx.fail.threshold 0     -> "OK - rx fail reboot disabled"

CommonCLI stores these at fields 291/292 (next free: 293).
Companion radio stores at its own DataStore fields 90/91 (next free: 92).
Both default to 3 on first boot or upgrade from older firmware.
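
For illustration, the set handlers could look roughly like the sketch below.
ThresholdPrefs, setTxFailThreshold() and setRxFailThreshold() are
hypothetical names, and the error text is invented; only the OK responses,
the default of 3 and the accepted values (0 to disable, otherwise 1-10, as
stated in the commit message) come from this PR.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical prefs struct; both thresholds default to 3 per the PR.
struct ThresholdPrefs {
    uint8_t tx_fail_threshold = 3;
    uint8_t rx_fail_threshold = 3;
};

// Parses a decimal value in 0..10. Returns false on malformed or
// out-of-range input and leaves `out` untouched in that case.
static bool parseThreshold(const char* arg, uint8_t& out) {
    char* end = nullptr;
    long value = std::strtol(arg, &end, 10);
    if (end == arg || *end != '\0' || value < 0 || value > 10) return false;
    out = static_cast<uint8_t>(value);
    return true;
}

// Sketch of "set tx.fail.threshold <n>".
std::string setTxFailThreshold(ThresholdPrefs& prefs, const char* arg) {
    uint8_t value = 0;
    if (!parseThreshold(arg, value)) return "Error: expected 0-10";  // invented text
    prefs.tx_fail_threshold = value;
    if (value == 0) return "OK - tx fail reset disabled";
    return "OK - tx fail reset after " + std::to_string(value) + " failures";
}

// Sketch of "set rx.fail.threshold <n>".
std::string setRxFailThreshold(ThresholdPrefs& prefs, const char* arg) {
    uint8_t value = 0;
    if (!parseThreshold(arg, value)) return "Error: expected 0-10";  // invented text
    prefs.rx_fail_threshold = value;
    if (value == 0) return "OK - rx fail reboot disabled";
    return "OK - reboot after " + std::to_string(value) + " rx recovery failures";
}
```

Rejected input leaves the stored value unchanged, so a typo on the terminal
cannot silently disable recovery.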

Example overrides

All four examples override the threshold getters to return the persisted
prefs value and implement onRxUnrecoverable() to call board.reboot().
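
An override along those lines might look like the sketch below. The hook
signatures match the virtual hook API above; NodePrefsSketch, BoardSketch,
DispatcherHooks and ExampleMesh are hypothetical stand-ins so the snippet
compiles on its own, and the field names are assumed.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the persisted prefs and the board object.
struct NodePrefsSketch {
    uint8_t tx_fail_threshold = 3;
    uint8_t rx_fail_threshold = 3;
};

struct BoardSketch {
    int reboots = 0;
    void reboot() { ++reboots; }   // a real board would restart the MCU here
};

// Reduced copy of the hook API with its conservative defaults.
class DispatcherHooks {
public:
    virtual ~DispatcherHooks() = default;
    virtual uint8_t getTxFailResetThreshold() const { return 3; }
    virtual uint8_t getRxFailRebootThreshold() const { return 3; }
    virtual void onRxUnrecoverable() {}
};

// Example subclass: thresholds come from prefs, unrecoverable RX reboots.
class ExampleMesh : public DispatcherHooks {
public:
    ExampleMesh(const NodePrefsSketch& prefs, BoardSketch& board)
        : _prefs(prefs), _board(board) {}

    uint8_t getTxFailResetThreshold() const override {
        return _prefs.tx_fail_threshold;
    }
    uint8_t getRxFailRebootThreshold() const override {
        return _prefs.rx_fail_threshold;
    }
    void onRxUnrecoverable() override { _board.reboot(); }

private:
    const NodePrefsSketch& _prefs;  // read live so CLI changes apply at once
    BoardSketch& _board;
};
```

Holding the prefs by reference means a threshold changed from the terminal
takes effect on the next failure without re-creating the mesh object.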

Dispatcher: re-queue packets on startSendRaw() failure instead of
dropping them. After N consecutive TX failures, call onTxStuck()
(default: resetAGC). After N failed RX recovery attempts
(ERR_EVENT_STARTRX_TIMEOUT), call onRxUnrecoverable() (default: no-op,
overridden in examples to soft-reboot). Both thresholds are configurable
via terminal (set tx.fail.threshold / set rx.fail.threshold, 0=disabled,
range 1-10, default 3) and persisted in NodePrefs. Companion radio uses
its own NodePrefs/DataStore at fields 90-91; all other examples use
CommonCLI fields 291-292.

KG7QIN commented Mar 25, 2026

This is a common problem for bot operators. The radios become zombies, failing on TX or RX, and stop doing their job.

Currently the only way to fix this is to power cycle the radio by pressing the reset button or pulling power.

This PR is an attempt at recovering without needing physical access to the radio.
