fix: add zombie radio recovery for TX fail reset and RX stuck reboot by KG7QIN · Pull Request #2151 · meshcore-dev/MeshCore

KG7QIN · 2026-03-25T03:11:44Z

Problem

Two silent failure modes can leave the radio non-functional with no
recovery path:

TX zombie: startSendRaw() fails and the packet is silently
dropped. The node appears to be running, but nothing is transmitted.

RX zombie: The radio leaves RX mode and never returns.
ERR_EVENT_STARTRX_TIMEOUT already detected this after 8 s, but
no recovery action was taken.

Changes

Dispatcher

On startSendRaw() failure, re-queue the packet instead of dropping it.
Increment tx_fail_count on each failure and call onTxStuck() when the
count reaches the configured threshold. Reset to 0 on any successful TX.
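
The TX bookkeeping described above can be sketched roughly as follows. This is a minimal stand-in, not the actual Dispatcher code: TxFailTracker, onSendResult(), and stuck_calls are hypothetical names, and restarting the counter after a recovery attempt is an assumption that the description does not confirm.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Dispatcher's TX-failure bookkeeping.
struct TxFailTracker {
  uint8_t threshold;       // 0 disables the recovery hook
  uint8_t fail_count = 0;  // consecutive startSendRaw() failures
  int stuck_calls = 0;     // how often the onTxStuck() stand-in fired

  explicit TxFailTracker(uint8_t t) : threshold(t) {}

  // Feed in the result of one send attempt.
  // Returns true when the caller should re-queue the packet.
  bool onSendResult(bool started) {
    if (started) {
      fail_count = 0;      // any successful TX clears the counter
      return false;
    }
    if (fail_count < 255) fail_count++;  // saturate instead of wrapping
    if (threshold > 0 && fail_count >= threshold) {
      stuck_calls++;       // stand-in for onTxStuck()
      fail_count = 0;      // assumption: restart counting after a recovery attempt
    }
    return true;           // failed sends are re-queued, not dropped
  }
};
```

With a threshold of 3, the third consecutive failure triggers the hook once, and a later successful send clears the counter.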

On STARTRX_TIMEOUT, increment rx_stuck_count and call onRxStuck()
each 8s window. When the count reaches the threshold, call
onRxUnrecoverable(). After each recovery attempt, reset
radio_nonrx_start, prev_isrecv_mode, cad_busy_start, and
next_agc_reset_time so the 8s window restarts cleanly. Reset
rx_stuck_count to 0 when the radio returns to RX.
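
The RX escalation could look something like the sketch below. RxStuckTracker and its members are hypothetical names used only for illustration. Calling the onRxStuck() stand-in on every window, including the one that reaches the threshold, is an assumption about ordering. The timing-state resets (radio_nonrx_start and friends) are only indicated by a comment.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Dispatcher's RX-stuck escalation.
struct RxStuckTracker {
  uint8_t threshold;            // 0 disables escalation to "unrecoverable"
  uint8_t stuck_count = 0;      // consecutive 8 s windows spent outside RX
  int stuck_calls = 0;          // stand-in for onRxStuck() invocations
  int unrecoverable_calls = 0;  // stand-in for onRxUnrecoverable() invocations

  explicit RxStuckTracker(uint8_t t) : threshold(t) {}

  // Called once per STARTRX_TIMEOUT, i.e. once per 8 s window out of RX.
  void onStartRxTimeout() {
    if (stuck_count < 255) stuck_count++;  // saturate instead of wrapping
    stuck_calls++;                         // first-line recovery attempt
    if (threshold > 0 && stuck_count >= threshold) {
      unrecoverable_calls++;               // escalate after N failed attempts
    }
    // The real code would reset its timing state here so that the
    // next 8 s window starts cleanly.
  }

  // Called when the radio is observed back in RX mode.
  void onBackInRx() { stuck_count = 0; }
};
```

A radio that recovers after one or two AGC resets never reaches the escalation step, because onBackInRx() clears the count.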

Virtual hook API

virtual uint8_t getTxFailResetThreshold() const { return 3; }
virtual void    onTxStuck()                     { _radio->resetAGC(); }
virtual uint8_t getRxFailRebootThreshold() const { return 3; }
virtual void    onRxStuck()                     { _radio->resetAGC(); }
virtual void    onRxUnrecoverable()             { }

Defaults are conservative (threshold 3, AGC reset as first attempt,
no-op unrecoverable). Subclasses override to reboot or do a deeper reset.

NodePrefs / CLI

Thresholds are persisted and adjustable at runtime:

get tx.fail.threshold       -> current value (default 3)
set tx.fail.threshold 5     -> "OK - tx fail reset after 5 failures"
set tx.fail.threshold 0     -> "OK - tx fail reset disabled"

get rx.fail.threshold       -> current value (default 3)
set rx.fail.threshold 5     -> "OK - reboot after 5 rx recovery failures"
set rx.fail.threshold 0     -> "OK - rx fail reboot disabled"

CommonCLI stores these at fields 291/292 (next free: 293).
Companion radio stores at its own DataStore fields 90/91 (next free: 92).
Both default to 3 on first boot or upgrade from older firmware.
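
As an illustration of the set command's behaviour, the sketch below validates the argument and builds the reply strings shown above. setTxFailThreshold() is a hypothetical helper, not the actual CommonCLI code. The accepted range of 0 to 10 follows the commit summary (0 = disabled, range 1-10), and the error text is invented for the example.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical handler for "set tx.fail.threshold <n>".
// On success, writes the new value to `stored` and returns true.
bool setTxFailThreshold(const char* arg, uint8_t& stored, std::string& reply) {
  char* end = nullptr;
  long v = std::strtol(arg, &end, 10);
  // Reject empty input, trailing junk, and values outside 0..10.
  if (end == arg || *end != '\0' || v < 0 || v > 10) {
    reply = "Error: expected 0-10";  // hypothetical error text
    return false;
  }
  stored = static_cast<uint8_t>(v);
  if (v == 0) {
    reply = "OK - tx fail reset disabled";
  } else {
    reply = "OK - tx fail reset after " + std::to_string(v) + " failures";
  }
  return true;
}
```

Rejected input leaves the stored value untouched, so a typo on the terminal cannot silently disable recovery.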

Example overrides

All four examples override the threshold getters to return the persisted
prefs value and implement onRxUnrecoverable() to call board.reboot().
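
An override along these lines might look like the sketch below. It is self-contained, so the base class, radio, board, and prefs are stubs: HookBase, StubRadio, StubBoard, StubPrefs, ExampleMesh, and the prefs field names are all hypothetical. Only the five virtual hook signatures and the board.reboot() call are taken from the description above.

```cpp
#include <cstdint>

// Stubs standing in for the real radio, board, and prefs types.
struct StubRadio { int agc_resets = 0; void resetAGC() { agc_resets++; } };
struct StubBoard { int reboots = 0;    void reboot()   { reboots++; } };
struct StubPrefs {
  uint8_t tx_fail_threshold = 3;  // hypothetical field name
  uint8_t rx_fail_threshold = 3;  // hypothetical field name
};

// Stand-in base class carrying the hook API with its defaults.
class HookBase {
protected:
  StubRadio* _radio;
public:
  explicit HookBase(StubRadio* r) : _radio(r) {}
  virtual ~HookBase() = default;
  virtual uint8_t getTxFailResetThreshold() const  { return 3; }
  virtual void    onTxStuck()                      { _radio->resetAGC(); }
  virtual uint8_t getRxFailRebootThreshold() const { return 3; }
  virtual void    onRxStuck()                      { _radio->resetAGC(); }
  virtual void    onRxUnrecoverable()              { }
};

// Example subclass: thresholds come from persisted prefs, and an
// unrecoverable RX state triggers a reboot.
class ExampleMesh : public HookBase {
  StubPrefs& _prefs;
  StubBoard& _board;
public:
  ExampleMesh(StubRadio* r, StubPrefs& p, StubBoard& b)
      : HookBase(r), _prefs(p), _board(b) {}
  uint8_t getTxFailResetThreshold() const override  { return _prefs.tx_fail_threshold; }
  uint8_t getRxFailRebootThreshold() const override { return _prefs.rx_fail_threshold; }
  void    onRxUnrecoverable() override              { _board.reboot(); }
};
```

Because the getters read the prefs on every call, a threshold changed from the terminal would take effect without a restart in this sketch.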

Dispatcher: re-queue packets on startSendRaw() failure instead of dropping them. After N consecutive TX failures, call onTxStuck() (default: resetAGC). After N failed RX recovery attempts (ERR_EVENT_STARTRX_TIMEOUT), call onRxUnrecoverable() (default: no-op, overridden in examples to soft-reboot). Both thresholds are configurable via terminal (set tx.fail.threshold / set rx.fail.threshold, 0=disabled, range 1-10, default 3) and persisted in NodePrefs. Companion radio uses its own NodePrefs/DataStore at fields 90-91; all other examples use CommonCLI fields 291-292.

KG7QIN · 2026-03-25T03:38:11Z

This is a common problem for bot operators. Radios become zombies that fail on TX or RX, which causes problems for the rest of the mesh.

Until now, the only fix has been to power cycle the radio, either by pressing the reset button or by pulling power.

This PR is an attempt to recover without requiring physical access to the radio.

KG7QIN wants to merge 1 commit into meshcore-dev:dev from KG7QIN:fix/zombie-radio-recovery
