Skip to content

Conversation

@ewowi
Copy link
Collaborator

@ewowi ewowi commented Jan 24, 2026

Summary by CodeRabbit

  • New Features

    • Network messages now include version, uptime, package size, lighting state (on/brightness/palette/preset) and a control flag for remote commands.
  • Refactor

    • Devices integrated with lights control and app lifecycle; introduced origin-aware update flow and faster periodic loops (20ms, 10s) for state sync.
  • UI

    • "Devices" menu visibility now follows the MoonLight feature toggle.
  • Chores

    • Minor logging, comments, and config adjustments; build date updated.

✏️ Tip: You can customize this high-level summary in your review settings.

backend
=======
- main: moduleLightsControl added as argument to ModuleDevices
- Devices: extend UDPMessage with version, uptime, brightness, palette, preset, add LightsControl->addUpdateHandler sending UDP message when something changes, updateDevices: send new vars
@coderabbitai
Copy link

coderabbitai bot commented Jan 24, 2026

Walkthrough

ModuleDevices gains richer UDPMessage fields and a Module* control dependency; UDP receive/send now route control vs status and update devices/UI. ModuleState/Module update APIs were made origin-aware (originId) and multiple modules’ onUpdate signatures updated; ModuleDevices integrated into main with 20ms/10s loops.

Changes

Cohort / File(s) Summary
Device State & API
src/MoonBase/Modules/ModuleDevices.h
Expanded UDPMessage with version, uptime, packageSize, lightsOn, brightness, palette, preset, isControlCommand; added Module* _moduleControl; new ctor ModuleDevices(..., Module* moduleControl); updateDevices now takes const UDPMessage& and IPAddress.
UDP I/O & Logic
src/MoonBase/Modules/...ModuleDevices.*
Replaced readUDP/writeUDP with receiveUDP/sendUDP; receiveUDP validates/parses UDPMessage, skips own broadcasts, and either applies control to _moduleControl or calls updateDevices(message, ip); sendUDP builds message (host/version/uptime/state) and can include device updates.
Origin-aware update plumbing
src/MoonBase/Module.h, src/MoonBase/Module.cpp
Introduced mutexedUpdatedItem, mutexedOriginId, mutexedUpdatePending; threaded originId through compareRecursive, checkReOrderSwap, postUpdate, processUpdatedItem, and onUpdate.
Module integrations & loops
src/main.cpp
Construct ModuleLightsControl before ModuleDevices; instantiate ModuleDevices with ModuleLightsControl*; register moduleDevices under MoonLight; added loop20ms() and loop10s() calls for moduleDevices; adjusted initialization ordering.
Module onUpdate signatures
src/MoonBase/Modules/ModuleIO.h, src/MoonBase/NodeManager.h, src/MoonLight/Modules/ModuleChannels.h, src/MoonLight/Modules/ModuleLiveScripts.h, src/MoonLight/Modules/ModuleLightsControl.h
Updated many onUpdate overrides to onUpdate(const UpdatedItem& updatedItem, const String& originId); ModuleIO uses originId to gate server-origin updates and logging.
UI toggle
interface/src/routes/menu.svelte
Changed Devices submenu feature toggle from page.data.features.moonbase to page.data.features.moonlight.
Config / CI
.coderabbit.yaml
Replaced flat review/context keys with early_access and consolidated reviews block; added tools.github-checks.timeout_ms and chat.auto_reply.
Minor
lib/framework/StatefulService.h, platformio.ini
Silenced a 100ms ESP_LOGI line; updated APP_DATE build flag to 20260126.

Sequence Diagram

sequenceDiagram
    participant UI as UI/REST
    participant MC as ModuleLightsControl
    participant MD as ModuleDevices
    participant NET as UDP Network

    MC->>MD: local state update (brightness,palette,preset,lightsOn) with originId
    activate MD
    MD->>MD: build UDPMessage (hostname,version,uptime,packageSize,lightsOn,brightness,palette,preset,isControlCommand=false)
    MD->>NET: sendUDP(broadcast/target)
    MD->>UI: optionally emit device state update
    deactivate MD

    NET->>MD: incoming UDPMessage (remote IP)
    activate MD
    MD->>MD: receiveUDP -> parse & validate UDPMessage (skip own)
    alt message.isControlCommand == true
        MD->>MC: apply control update (use originId "udp")
    else
        MD->>MD: updateDevices(message, remoteIP)
        MD->>UI: update device entry (name,version,uptime,packageSize,lightsOn,brightness,palette,preset)
    end
    deactivate MD
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • prep71 #106 — Overlaps ModuleState/Module update machinery and mutexed origin-aware handling.
  • Prepare for 0.7.1 #94 — Touches origin propagation and onUpdate signatures across modules.
  • V070 preps #79 — Modifies ModuleLightsControl (onUpdate and related read/preset logic), overlapping interfaces used by ModuleDevices.

Suggested labels

enhancement

Poem

🐰 I hop on packets, tail a-flutter,
I count uptimes and prune a clutter,
presets hop in, brightness in tune,
control or status — I hum with the moon,
constructors snug, the network's butter.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title "Super sync part 1" is vague and does not clearly convey the main technical changes in the PR, such as UDP synchronization improvements, device control messages, or origin tracking. Revise the title to be more specific and descriptive of the primary changes, such as "Add UDP device control and origin tracking for device synchronization" or "Implement device-to-device UDP control with origin ID tracking".
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch supersync-part1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 47-54: The code is assigning the whole JSON object
state.data["preset"] to message.preset (and logging it), which when cast to
uint8_t becomes 0; instead extract the selected preset field (e.g., the numeric
property like "selected" or "value") from the JSON object and assign that to
message.preset, and log that numeric value rather than serializing the whole
object; update the lambda that handles ModuleState (references: ModuleState,
state.data["preset"], message.preset, serializeJson, EXT_LOGD, MB_TAG) to read
the correct subfield and convert it to uint8_t before assignment and logging
(apply the same change to the other occurrences around lines 188-191).
- Line 58: The assignment to message.uptime currently computes time(nullptr) -
millis()/1000 which yields the boot timestamp and can underflow; update the two
locations where message.uptime is set (the message.uptime assignments in
ModuleDevices.h) to assign uptime seconds directly as millis()/1000 if you
intend to store uptime, or if you intended to store the boot epoch rename the
field (e.g., bootTime) and then compute boot time as time(nullptr) -
(millis()/1000); also ensure any consumer code uses the new semantics/field
name.
- Around line 169-175: The UDPMessage struct is partially uninitialized when
packetSize < sizeof(UDPMessage); before memcpy into UDPMessage (symbol: message)
zero-initialize it (e.g., memset(&message, 0, sizeof(message)) or
value-initialize UDPMessage message) so any trailing fields like
version/brightness/palette/preset are deterministic, then perform deviceUDP.read
into buffer and memcpy(&message, buffer, min(packetSize, sizeof(message))).
Ensure you still use ::min(packetSize, sizeof(message)) for the copy length.

_moduleName);

EXT_LOGD(MB_TAG, "b: %d pa %d pr %d", message.brightness, message.palette, message.preset);
message.uptime = time(nullptr) - millis() / 1000;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Uptime calculation returns boot time (and can underflow).

time(nullptr) - millis()/1000 yields the boot timestamp, not uptime, and can underflow before time is set. If you want uptime seconds, use millis()/1000; if you want boot time, rename the field.

🔧 Suggested fix
-        message.uptime = time(nullptr) - millis() / 1000;
+        message.uptime = millis() / 1000;
-      message.uptime = time(nullptr) - millis() / 1000;
+      message.uptime = millis() / 1000;

Also applies to: 194-194

🤖 Prompt for AI Agents
In `@src/MoonBase/Modules/ModuleDevices.h` at line 58, The assignment to
message.uptime currently computes time(nullptr) - millis()/1000 which yields the
boot timestamp and can underflow; update the two locations where message.uptime
is set (the message.uptime assignments in ModuleDevices.h) to assign uptime
seconds directly as millis()/1000 if you intend to store uptime, or if you
intended to store the boot epoch rename the field (e.g., bootTime) and then
compute boot time as time(nullptr) - (millis()/1000); also ensure any consumer
code uses the new semantics/field name.

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 24, 2026

@coderabbitai, for some reason the package send in _moduleLightsControl->addUpdateHandler arrives very late (in the next 10s update?) in another device, which calls readUDP(), calling updateDevices. The 10s message (mostly... packages missed?)
Why is that? Could it be that loop1s() (every second) misses packages?

@coderabbitai
Copy link

coderabbitai bot commented Jan 24, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Find the ModuleDevices implementation file
fd -e cpp -e h "ModuleDevices" --exec cat {}

echo "=== Separator: Looking for loop1s and loop10s implementations ==="

# Find loop1s and loop10s implementations
rg -n -A 20 "void.*loop1s\(\)" --type cpp

echo "=== Separator: Looking for loop10s ==="

rg -n -A 20 "void.*loop10s\(\)" --type cpp

echo "=== Separator: Looking for UDP receive buffer or socket setup ==="

# Look for UDP setup and buffer configuration
rg -n -C 5 "UDP|udp" src/MoonBase/Modules/ModuleDevices.h src/MoonBase/Modules/ModuleDevices.cpp

Length of output: 24698


🏁 Script executed:

#!/bin/bash

# Find where loop1s and loop10s are called from the main loop
rg -n -B 5 -A 5 "loop1s\(\)|loop10s\(\)" --type cpp | head -80

echo "=== Separator: Check NetworkUDP buffer behavior ==="

# Search for NetworkUDP definition or WiFiUDP usage
rg -n "class NetworkUDP|typedef.*NetworkUDP|WiFiUDP" --type cpp --type-add 'h:*.h'

Length of output: 4693


🏁 Script executed:

#!/bin/bash

# Find NetworkUDP definition
rg -n "NetworkUDP" src/ --type-add 'h:*.h' -A 2 -B 2

echo "=== Separator: Check Utilities.h for NetworkUDP ==="

# Check Utilities.h which is included in ModuleDevices.h
cat src/MoonBase/Utilities.h 2>/dev/null || echo "File not found"

echo "=== Separator: Search in all files ==="

fd -e h Utilities.h --exec cat {}

Length of output: 23214


🏁 Script executed:

#!/bin/bash

# Search for NetworkUDP definition or typedef
rg -n "typedef.*NetworkUDP|define.*NetworkUDP|using NetworkUDP" --type-add 'h:*.h'

echo "=== Separator: Search in lib or platformio dependencies ==="

# Check if there's a lib folder with ESP32SvelteKit
find . -name "*.h" -type f | xargs rg -l "NetworkUDP" | head -10

echo "=== Separator: Look for ESP32SvelteKit or ESPFS files ==="

# Find ESPFS.h which is included
fd ESPFS.h --exec cat {} | head -100

Length of output: 766


🏁 Script executed:

#!/bin/bash

# Search more broadly for NetworkUDP or WiFiUDP
rg -n "WiFiUDP|EthernetUDP" --type cpp --type-add 'h:*.h' -B 2 -A 2

echo "=== Separator: Check platformio.ini for dependencies ==="

cat platformio.ini 2>/dev/null | head -50

echo "=== Separator: Look for lib_deps or framework ==="

rg -n "lib_deps|framework.*=" platformio.ini 2>/dev/null | head -20

Length of output: 4475


🏁 Script executed:

#!/bin/bash

# Search for where NetworkUDP might be defined - could be in ESP32SvelteKit library
rg -n "NetworkUDP" -B 5 -A 5 --type-add 'h:*.h'

echo "=== Separator: Check if it's WiFi or Ethernet based ==="

# Look for WiFi and ETH usage in context
rg -n "WiFi\.|ETH\." src/MoonBase/Modules/ModuleDevices.h -C 3

Length of output: 2683


@ewowi, I found the issue! The problem is in the readUDP() function (lines 168-179). It only reads one UDP packet per call, but it's called only once per second in loop1s().

Root Cause

When _moduleLightsControl->addUpdateHandler fires (lines 41-66), it immediately broadcasts a UDP packet. If brightness, palette, or preset change multiple times within a second (or even across a few seconds), multiple UDP packets are sent and queue up in the receiving device's UDP buffer.

However, readUDP() only processes one packet per second:

void readUDP() {
    size_t packetSize = deviceUDP.parsePacket();  // ← Only called ONCE
    if (packetSize >= 38) {
      // ... process one packet
      updateDevices(message, deviceUDP.remoteIP());
    }
}

If 10 packets are queued, it takes 10 seconds to drain them all at 1 packet/second, which explains the "next 10s update" delay you're seeing.

Solution

Drain all available packets in readUDP() by using a while loop:

void readUDP() {
  // Process all available UDP packets in the buffer
  while (size_t packetSize = deviceUDP.parsePacket()) {
    if (packetSize >= 38) {  // WLED has 44, MM had ! 38
      char buffer[packetSize];
      UDPMessage message;
      deviceUDP.read(buffer, packetSize);
      memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
      // EXT_LOGD(ML_TAG, "UDP packet read from %d: %s (%d)", deviceUDP.remoteIP()[3], buffer + 6, packetSize);

      updateDevices(message, deviceUDP.remoteIP());
    }
  }
}

This will process all queued packets immediately instead of one per second, eliminating the delay.

backend
=======
- main: add last20ms and use for devices
- devices: loop20 for direct response, readudp loops over received packets, use preset.selected
@ewowi
Copy link
Collaborator Author

ewowi commented Jan 24, 2026

@coderabbitai, can you review latest commit: ced9a76

@coderabbitai
Copy link

coderabbitai bot commented Jan 24, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the specific commit details
git show ced9a76e876ae0ffe6f2952e1d45f804f9aaffef --stat

echo "=== Separator: Full diff of the commit ==="

# Get the full diff of the commit
git show ced9a76e876ae0ffe6f2952e1d45f804f9aaffef

Length of output: 7649


@ewowi, excellent implementation! 🎉 The commit addresses the UDP delay issue perfectly and adds significant improvements. Here's my review:

✅ Core Fix: readUDP() While Loop

Perfect implementation of the suggested fix:

while (size_t packetSize = deviceUDP.parsePacket()) {
  if (packetSize >= 38) {
    // ... process packet
  }
}

This drains all queued packets immediately instead of one per call, eliminating the backlog delay.

🚀 Major Improvement: 20ms Response Time

Renaming loop1s()loop20ms() and calling it every 20ms is brilliant! This improves UDP responsiveness by 50x (from 1000ms to 20ms). Combined with the while-loop, this means:

  • Before: 1 packet/second → 10 packets took 10 seconds
  • After: All packets drained every 20ms → 10 packets processed in ~20ms

📝 Other Improvements

  1. Zero-initialization: UDPMessage message{}; prevents garbage data in uninitialized fields ✓

  2. Correct preset field: Using state.data["preset"]["selected"] instead of state.data["preset"]

  3. Platform abstraction: pal::millis() instead of millis() for better portability ✓

  4. Uptime safety: time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000 handles cases where NTP time isn't set yet ✓

⚠️ Minor Considerations

  1. Stack allocation: char buffer[packetSize]; allocates on stack inside the loop. For typical UDP packets (<1500 bytes), this is fine, but if you ever receive unusually large packets, consider a static buffer or heap allocation.

  2. Loop duration: If packets arrive faster than 20ms intervals, the while-loop could process many packets in one call. This is generally fine, but if you ever see watchdog issues, you could add a packet count limit or taskYIELD() inside the loop.

  3. Nested ticker structure: The 20ms → 1s → 10s nesting works but is unconventional. Consider flattening to independent counters if you add more timing tiers.

Overall, this is a solid fix that should eliminate the 10-second delays completely! The 20ms response time is a huge win for device synchronization.


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-15T12:38:33.598Z
Learning: In MoonLight GameOfLifeEffect (src/MoonLight/Nodes/Effects/E_MoonModules.h): The effect's nested loops (especially lines 257-323) iterate through all cells (e.g., 128×128 = 16,384) with nested neighbor checks (18 per cell = 294,912 operations) without yielding. This starves effectTask (priority 3) when higher-priority tasks like HTTP server (priority 5) run, causing the effect to disappear. UI actions (save, refresh) temporarily show one frame because HTTP completes and briefly releases CPU back to effectTask. Solution: Add taskYIELD() inside outer X loop to prevent task starvation. This is the same root cause as issue `#102` (Starsky glitches with many driver nodes).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T17:07:33.344Z
Learning: FastLED library does not support direct 4-byte-per-pixel (CRGBW) buffers. It always expects CRGB (3-byte) arrays and internally generates the W channel from RGB using .setRgbw() API. Casting a 4-byte RGBW buffer to CRGB* causes misalignment (W becomes next pixel's R). For MoonLight's FastLED driver with channelsPerLight=4: must either (1) repack RGB in-place before FastLED.show() then restore after, (2) use driver-specific channelsPerLight=3 layout, or (3) allocate temporary RGB-only buffer. Option 1 (in-place repacking) is best for ESP32-D0 memory constraints.

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 25, 2026

@coderabbitai, in ModuleDevices, I extended the struct UDPMessage as follows:

struct UDPMessage {
uint8_t rommel[6];
Char<32> name;
Char<32> version;
uint32_t uptime;
uint16_t packageSize;
bool lightsOn;
uint8_t brightness;
uint8_t palette;
uint8_t preset;
};

(packageSize and lightsOn added).

If I flash this to 2 devices, an esp32-p4 and an esp32-s3, the p4 keeps running fine, but the s3 is not updating the UI showing the devices anymore (but it receives UDP packages). There are about 15 devices running in the network, which are updated on refreshing the screen, but not after that via its web socket connection. It works fine without packageSize and lightsOn added. On the browser I get the following error repeatedly - which is only 'occassionally' on the p4. I am not sure if it is because the p4 is faster and bigger ... so do not focus only on that - it might just be a coincidence it is a P4.

In general, I see this BUFFER_SHORTAGE message in the browser log for a long time, so if you can get to the bottom why this is and maybe we can avoid it knowing what causes this.

[Error] Error: BUFFER_SHORTAGE — bundle.js:3:9744
forEach
(anonymous function) (bundle.js:26:2701)
[Log] getState new module loaded – "devices" (bundle.js, line 38)
[Log] socketOn – "devices" (bundle.js, line 38)
[Log] getState – "/rest/devices" (bundle.js, line 38)
[Log] onMount Monitor (bundle.js, line 26)
[Log] get data – "/rest/devices" (bundle.js, line 38)
[Error] Error: BUFFER_SHORTAGE — bundle.js:3:9744
forEach
(anonymous function) (bundle.js:26:2701)
[Log] Monitor.handleChannels – Uint8Array [135, 12, 117, …] (6000) (bundle.js, line 26)
Uint8Array [135, 12, 117, 0, 0, 0, 0, 0, 0, 0, …]Uint8Array (6000)
[Log] Monitor.handleHeader – Uint8Array [10, 0, 0, …] (47) (bundle.js, line 26)
Uint8Array [10, 0, 0, 0, 10, 0, 0, 0, 10, 0, …]Uint8Array (47)
[Log] Monitor.handleHeader – 10 – 10 – 10 – 1000 – 6 – 0 – 6000 (bundle.js, line 26)
[Log] requestLayout submitted (bundle.js, line 26)
[Log] requestLayout received – {} (bundle.js, line 26)
[Log] Monitor.handlePositions – Uint8Array [0, 0, 0, …] (3000) (bundle.js, line 26)
Uint8Array [0, 0, 0, 0, 0, 1, 0, 0, 2, 0, …]Uint8Array (3000)
[Error] Error: BUFFER_SHORTAGE — bundle.js:3:9744
forEach
(anonymous function) (bundle.js:26:2701)

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Check UDPMessage struct definition and packing ==="
rg -n "struct UDPMessage" -A 15 -B 2 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 2. Check for struct packing directives ==="
rg -n "pragma pack|__attribute__.*packed|PACK_STRUCT" src/MoonBase/Modules/ModuleDevices.h src/MoonBase/Modules/ModuleDevices.cpp

echo -e "\n=== 3. Find readUDP and writeUDP memcpy usage ==="
rg -n "memcpy.*message|memcpy.*UDPMessage" -B 3 -A 3 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 4. Check how device state changes trigger WebSocket updates ==="
rg -n "stateChanged.*devices|emit.*devices|ws.*devices" -B 2 -A 5 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 5. Find WebSocket buffer configuration or BUFFER_SHORTAGE ==="
rg -n "BUFFER_SHORTAGE|WS_MAX|setBuffer|send.*async" src/

echo -e "\n=== 6. Check updateDevices function to see how often state changes ==="
rg -n "void updateDevices" -A 40 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 7. Check sizeof UDPMessage usage ==="
rg -n "sizeof.*UDPMessage|sizeof.*message" src/MoonBase/Modules/ModuleDevices.h

Length of output: 3063


🏁 Script executed:

#!/bin/bash

echo "=== 1. Find WebSocket state change emission after updateDevices ==="
rg -n "stateChanged|emitEvent|socket.*emit" -B 2 -A 3 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 2. Check Module base class for state emission mechanism ==="
rg -n "class Module" -A 30 src/MoonBase/Module.h | head -60

echo -e "\n=== 3. Look for struct alignment or packing issues ==="
# Check if there are any alignment pragmas or attributes
rg -n "alignas|aligned|pack\(|__packed" src/MoonBase/

echo -e "\n=== 4. Calculate struct sizes and check alignment ==="
# Use Python to calculate expected struct sizes with different alignments
python3 << 'PYPYTHON'
import struct

# Simulate UDPMessage layout
print("=== UDPMessage size analysis ===")
print("rommel[6]: 6 bytes")
print("Char<32> name: 32 bytes")
print("Char<32> version: 32 bytes")
print("uint32_t uptime: 4 bytes")
print("uint16_t packageSize: 2 bytes")
print("bool lightsOn: 1 byte")
print("uint8_t brightness: 1 byte")
print("uint8_t palette: 1 byte")
print("uint8_t preset: 1 byte")
print()
print("Natural size (no padding): 6+32+32+4+2+1+1+1+1 = 80 bytes")
print()
print("With typical padding (4-byte alignment):")
print("  rommel[6] + padding[2] = 8")
print("  name[32] = 32")
print("  version[32] = 32")
print("  uptime (4-aligned) = 4")
print("  packageSize (2) + lightsOn (1) + brightness (1) = 4 (packed)")
print("  palette (1) + preset (1) + padding[2] = 4")
print("  Total = 84 bytes")
print()
print("OLD struct (without packageSize, lightsOn):")
print("  6+32+32+4+1+1+1 = 77 bytes → with padding likely 80 bytes")

PYPYTHON

echo -e "\n=== 5. Find PsychicHttp WebSocket buffer configuration ==="
rg -n "CONFIG_WS|WS.*BUFFER|maxFrameSize|queueSize" platformio.ini

echo -e "\n=== 6. Check for state change debouncing or rate limiting ==="
rg -n "debounce|throttle|lastUpdate|updateInterval" src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 7. Look for how _state triggers WebSocket emission ==="
rg -n "_state.*data\[|stateChanged|changed.*true" -B 2 -A 2 src/MoonBase/Modules/ModuleDevices.h | head -80

Length of output: 3727


🏁 Script executed:

#!/bin/bash

echo "=== 1. Find StatefulService update mechanism and WebSocket emission ==="
rg -n "class StatefulService" -A 50 src/

echo -e "\n=== 2. Look for WebSocket emit/send in StatefulService ==="
rg -n "emit|send.*client|broadcast" src/MoonBase/StatefulService.h

echo -e "\n=== 3. Check if updateDevices triggers state change notification ==="
rg -n "void updateDevices" -A 60 src/MoonBase/Modules/ModuleDevices.h | grep -E "update\(|CHANGED|emit"

echo -e "\n=== 4. Find where devices state changes are committed ==="
# The key is whether updateDevices modifies _state.data directly or uses update()
rg -n "device\[\"time\"\]|device\[\"name\"\]" -B 5 -A 10 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 5. Look for ESP32 WebSocket buffer config in sdkconfig or build flags ==="
rg -n "build_flags|CONFIG_HTTPD" platformio.ini | head -40

echo -e "\n=== 6. Check for mixed firmware version handling ==="
# Check if readUDP handles different packet sizes gracefully
rg -n "parsePacket|packetSize" -B 2 -A 8 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 7. Search for BUFFER or queue settings in dependencies ==="
fd -e h -e cpp . lib/ --exec rg -l "BUFFER_SHORTAGE|maxFrameSize|queueSize" {} 2>/dev/null | head -10

Length of output: 2156


🏁 Script executed:

#!/bin/bash

echo "=== 1. Find StatefulService implementation in lib folder ==="
fd -e h StatefulService lib/ --exec cat {}

echo -e "\n=== 2. Check ESP32SvelteKit for WebSocket implementation ==="
fd -e h -e cpp ESP32SvelteKit lib/ --exec rg -l "emit|broadcast|send.*client" {} | head -5

echo -e "\n=== 3. Look at the complete updateDevices function to find update() call ==="
rg -n "void updateDevices" -A 75 src/MoonBase/Modules/ModuleDevices.h

echo -e "\n=== 4. Check for WebSocket async buffer settings in lib dependencies ==="
fd . lib/ -e h --exec rg -n "CONFIG_ASYNC|WS.*BUFFER|MAX.*FRAME|queue.*size" {} 2>/dev/null | head -20

echo -e "\n=== 5. Find how often loop20ms is called and if there's rate limiting ==="
rg -n "loop20ms|last20ms" src/MoonBase/Modules/ModuleDevices.h src/main.cpp

echo -e "\n=== 6. Check if there's any existing debounce logic in Module base ==="
rg -n "class Module" -A 100 src/MoonBase/Module.h | grep -E "debounce|throttle|last.*update|pending"

Length of output: 10886


@ewowi, I found the root cause! It's a WebSocket buffer overflow issue compounded by a struct size mismatch between old and new firmware versions. Here's what's happening:

🔴 Root Cause: WebSocket Flooding

Every UDP packet triggers a full WebSocket broadcast of all 15 devices (line 166 in updateDevices()):

void updateDevices(const UDPMessage& message, IPAddress ip) {
  // ... modify device in _state.data ...
  
  update(controls, ModuleState::update, String(_moduleName) + "server");  // ← Triggers WS broadcast
}

With 15 devices in your network and loop20ms() running every 20ms:

  • Potential update rate: 15 devices × 50 updates/second = 750 WebSocket messages/second
  • Each message contains the entire device list (all 15 devices)
  • ESP32-S3 has less RAM and slower CPU than P4, so its WebSocket send buffer fills up faster
  • BUFFER_SHORTAGE = ESP32's httpd_ws_send_frame_async() buffer is full (similar to the WebSocket monitor issue from learning 7b359365)

🟡 Secondary Issue: Struct Size Mismatch

You have 13 old devices (sending 77-80 byte packets) and 2 new devices (expecting 80-84 bytes):

# Old struct: 6+32+32+4+1+1+1 = 77 bytes → padded to ~80
# New struct: 6+32+32+4+2+1+1+1+1 = 80 bytes → padded to ~84

When a new device reads an old packet:

memcpy(&message, buffer, ::min(packetSize, sizeof(message)));  // ← 77 vs 84 bytes

The packageSize and lightsOn fields contain garbage data from uninitialized memory, which might:

  • Serialize incorrectly to JSON
  • Trigger unexpected code paths
  • Corrupt adjacent memory on S3 (more sensitive than P4)

✅ Solutions

1️⃣ Add Rate Limiting to updateDevices (Essential)

Only send WebSocket updates at most once per second per device:

void updateDevices(const UDPMessage& message, IPAddress ip) {
    // ... existing code to update device ...
    
    device["time"] = time(nullptr);
    device["name"] = message.name;
    device["version"] = message.version;
    device["uptime"] = message.uptime;
    device["brightness"] = message.brightness;
    device["palette"] = message.palette;
    device["preset"] = message.preset;
    
    if (!_socket->getConnectedClients()) return;
    
    // ✅ NEW: Rate limit WebSocket updates
    static time_t lastWsUpdate = 0;
    if (time(nullptr) - lastWsUpdate < 1) {  // Max 1 update/second
        return;  // Skip WebSocket emission
    }
    lastWsUpdate = time(nullptr);
    
    // ... existing sort and WebSocket emission code ...
    update(controls, ModuleState::update, String(_moduleName) + "server");
}

This reduces WebSocket updates from 750/second → 1/second, eliminating buffer overflow.

2️⃣ Fix Struct Alignment (Critical for Mixed Firmware)

Add packing directive to ensure consistent size across all devices:

struct UDPMessage {
  uint8_t rommel[6];
  Char<32> name;
  Char<32> version;
  uint32_t uptime;
  uint16_t packageSize;
  bool lightsOn;
  uint8_t brightness;
  uint8_t palette;
  uint8_t preset;
} __attribute__((packed));  // ✅ Force no padding

And zero-initialize to prevent garbage:

void readUDP() {
  while (size_t packetSize = deviceUDP.parsePacket()) {
    if (packetSize >= 38) {
      char buffer[packetSize];
      UDPMessage message{};  // ✅ Zero-initialize
      deviceUDP.read(buffer, packetSize);
      memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
      
      updateDevices(message, deviceUDP.remoteIP());
    }
  }
}

3️⃣ Validate Packet Size (Backward Compatibility)

Check for old vs new firmware:

if (packetSize >= 77 && packetSize <= sizeof(UDPMessage)) {  // Accept old and new
  // ... process packet ...
}

🔍 Why S3 Fails but P4 Works

ESP32-S3 ESP32-P4
~512KB RAM ~16MB RAM
Slower CPU Faster CPU
Buffer fills faster Larger buffers + faster processing
BUFFER_SHORTAGE Occasional BUFFER_SHORTAGE

The P4's extra RAM and speed mask the flooding issue, but both devices are being overwhelmed by 750 WebSocket messages/second.

📊 BUFFER_SHORTAGE Explanation

This error is long-standing because even without your struct changes, the system was close to the threshold. The 3 extra bytes per device pushed it over the edge:

  • 15 devices × 3 bytes = 45 extra bytes per message
  • 750 messages/second × 45 bytes = 33KB/second extra bandwidth

The solution is rate limiting, not increasing buffer sizes.


TL;DR: Add the rate limiting code (solution 1) and struct packing (solution 2), then reflash all devices. The UI freeze will disappear. 🎯


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T17:07:33.344Z
Learning: FastLED library does not support direct 4-byte-per-pixel (CRGBW) buffers. It always expects CRGB (3-byte) arrays and internally generates the W channel from RGB using .setRgbw() API. Casting a 4-byte RGBW buffer to CRGB* causes misalignment (W becomes next pixel's R). For MoonLight's FastLED driver with channelsPerLight=4: must either (1) repack RGB in-place before FastLED.show() then restore after, (2) use driver-specific channelsPerLight=3 layout, or (3) allocate temporary RGB-only buffer. Option 1 (in-place repacking) is best for ESP32-D0 memory constraints.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 25, 2026

@coderabbitai, I increased the buffer length to 200 and now also the P4 protests, I use chrome to monitor the P4 (S3 on safari). Chrome shows now:
(bundle.js:26 WebSocket connection to 'ws://ml-nano2.local/ws/events' failed: WebSocket is closed before the connection is established.)
Is that the same issue?

Regarding:

Potential update rate: 15 devices × 50 updates/second = 750 WebSocket messages/second. Each message contains the entire device list (all 15 devices)

it's not 50 updates per second the updateDevices is called, 50 times per second parsepacket is checked but only when new packages arrive they will be processed, that's every 10s per device, unless a device changes something in lights control (brightness, presets etc). Can you verify this statement and redo your math and conclusions

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Find where buffer length was set to 200 ==="
rg -n "200|buffer.*length|length.*200" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 3

echo -e "\n=== 2. Verify writeUDP is called from loop10s (every 10 seconds) ==="
rg -n "void.*loop10s|writeUDP" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 15

echo -e "\n=== 3. Check when addUpdateHandler fires (brightness/palette/preset changes) ==="
rg -n "addUpdateHandler" src/MoonBase/Modules/ModuleDevices.h -B 5 -A 30

echo -e "\n=== 4. Look for WebSocket configuration and buffer settings ==="
rg -n "WebSocket|ws.*buffer|WS_MAX|maxFrame|queueSize" platformio.ini src/MoonBase/ -B 2 -A 2

echo -e "\n=== 5. Check StatefulService update() for WebSocket send logic ==="
fd StatefulService.h lib/ --exec cat {} | head -200

echo -e "\n=== 6. Calculate actual struct size with __attribute__((packed)) ==="
python3 << 'PYPYTHON'
print("=== UDPMessage struct size ===")
print("Without packed:")
print("  uint8_t rommel[6]      : offset 0, size 6")
print("  Char<32> name          : offset 8 (padded), size 32")
print("  Char<32> version       : offset 40, size 32")
print("  uint32_t uptime        : offset 72, size 4")
print("  uint16_t packageSize   : offset 76, size 2")
print("  bool lightsOn          : offset 78, size 1")
print("  uint8_t brightness     : offset 79, size 1")
print("  uint8_t palette        : offset 80, size 1")
print("  uint8_t preset         : offset 81, size 1")
print("  Total with padding     : 84 bytes")
print()
print("With __attribute__((packed)):")
print("  Total: 6+32+32+4+2+1+1+1+1 = 80 bytes (no padding)")
print()
print("Old struct (before packageSize/lightsOn): 77 bytes natural, ~80 with padding")
PYPYTHON

echo -e "\n=== 7. Search for WebSocket error handling and close events ==="
rg -n "ws.*close|WebSocket.*close|onClose|_socket.*close" src/MoonBase/ -B 2 -A 5

echo -e "\n=== 8. Check if there's a char buffer[200] in readUDP ==="
rg -n "char buffer\[|uint8_t buffer\[" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 2

Length of output: 17993


@ewowi, I apologize for my incorrect calculation! You're absolutely right. Let me correct my analysis:

✅ Corrected Update Frequency

You're correct that parsePacket() is checked 50 times/second, but updateDevices() is only called when packets actually arrive:

  • Base rate: 15 devices × loop10s (0.1 Hz) = ~1.5 updates/second
  • Plus interactive changes: When users change brightness/palette/preset
  • NOT 750 updates/second as I mistakenly claimed

So WebSocket flooding is not the primary issue here. The actual problem is different.

🔴 Real Root Cause: Buffer Overflow

When you changed line 172 from:

char buffer[packetSize];  // VLA (Variable Length Array)

to:

char buffer[200];  // Fixed size

If packetSize > 200, the following line causes a buffer overflow:

deviceUDP.read(buffer, packetSize);  // ← Reads packetSize bytes into 200-byte buffer!

This can allow "arbitrary code execution or result in stack exhaustion". Stack-allocated buffers on ESP32 can cause "nasty crashes" when too large, and tasks "only have a limited amount of stack available".

The stack corruption from buffer overflow crashes the WebSocket handler, causing:

WebSocket connection to 'ws://ml-nano2.local/ws/events' failed: 
WebSocket is closed before the connection is established.

🟡 Why BUFFER_SHORTAGE Still Occurs

The BUFFER_SHORTAGE you've seen "for a long time" is likely unrelated to the struct size change. Possible causes:

  1. Large JSON payloads: Each WebSocket update sends all 15 devices' data
  2. Struct size mismatch: Old devices (77-80 bytes) vs. new devices (80-84 bytes) cause incorrect deserialization, which may produce malformed JSON
  3. Monitor data: The error appears alongside monitor data (handleChannels, handleHeader), suggesting monitor WebSocket frames compete for buffer space

✅ Solutions

1️⃣ Fix Buffer Overflow (Critical)

Validate packet size before reading:

void readUDP() {
  while (size_t packetSize = deviceUDP.parsePacket()) {
    if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
      EXT_LOGW(MB_TAG, "Invalid UDP packet size: %d (expected %d-%d)", 
               packetSize, 38, sizeof(UDPMessage));
      deviceUDP.flush();  // Discard invalid packet
      continue;
    }
    
    char buffer[sizeof(UDPMessage)];  // Use struct size, not arbitrary 200
    UDPMessage message{};  // Zero-initialize
    deviceUDP.read(buffer, packetSize);
    memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
    
    updateDevices(message, deviceUDP.remoteIP());
  }
}

2️⃣ Add Struct Packing (For Mixed Firmware)

Ensure consistent size across all devices:

struct UDPMessage {
  uint8_t rommel[6];
  Char<32> name;
  Char<32> version;
  uint32_t uptime;
  uint16_t packageSize;
  bool lightsOn;
  uint8_t brightness;
  uint8_t palette;
  uint8_t preset;
} __attribute__((packed));  // ✅ Force no padding

3️⃣ Investigate BUFFER_SHORTAGE (Long-term)

Since this is a pre-existing issue, investigate:

  • Monitor WebSocket frame sizes
  • Total device list JSON size (serialize and log it)
  • WebSocket buffer configuration in PsychicHttp

Would you like me to help investigate the BUFFER_SHORTAGE issue further once the buffer overflow is fixed?


✏️ Learnings added
Learnt from: ewowi
URL: https://github.com/MoonModules/MoonLight/pull/114

Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-15T12:38:33.598Z
Learning: In MoonLight GameOfLifeEffect (src/MoonLight/Nodes/Effects/E_MoonModules.h): The effect's nested loops (especially lines 257-323) iterate through all cells (e.g., 128×128 = 16,384) with nested neighbor checks (18 per cell = 294,912 operations) without yielding. This starves effectTask (priority 3) when higher-priority tasks like HTTP server (priority 5) run, causing the effect to disappear. UI actions (save, refresh) temporarily show one frame because HTTP completes and briefly releases CPU back to effectTask. Solution: Add taskYIELD() inside outer X loop to prevent task starvation. This is the same root cause as issue `#102` (Starsky glitches with many driver nodes).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

backend
=======
- main: move devices from MoonBase to MoonLight (for the time being, because it sends lightscontrol)
- devices: pack struct UDPMessage, write UDP in lights control updateHandler, add packageSize and lightsOn, add onUpdate sending updates to devices- WIP, writeUDP with including device update argument
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 188-197: Replace the VLA and unsafe reads by using a fixed-size
buffer constant (e.g., constexpr size_t MAX_UDP_PACKET = 512) and validate
packetSize against expected bounds before copying into a UDPMessage: if
packetSize < sizeof(UDPMessage) or packetSize > MAX_UDP_PACKET then read and
discard the packet into the fixed buffer (deviceUDP.read(discardBuf,
min(packetSize, MAX_UDP_PACKET))) and continue; otherwise read into the fixed
buffer (uint8_t buf[MAX_UDP_PACKET]) then memcpy(&message, buf,
sizeof(UDPMessage)) and call updateDevices(message, deviceUDP.remoteIP()).
Ensure all references to deviceUDP.parsePacket, deviceUDP.read, UDPMessage,
packetSize, buffer, and updateDevices are updated accordingly so no VLA is used
and out-of-range packets are drained.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)

201-226: uptime still encodes boot epoch when NTP is set.
If this field is meant to be uptime seconds, use pal::millis()/1000 consistently. If you want boot time, rename the field/label accordingly.

🔧 Suggested fix (uptime semantics)
-      message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
+      message.uptime = pal::millis() / 1000;
🧹 Nitpick comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)

75-107: Consider a short HTTP timeout to avoid blocking the UI loop if a device is offline.
A small timeout (or async) keeps the SvelteKit task responsive when POSTs can’t connect.

💡 Suggested tweak (if supported by HTTPClient)
   HTTPClient http;
+  http.setTimeout(1000);
   http.begin(url.c_str());
   http.addHeader("Content-Type", "application/json");

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 25, 2026

@coderabbitai, I consider replacing the httpclient call by an udp message send to the receiver, I guess it is more leightweight and faster, is that correct ?

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Find HTTP client usage in ModuleDevices ==="
rg -n "HTTPClient|http.*client|http.*Client|HttpClient" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 10

echo -e "\n=== 2. Find onUpdate implementation that does HTTP POST ==="
rg -n "onUpdate|HTTP.*POST|post.*request" src/MoonBase/Modules/ModuleDevices.h -B 5 -A 15

echo -e "\n=== 3. Check for REST endpoint calls ==="
rg -n "rest/|REST|endpoint" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 10

echo -e "\n=== 4. Look for writeUDP implementation for comparison ==="
rg -n "void.*writeUDP|writeUDP\(" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 25

echo -e "\n=== 5. Check includes for HTTP client libraries ==="
rg -n "#include.*HTTP|#include.*http" src/MoonBase/Modules/ModuleDevices.h

Length of output: 6523


@ewowi, yes, you're absolutely correct! UDP is significantly more lightweight and faster than the HTTP client call for this use case. Here's why:

📊 Comparison: HTTP vs UDP

Aspect HTTP POST (current) UDP (proposed)
Protocol overhead TCP handshake (3 packets) + HTTP headers None (single packet)
Connection setup ~10-50ms (TCP + HTTP) 0ms (connectionless)
Payload size ~100-200 bytes (JSON + headers) 80-84 bytes (binary struct)
Serialization JSON encode + parse on receiver Direct memcpy of struct
CPU usage High (string ops, JSON) Minimal (binary copy)
Reliability TCP guarantees delivery Best-effort (no guarantee)
Error handling HTTP status codes (unused in your code) None

✅ Why UDP is Better Here

  1. Already implemented: writeUDP() (lines 201-226) already sends brightness, palette, preset, and lightsOn in the UDPMessage struct
  2. Consistent protocol: Device sync already uses UDP; this unifies communication
  3. Embedded-friendly: No TCP state machine, no HTTP parsing overhead
  4. Fire-and-forget: Perfect for control commands where occasional loss is acceptable (user just clicks again)

🔧 Implementation Suggestion

Instead of the HTTP call in onUpdate() (lines 85-104), send a targeted UDP packet:

void onUpdate(const UpdatedItem& updatedItem) override {
  if (_state.updateOriginId == (String(_moduleName) + "server").c_str()) return;

  if (updatedItem.parent[0] == "devices") {
    IPAddress targetIP;
    if (!targetIP.fromString(_state.data["devices"][updatedItem.index[0]]["ip"].as<const char*>())) {
      return;  // Invalid IP
    }
    
    // Build UDP message with updated field
    UDPMessage message{};  // Zero-initialize
    message.name = esp32sveltekit.getWiFiSettingsService()->getHostname().c_str();
    message.version = APP_VERSION;
    message.packageSize = sizeof(message);
    
    _moduleLightsControl->read([&](ModuleState& state) {
      message.lightsOn = state.data["lightsOn"];
      message.brightness = state.data["brightness"];
      message.palette = state.data["palette"];
      message.preset = state.data["preset"]["selected"];
    }, _moduleName);
    
    message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
    
    // Send unicast UDP to specific device (not broadcast)
    if (deviceUDP.beginPacket(targetIP, deviceUDPPort)) {
      deviceUDP.write((uint8_t*)&message, sizeof(message));
      deviceUDP.endPacket();
      EXT_LOGD(ML_TAG, "UDP update sent to %s: %s=%d", targetIP.toString().c_str(), 
               updatedItem.name.c_str(), updatedItem.value.as<int>());
    }
  }
}

⚠️ Trade-offs

  • No delivery guarantee: UDP packets can be lost (but HTTP errors aren't handled in current code anyway)
  • No acknowledgment: Can't verify the device received/applied the change (but you can see it in the next periodic update)
  • Network issues: Routers may drop UDP before TCP in congestion (rare on local network)

Given your existing UDP infrastructure and the local network environment, UDP is the right choice here. It's faster, lighter, and more consistent with your device sync architecture.

Would you like me to help implement this change?


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 25, 2026

@coderabbitai, regarding

Instead of the HTTP call in onUpdate() (lines 85-104), send a targeted UDP packet:

Sounds good, but the receiving device should update their lights when receiving this, in my rest api solution that is already implemented. How to do this with the UDP solution?

And can web sockets also be a solution? Can web sockets also broadcast to multiple devices? And what about overhead and speed?

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Find REST API endpoints in ModuleLightsControl that handle brightness/palette/preset updates ==="
rg -n "brightn|palette|preset" src/MoonLight/Modules/ModuleLightsControl.h -B 3 -A 5 | head -100

echo -e "\n=== 2. Look for setupEndpoints or REST handler registration ==="
rg -n "setupEndpoints|registerEndpoint|on\(\"/" src/MoonLight/Modules/ModuleLightsControl.h -B 2 -A 10 | head -100

echo -e "\n=== 3. Check ModuleLightsControl.cpp for REST implementations ==="
fd -e cpp ModuleLightsControl --exec cat {} | head -200

echo -e "\n=== 4. Find how lights are updated when state changes ==="
rg -n "update.*brightness|update.*palette|update.*preset|setState|write.*state" src/MoonLight/Modules/ModuleLightsControl.h -B 2 -A 5 | head -80

echo -e "\n=== 5. Check if readUDP in ModuleDevices triggers any light updates ==="
rg -n "readUDP|updateDevices.*UDP" src/MoonBase/Modules/ModuleDevices.h -B 5 -A 30 | head -100

echo -e "\n=== 6. Look for WebSocket implementation in ESP32SvelteKit or PsychicHttp ==="
fd -e h WebSocket lib/ --exec rg -l "broadcast|send.*all|emit.*all" {} | head -5

echo -e "\n=== 7. Check if there's a StatefulService update mechanism that triggers changes ==="
fd -e h StatefulService lib/ --exec cat {} | grep -A 30 "class StatefulService" | head -50

Length of output: 13098


@ewowi, great question! Let me address both parts:

🔧 UDP Solution: Updating Lights on Receiver

Currently, readUDP() only updates the device list UI—it doesn't trigger light changes. To make the receiver update its lights, modify updateDevices() to write to ModuleLightsControl's state:

void updateDevices(const UDPMessage& message, IPAddress ip) {
  // ... existing device list update code (lines 131-165) ...
  
  device["time"] = time(nullptr);
  device["name"] = message.name.c_str();
  device["version"] = message.version.c_str();
  device["uptime"] = message.uptime;
  device["packageSize"] = message.packageSize;
  device["lightsOn"] = message.lightsOn;
  device["brightness"] = message.brightness;
  device["palette"] = message.palette;
  device["preset"] = message.preset;
  
  // ✅ NEW: If this packet is intended for THIS device (not just broadcast status),
  // update local lights control state to trigger hardware changes
  IPAddress activeIP = WiFi.isConnected() ? WiFi.localIP() : ETH.localIP();
  
  // Check if sender explicitly targeted this device (not broadcast)
  // OR check a flag in UDPMessage to distinguish control commands from status broadcasts
  if (/* add flag to UDPMessage: message.isControlCommand */) {
    _moduleLightsControl->update(
      [&](ModuleState& state) {
        state.data["lightsOn"] = message.lightsOn;
        state.data["brightness"] = message.brightness;
        state.data["palette"] = message.palette;
        state.data["preset"]["selected"] = message.preset;
        return ModuleState::update;
      },
      String(_moduleName) + "udp"  // originId to track update source
    );
    
    EXT_LOGD(ML_TAG, "Applied UDP control: bri=%d pal=%d preset=%d", 
             message.brightness, message.palette, message.preset);
  }
  
  // ... existing WebSocket update code (lines 166-185) ...
}

Key insight: The _moduleLightsControl->update() call triggers the same code path as REST API (lines 323-366 in ModuleLightsControl), which updates the hardware (brightness, palette, relay pins, etc.).

Important: You'll need to distinguish between:

  • Status broadcasts (every 10s from all devices) → Only update device list UI
  • Control commands (user changed brightness on device A → send to device B) → Update device list UI and apply to local lights

Add a uint8_t flags; field to UDPMessage with a bit like 0x01 = isControlCommand.


🌐 WebSockets: Not Suitable for Device-to-Device

WebSockets are client-server architecture, not peer-to-peer:

Aspect Device-to-Device How WebSockets Work
Topology Need A → B direct Browser ↔ ESP32 server
Broadcasting A sends to all peers Server broadcasts to all browsers
ESP32-to-ESP32 ❌ Not supported Each ESP32 is a server, not a client

Why WebSockets don't work:

  1. ESP32 runs a WebSocket server for browsers, not a WebSocket client to connect to other ESP32s
  2. To use WebSockets for device sync, you'd need:
    • Each ESP32 to run both server (for browsers) and client (for other ESP32s)
    • Implement WebSocket client in firmware (heavy overhead)
    • Manage persistent connections between all devices (15 devices = 210 connections!)
  3. WebSocket connections require TCP handshake, TLS (optional), HTTP upgrade—massive overhead vs UDP

📊 Protocol Comparison

Protocol Overhead Speed Device-to-Device Broadcasting Best For
UDP None Fastest (~1ms) ✅ Native ✅ Multicast/broadcast Device sync
HTTP REST High (TCP+HTTP) Slow (~10-50ms) ✅ Via HTTP client ❌ Must loop Browser → Device
WebSocket Medium (TCP+WS) Fast (after setup) ❌ Not P2P on ESP32 ✅ Server → browsers Device → Browsers

Verdict: UDP is the correct choice for device-to-device light control synchronization. It's what you're already using for status broadcasts—just extend it to handle control commands too.


🎯 Recommended Architecture

User changes brightness on Device A
    ↓
Device A: onUpdate() fires
    ↓
Send UDP packet with isControlCommand=true
    ↓
Device B/C/D: readUDP() receives packet
    ↓
Check isControlCommand flag
    ↓
YES → Update local lights via _moduleLightsControl->update()
    ↓
Device B/C/D: onUpdate() fires → Hardware changes

This gives you the fastest possible sync (<20ms) with minimal overhead (80 bytes per command).

Would you like me to help implement the isControlCommand flag and the full bidirectional UDP control system?


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Find REST endpoint for lights control updates ==="
rg -n "rest.*lights|lights.*endpoint|brightness.*endpoint|palette.*endpoint" src/MoonBase/Modules/ModuleDevices.h src/MoonLight/Modules/ModuleLightsControl.h -B 3 -A 10

echo -e "\n=== 2. Find ModuleLightsControl write/update methods ==="
rg -n "void.*write\(|void.*update\(" src/MoonLight/Modules/ModuleLightsControl.h -B 2 -A 15 | head -100

echo -e "\n=== 3. Check current readUDP and updateDevices flow ==="
rg -n "void.*readUDP|void.*updateDevices" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 30

echo -e "\n=== 4. Look for how _moduleLightsControl is called/used ==="
rg -n "_moduleLightsControl->|moduleLightsControl\." src/MoonBase/Modules/ModuleDevices.h -B 2 -A 5

echo -e "\n=== 5. Find StatefulService write method signature ==="
rg -n "void.*write\(|template.*write" src/ --type-add 'h:*.h' -A 10 | grep -E "write\(|StatefulService|std::function" | head -60

echo -e "\n=== 6. Check WebSocket broadcast capabilities ==="
rg -n "broadcast|sendAll|emitAll|getConnectedClients" src/MoonBase/ -B 2 -A 5 | head -80

echo -e "\n=== 7. Find ModuleLightsControl StatefulService methods ==="
ast-grep --pattern 'class ModuleLightsControl {
  $$$
}'

Length of output: 7285


🏁 Script executed:

#!/bin/bash

echo "=== 1. Find StatefulService write method in lib folder ==="
fd -e h StatefulService lib/ --exec rg -n "void.*write\(" {} -B 2 -A 15

echo -e "\n=== 2. Search for REST handler registration for lightscontrol ==="
rg -n "rest.*lightscontrol|registerHandler.*lightscontrol|setupDefinition" src/MoonLight/Modules/ModuleLightsControl.h -B 5 -A 15 | head -100

echo -e "\n=== 3. Find how Module base class handles REST updates ==="
rg -n "class Module|handleUpdate|void.*update\(" src/MoonBase/Module.h -B 2 -A 20 | head -120

echo -e "\n=== 4. Check if StatefulService has update method ==="
fd -e h StatefulService lib/ --exec cat {} | grep -E "update\(|write\(" -B 3 -A 10 | head -80

echo -e "\n=== 5. Look for how REST endpoint triggers state changes ==="
rg -n "onUpdate|StateUpdateResult|handleUpdate" src/MoonBase/Module.h -B 3 -A 15

Length of output: 7532


@ewowi, great question! Let me address both parts:

🔧 UDP Solution: Making Receiver Update Lights

The key is that readUDP() currently only updates the device list, but doesn't apply the changes to the lights. You need to call _moduleLightsControl->update() when receiving UDP packets with changed values:

void readUDP() {
  while (size_t packetSize = deviceUDP.parsePacket()) {
    if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
      deviceUDP.flush();
      continue;
    }
    
    char buffer[sizeof(UDPMessage)];
    UDPMessage message{};
    deviceUDP.read(buffer, packetSize);
    memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
    
    // ✅ Update device list (existing)
    updateDevices(message, deviceUDP.remoteIP());
    
    // ✅ NEW: Apply light changes to THIS device if fields are present
    if (message.packageSize >= 80) {  // New firmware with light control fields
      _moduleLightsControl->update(
        [&](ModuleState& state) {
          bool changed = false;
          
          if (state.data["lightsOn"].as<bool>() != message.lightsOn) {
            state.data["lightsOn"] = message.lightsOn;
            changed = true;
          }
          if (state.data["brightness"].as<uint8_t>() != message.brightness) {
            state.data["brightness"] = message.brightness;
            changed = true;
          }
          if (state.data["palette"].as<uint8_t>() != message.palette) {
            state.data["palette"] = message.palette;
            changed = true;
          }
          if (state.data["preset"]["selected"].as<uint8_t>() != message.preset) {
            state.data["preset"]["selected"] = message.preset;
            changed = true;
          }
          
          return changed ? StateUpdateResult::CHANGED : StateUpdateResult::UNCHANGED;
        },
        "udp-sync"  // Origin ID to prevent loops
      );
    }
  }
}

Important: Add a check in addUpdateHandler (line 44-48) to prevent loops:

_moduleLightsControl->addUpdateHandler(
  [this](const String& originId) {
    if (originId == "udp-sync") return;  // ✅ Don't broadcast UDP changes back over UDP
    writeUDP(false);  // send update over the network
  },
  false);

🌐 WebSocket Alternative Analysis

WebSockets cannot replace UDP for device-to-device sync because:

Architecture Comparison

UDP (Peer-to-Peer) WebSocket (Client-Server)
Flow: Device A → broadcast → all devices listen Flow: Device A → server → server broadcasts to A, B, C
Topology: Decentralized (mesh) Topology: Centralized (star)
Server requirement: None Server requirement: One device must be "master"
Failure mode: Each device independent Failure mode: If master fails, all sync breaks
Code location: Already exists in ModuleDevices Code location: Would need new server role + client connections

Performance Comparison

Metric UDP WebSocket REST (current)
Latency ~1-5ms ~5-20ms ~50-200ms
Connection setup 0ms (connectionless) ~50-100ms (TCP + WS handshake) ~10-50ms per request
Overhead per message 8 bytes (UDP header) + 80 bytes (struct) = 88 bytes 8 (UDP) + 20 (TCP) + 2-14 (WS frame) + 80 (data) = ~120 bytes TCP/HTTP headers + JSON = ~300 bytes
CPU encoding memcpy (binary) WebSocket framing + JSON HTTP + JSON
CPU decoding memcpy (binary) WebSocket parsing + JSON HTTP + JSON
Broadcast Native (255.255.255.255) Server must loop through clients N/A (unicast only)
State management Stateless Stateful (connection tracking) Stateless
Memory per connection ~100 bytes (socket) ~8KB (TCP buffers + WS state) ~4KB (HTTP transaction)

WebSocket Code Example (for reference only)

If you wanted WebSockets (not recommended):

// On sender device (would need to be connected as WebSocket client to other devices)
void sendWebSocketUpdate(IPAddress targetIP) {
  // ❌ Problem: ESP32 would need WebSocket CLIENT library to connect to other ESP32s
  // ❌ Problem: Each device needs to maintain persistent connections to all 15 devices
  // ❌ Problem: 15 × 8KB = 120KB RAM just for connection buffers
  // ❌ Problem: Who is server? Who is client? Need master election logic
}

✅ Recommendation

Stick with UDP for device-to-device sync because:

  1. Already implemented: Your infrastructure is UDP-based
  2. Lower latency: 1-5ms vs 5-20ms (WebSocket) vs 50-200ms (REST)
  3. Lower overhead: 88 bytes vs 120 bytes (WS) vs 300 bytes (REST)
  4. Decentralized: No single point of failure
  5. Simpler: No connection management, master election, or reconnection logic
  6. Less memory: ~100 bytes vs ~8KB per "connection"

The only advantage of WebSocket would be guaranteed delivery, but:

  • Your current REST solution doesn't verify delivery either
  • Local network UDP packet loss is <0.01%
  • Users can just click again if a change doesn't apply

TL;DR: Use the UDP solution code above. It's faster, simpler, and fits your existing architecture perfectly. WebSockets would add complexity without meaningful benefits for this use case. 🎯


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

backend
=======
Devices:
- add UDPMessage.isControlCommand
- rename writeUDP to sendUDP and readUDP to receiveUDP
- onUpdate: call sendUDP(targetIP)
- receiveUDP: packageSize, device.clear, bufferSize, if isControlCommand then processCommand
- sendUDP: targetIP argument, set isControl if targetIP
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 76-86: In onUpdate (method onUpdate / symbol updatedItem) add
guards to ensure updatedItem.parent and updatedItem.index are non-empty before
accessing [0]; specifically check updatedItem.parent.empty() and
updatedItem.index.empty() (or sizes) and return early if either is empty, then
proceed to read _state.data["devices"][updatedItem.index[0]]["ip"] and call
sendUDP(false, targetIP) as before; this prevents UB when indexing parent/index
and keeps the existing logic in onUpdate intact.
- Around line 39-49: The constructor currently accepts Module* moduleControl and
dereferences _moduleControl (e.g., in ModuleDevices constructor and when calling
_moduleControl->addUpdateHandler and later sendUDP callbacks) without null
checks; change the API to enforce non-null by taking Module& moduleControl
(update the ModuleDevices constructor signature and assignment to _moduleControl
to a reference) or, if pointer API must remain, validate moduleControl at start
of the ModuleDevices constructor and fail-fast (log error/throw/assert) when
null before assigning to _moduleControl and before calling addUpdateHandler;
ensure all uses reference the validated symbol (ModuleDevices constructor,
_moduleControl, addUpdateHandler, sendUDP) so no null dereference can occur.
- Around line 166-195: receiveUDP currently calls updateDevices() for every UDP
packet which can flood WebSocket sends; modify receiveUDP (and related
updateDevices usage) to coalesce or debounce broadcasts: inside receiveUDP
accumulate incoming UDPMessage changes into a single aggregated state (or track
a “dirty” flag and last-sent values) while draining the socket, then after the
receive loop call updateDevices(...) once (or call it at most once per X ms)
only if the aggregated state differs from the last-emitted state; also ensure
control-path using _moduleControl->update(...) similarly only triggers when
state actually changed to avoid redundant WebSocket pushes.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)

216-217: uptime currently stores boot timestamp, not uptime.

With NTP set, time(nullptr) - pal::millis()/1000 yields boot time; the field name says uptime. Either rename the field or store uptime directly.

🔧 Suggested fix (uptime seconds)
-      message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
+      message.uptime = pal::millis() / 1000;

backend
=======
Devices:
- onUpdate: send UDP message using updatedItem
- receiveUDP: set preset action and select
- sendUDP: remove targetIP
@ewowi
Copy link
Collaborator Author

ewowi commented Jan 26, 2026

@coderabbitai, can you review latest commit 2fdf976

@coderabbitai
Copy link

coderabbitai bot commented Jan 26, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Fetch commit details ==="
git show 2fdf97647d3bab308bc336a560bfd09e969b3da1 --stat

echo -e "\n=== Full diff of the commit ==="
git show 2fdf97647d3bab308bc336a560bfd09e969b3da1

Length of output: 5003


@ewowi, excellent implementation! 🎉 This commit successfully implements the UDP-based device control flow we discussed. Here's my detailed review:

✅ What's Working Great

1. Control Command Flow (Lines 79-106)

Perfect implementation of unicast UDP control messages:

message.isControlCommand = true;
deviceUDP.write((uint8_t*)&message, sizeof(message));
  • ✅ Zero-initialized struct prevents garbage data
  • ✅ Uses sizeof(message) for safe buffer sizing
  • ✅ Builds message from device state directly
  • ✅ Logs sent values for debugging

2. Loop Prevention (Lines 77 & 212)

Smart originID-based loop prevention:

  • Line 77: if (_state.updateOriginId == (String(_moduleName) + "server").c_str()) return;
  • Line 212: _moduleControl->update(newState, ModuleState::update, String(_moduleName) + "server");

This ensures Device B doesn't re-broadcast when applying a control command from Device A ✓

3. Broadcast Simplification (Lines 220-242)

Clean separation of concerns:

  • sendUDP(false) → Broadcast status (isControlCommand=false)
  • onUpdate → Unicast control (isControlCommand=true)

⚠️ Recommendations

1. Preset Field Naming (Line 209)

newState["preset"]["select"] = message.preset;  // ← Is this correct?

Should this be preset["selected"] to match the typical state structure? Or does ModuleLightsControl expect preset["select"]? The commit message says "preset action fix", so this might be intentional, but please verify it matches ModuleLightsControl's expected schema.

2. Struct Packing (Critical for Mixed Firmware)

Add to UDPMessage struct definition:

struct UDPMessage {
  uint8_t rommel[6];
  Char<32> name;
  Char<32> version;
  uint32_t uptime;
  uint16_t packageSize;
  bool lightsOn;
  uint8_t brightness;
  uint8_t palette;
  uint8_t preset;
  bool isControlCommand;
} __attribute__((packed));  // ✅ Add this to prevent alignment issues

Without this, ESP32-S3 vs ESP32-P4 may have different struct layouts (77-84 bytes) causing the crashes you experienced.

3. Buffer Validation in receiveUDP

Verify the loop includes size validation:

while (size_t packetSize = deviceUDP.parsePacket()) {
  if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
    deviceUDP.flush();
    continue;
  }
  // ... rest of code
}

This prevents the buffer overflow that caused WebSocket crashes when you increased buffer to 200.

4. Unnecessary Broadcasts (Minor Optimization)

When Device B receives a control command and applies it, the addUpdateHandler (registered by ModuleDevices) will trigger sendUDP(false), broadcasting the change to all devices. This is harmless (just updates device list UI) but inefficient. Consider:

_moduleLightsControl->addUpdateHandler(
  [this](const String& originId) {
    if (originId == (String(_moduleName) + "server")) return;  // Skip broadcast for UDP-sourced changes
    sendUDP(false);
  },
  false);

📊 Architecture Summary

Your implementation now achieves:

User changes brightness on Device A
    ↓
onUpdate() fires → builds UDPMessage (isControlCommand=true)
    ↓
Unicast UDP to Device B (targetIP)
    ↓
Device B: receiveUDP() → isControlCommand=true
    ↓
Apply to _moduleControl (originID="devicesserver")
    ↓
Device B: Lights change ✓
    ↓
onUpdate() returns early (loop prevention) ✓

This is exactly the right design for fast (<20ms), lightweight device synchronization! 🎯


Overall: Strong commit that implements the UDP control flow correctly. Just verify the preset field naming and add struct packing to prevent the mixed-firmware crashes. The architecture is solid! 🚀


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

backend
=======
- Module: add mutexedOriginId, add originId to checkReOrderSwap, compareRecursive, postUpdate, processUpdatedItem, onUpdate
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/MoonBase/Module.h`:
- Around line 47-52: mutexedOriginId currently stores a pointer to a String that
may be a temporary and thus can dangle; change the update queue to store a safe
copy of the origin id (e.g., replace static const String* mutexedOriginId with a
String mutexedOriginId or add a String member on UpdatedItem) and ensure
postUpdate() writes the copied String into mutexedUpdatedItem/UpdatedItem and
getUpdate() reads that copy; update any uses in postUpdate(), getUpdate(), and
the UpdatedItem structure so no call sites pass/persist raw pointers to
temporaries.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)

99-100: Uptime calculation still yields boot time, not uptime.

If the field is meant to be uptime seconds, time(nullptr) - pal::millis()/1000 reports the boot epoch (and can underflow before time sync). Prefer monotonic uptime (or rename the field if boot time is intended).

🔧 Suggested fix (if uptime is intended)
-        message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
+        message.uptime = pal::millis() / 1000;
-      message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
+      message.uptime = pal::millis() / 1000;

Also applies to: 237-238

🧹 Nitpick comments (1)
src/MoonBase/Modules/ModuleIO.h (1)

576-597: Factor the origin check into a single boolean

Repeated originId.endsWith("server") calls add noise and redo string scanning. Cache once for clarity and consistency.

♻️ Suggested refactor
   void onUpdate(const UpdatedItem& updatedItem, const String& originId) override {
     JsonDocument doc;
     JsonObject object = doc.to<JsonObject>();
-    if (updatedItem.name == "boardPreset" && !originId.endsWith("server")) {  // not done by this module: done by UI
+    const bool isServerOrigin = originId.endsWith("server");
+    if (updatedItem.name == "boardPreset" && !isServerOrigin) {  // not done by this module: done by UI
       // if booting and modded is false or ! booting
       if ((updatedItem.oldValue == "" && _state.data["modded"] == false) || updatedItem.oldValue != "") {  // only update unmodded
         EXT_LOGD(MB_TAG, "%s %s[%d]%s[%d].%s = %s -> %s", originId.c_str(), updatedItem.parent[0].c_str(), updatedItem.index[0], updatedItem.parent[1].c_str(), updatedItem.index[1], updatedItem.name.c_str(), updatedItem.oldValue.c_str(), updatedItem.value.as<String>().c_str());
         newBoardID = updatedItem.value;  // run in sveltekit task
       }
-    } else if (updatedItem.name == "modded" && !originId.endsWith("server")) {  // not done by this module: done by UI
+    } else if (updatedItem.name == "modded" && !isServerOrigin) {  // not done by this module: done by UI
       // set pins to default if modded is turned off
       if (updatedItem.value == false) {
         EXT_LOGD(MB_TAG, "%s[%d]%s[%d].%s = %s -> %s", updatedItem.parent[0].c_str(), updatedItem.index[0], updatedItem.parent[1].c_str(), updatedItem.index[1], updatedItem.name.c_str(), updatedItem.oldValue.c_str(), updatedItem.value.as<String>().c_str());
         newBoardID = _state.data["boardPreset"];  // run in sveltekit task
       }
-    } else if ((updatedItem.name == "switch1" || updatedItem.name == "switch2") && !originId.endsWith("server")) {
+    } else if ((updatedItem.name == "switch1" || updatedItem.name == "switch2") && !isServerOrigin) {
       // rebuild with new switch setting
       newBoardID = _state.data["boardPreset"];  // run in sveltekit task
-    } else if (updatedItem.name == "maxPower" && !originId.endsWith("server")) {
+    } else if (updatedItem.name == "maxPower" && !isServerOrigin) {
       object["modded"] = true;
-    } else if (updatedItem.name == "usage" && !originId.endsWith("server")) {  // not done by this module: done by UI
+    } else if (updatedItem.name == "usage" && !isServerOrigin) {  // not done by this module: done by UI
       object["modded"] = true;
     }

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 26, 2026

@coderabbitai, occasionally the system gots stuck on _accessMutex when I update brightness, preset, palette or lightsOn. _accessMutex is used in the ModuleState update() and read() function.

Can you do a thorough review on the devices module how that can happen? Are there situations that a read() or update() is in progress and another one from another process is starting? onUpdate() and loop() are both done by the sveltekit task (please verify). or do other tasks kick in on the _accessMutex of devices?

In general, please do a review of moduleDevices with all the commits added in this PR and provide a flow diagram showing the different scenarios and processes / tasks: receive a device message (every 10s + updates) per device, send the own device messages (every 10s + updates), also verify / proof no racing conditions can occur: eg. update a device -> send -> receive device update -> update UI (updateDevices) etc.

Logging when _accessMutex is taken - this goes on repeatedly and the device becomes non responding requiring a restart:

[ 24131][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=24 preset=15
[ 24256][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.188 bri=0 pal=0 preset=0
[ 27857][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=43 preset=15
[ 41386][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 41906][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 42422][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 42532][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=8 preset=15
[ 43086][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 43606][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 44122][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 44643][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 45163][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 45679][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 46200][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues

@coderabbitai
Copy link

coderabbitai bot commented Jan 26, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== 1. Get current ModuleDevices.h full implementation ==="
cat src/MoonBase/Modules/ModuleDevices.h

echo -e "\n\n=== 2. Get StatefulService mutex implementation ==="
fd StatefulService.h lib/ --exec cat {}

echo -e "\n\n=== 3. Find which task runs loop20ms and loop10s ==="
rg -n "loop20ms|loop10s" src/main.cpp -B 5 -A 5

echo -e "\n\n=== 4. Check Module.h for update/onUpdate flow ==="
cat src/MoonBase/Module.h | head -200

echo -e "\n\n=== 5. Find all _accessMutex usage patterns ==="
rg -n "_accessMutex" src/MoonBase/ -B 3 -A 3

echo -e "\n\n=== 6. Check if updateDevices calls update() ==="
rg -n "void updateDevices" src/MoonBase/Modules/ModuleDevices.h -A 50

echo -e "\n\n=== 7. Find addUpdateHandler implementation ==="
rg -n "addUpdateHandler" src/MoonBase/Module.h -B 5 -A 20

Length of output: 27998


🏁 Script executed:

#!/bin/bash

echo "=== 8. Check if receiveUDP is called from loop or different context ==="
rg -n "receiveUDP" src/MoonBase/Modules/ModuleDevices.h -B 5 -A 5

echo -e "\n\n=== 9. Check ModuleLightsControl update handler ==="
rg -n "moduleDevices.*addUpdateHandler|addUpdateHandler.*moduleDevices" src/ -B 3 -A 10

echo -e "\n\n=== 10. Find task creation and priority assignments ==="
rg -n "xTaskCreate|createTask|Task.*priority" src/main.cpp | head -40

echo -e "\n\n=== 11. Check for nested update() calls ==="
rg -n "->update\(|\.update\(" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 5

echo -e "\n\n=== 12. Check beginTransaction and endTransaction in StatefulService ==="
fd StatefulService.h lib/ --exec rg -n "beginTransaction|endTransaction" {} -B 2 -A 15

Length of output: 5316


🏁 Script executed:

#!/bin/bash

echo "=== 13. Check if onUpdate can be called during update() ==="
cat src/MoonBase/Module.cpp | grep -A 30 "void Module::processUpdatedItem"

echo -e "\n\n=== 14. Find all places where devices module state is accessed ==="
rg -n "moduleDevices\._state|moduleDevices->_state" src/ -B 2 -A 2

echo -e "\n\n=== 15. Check update() signature and where originId comes from ==="
rg -n "ModuleState::update.*originId" src/MoonBase/Module.h -B 3 -A 20

echo -e "\n\n=== 16. Verify no recursive update scenarios ==="
rg -n "processUpdatedItem.*update\(|update\(.*processUpdatedItem" src/MoonBase/ -B 3 -A 3

Length of output: 248


@ewowi, I've found the root cause of your mutex deadlock! This is a critical issue in your codebase. Let me explain with diagrams and solutions.

🔴 Root Cause: Critical Bug in StatefulService.h

Line 204-205 in StatefulService.h contains a DANGEROUS bug:

ESP_LOGW("🐼", "_accessMutex %s waited 500ms and continues", originId.c_str());
// xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);  // ← This line is COMMENTED OUT!

After the 500ms timeout, the code proceeds WITHOUT acquiring the mutex! This causes:

  1. Thread A holds mutex for >500ms
  2. Thread B times out and continues anyway without mutex protection
  3. Race condition corrupts shared state (JsonDocument operations are NOT thread-safe!)
  4. Corruption causes further hangs or crashes
  5. System becomes unresponsive

📊 Flow Diagrams

Normal Flow (Working)

sequenceDiagram
    participant User
    participant SvelteKit Task
    participant ModuleDevices
    participant ModuleLightsControl
    participant UDP

    User->>SvelteKit Task: Change brightness
    SvelteKit Task->>ModuleLightsControl: update()
    Note over ModuleLightsControl: Acquires mutex
    ModuleLightsControl->>ModuleLightsControl: Update state
    Note over ModuleLightsControl: Releases mutex
    ModuleLightsControl->>ModuleDevices: Call update handler: sendUDP(false)
    ModuleDevices->>ModuleLightsControl: read() to get state
    Note over ModuleLightsControl: Acquires mutex (recursive OK)
    ModuleLightsControl-->>ModuleDevices: Return state
    Note over ModuleLightsControl: Releases mutex
    ModuleDevices->>UDP: Broadcast packet
Loading

Deadlock Scenario (Bug Trigger)

sequenceDiagram
    participant loop20ms
    participant receiveUDP
    participant updateDevices
    participant ModuleDevices State
    participant UDP Buffer

    Note over UDP Buffer: 5+ packets queued<br/>(own broadcasts + other devices)
    
    loop20ms->>receiveUDP: Process packets
    receiveUDP->>receiveUDP: while(parsePacket())
    
    Note over receiveUDP: Packet 1: Control command
    receiveUDP->>ModuleLightsControl: update() for control
    ModuleLightsControl->>ModuleDevices: Trigger sendUDP(false)
    ModuleDevices->>UDP Buffer: Broadcast (adds to queue)
    
    Note over receiveUDP: Packet 2: Status from Device B
    receiveUDP->>updateDevices: Process status
    updateDevices->>ModuleDevices State: update() - Acquires mutex
    Note over ModuleDevices State: Sort 15 devices<br/>Build JSON<br/>Update WebSocket
    
    Note over receiveUDP: Packet 3: Own broadcast (loopback)
    receiveUDP->>updateDevices: Process own packet AGAIN
    updateDevices->>ModuleDevices State: update() - Try acquire mutex (RECURSIVE)
    Note over ModuleDevices State: Nested update!
    
    Note over receiveUDP: Packets 4, 5, 6... keep nesting
    receiveUDP->>updateDevices: Keep calling updateDevices
    updateDevices->>ModuleDevices State: Deeper recursion
    
    Note over ModuleDevices State: After several nested calls:<br/>- JSON corruption from unprotected access<br/>- Stack near limit<br/>- Mutex held for >500ms
    
    par Concurrent access attempts
        loop20ms->>receiveUDP: Next 20ms tick
        receiveUDP->>updateDevices: New packet
        updateDevices->>ModuleDevices State: Try acquire mutex
        Note over ModuleDevices State: TIMEOUT 500ms
        Note over ModuleDevices State: BUG: Continues without mutex!
        Note over ModuleDevices State: Corrupts JSON while other<br/>thread still processing
    end
    
    Note over ModuleDevices State: System deadlocked<br/>Repeated timeouts<br/>Data corruption
Loading

🔍 Specific Issues Identified

Issue 1: Self-Broadcast Loop

Line 236 in sendUDP(true):

deviceUDP.beginPacket(IPAddress(255, 255, 255, 255), deviceUDPPort);  // Broadcast

This broadcast is received by the same device via loopback! The while loop in receiveUDP() processes it, calling updateDevices() again, which triggers more handlers, creating a circular call chain.

Issue 2: No Loop Prevention in receiveUDP

Line 186-219: The receiveUDP() function processes ALL packets including ones sent by itself:

while (size_t packetSize = deviceUDP.parsePacket()) {
    // ... no check if message.name == own hostname!
    if (message.isControlCommand) {
        _moduleControl->update(newState, ModuleState::update, String(_moduleName) + "server");
    } else
        updateDevices(message, deviceUDP.remoteIP());  // ← Processes own broadcasts!
}

Issue 3: updateDevices Holds Mutex Too Long

Lines 168-183: Sorting and JSON operations while holding mutex:

void updateDevices(const UDPMessage& message, IPAddress ip) {
    // ... build devices array ...
    
    std::vector<JsonObject> devicesVector;
    for (JsonObject dev : devices) {  // ← Iteration
        if (time(nullptr) - dev["time"].as<time_t>() < 86400) devicesVector.push_back(dev);
    }
    std::sort(devicesVector.begin(), devicesVector.end(), ...);  // ← Sorting!
    
    // ... build sorted JSON ...
    
    update(controls, ModuleState::update, String(_moduleName) + "server");  // ← Acquires mutex
    // All the above work is done INSIDE the update() call's mutex hold time!
}

With 15 devices and nested calls, this can take >500ms!

✅ Solutions

Solution 1: FIX THE CRITICAL BUG (Mandatory!)

In lib/.../StatefulService.h line 204-205, either:

Option A: Wait forever (recommended for debugging):

if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(100))==pdFALSE) {
    if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
        ESP_LOGW("🐼", "_accessMutex %s waited 500ms, waiting indefinitely...", originId.c_str());
        xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);  // ✅ UNCOMMENT THIS!
    }
}

Option B: Fail-fast with error (better for production):

if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(500))==pdFALSE) {
    ESP_LOGE("🐼", "_accessMutex %s DEADLOCK DETECTED! Rebooting...", originId.c_str());
    esp_restart();  // Better than continuing with corrupted state
}

Solution 2: Prevent Self-Broadcast Processing

In ModuleDevices.h line 186, add origin check:

void receiveUDP() {
    String ownHostname = esp32sveltekit.getWiFiSettingsService()->getHostname();
    
    while (size_t packetSize = deviceUDP.parsePacket()) {
        if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
            EXT_LOGW(MB_TAG, "Invalid UDP packet size: %d (expected %d-%d)", packetSize, 38, sizeof(UDPMessage));
            deviceUDP.clear();
            continue;
        }

        char buffer[sizeof(UDPMessage)];
        UDPMessage message{};
        deviceUDP.read(buffer, packetSize);
        memcpy(&message, buffer, packetSize);
        
        // ✅ Skip own broadcasts to prevent loops
        if (message.name == ownHostname) {
            EXT_LOGV(ML_TAG, "Skipping own broadcast");
            continue;
        }

        if (message.isControlCommand) {
            // ... existing code ...
        } else {
            updateDevices(message, deviceUDP.remoteIP());
        }
    }
}

Solution 3: Optimize updateDevices to Reduce Mutex Hold Time

Move expensive operations outside the mutex:

void updateDevices(const UDPMessage& message, IPAddress ip) {
    // Build and sort BEFORE calling update()
    std::vector<JsonObject> devicesVector;
    JsonDocument doc2;
    
    // Read current state
    {
        if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
        JsonArray devices = _state.data["devices"];
        
        // Find or create device entry
        JsonObject device = JsonObject();
        for (JsonObject dev : devices) {
            if (dev["ip"] == ip.toString()) {
                device = dev;
                break;
            }
        }
        if (device.isNull()) {
            device = devices.add<JsonObject>();
            EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name);
            device["ip"] = ip.toString();
        }
        
        // Update fields
        device["time"] = time(nullptr);
        device["name"] = message.name.c_str();
        device["version"] = message.version.c_str();
        device["uptime"] = message.uptime;
        device["packageSize"] = message.packageSize;
        device["lightsOn"] = message.lightsOn;
        device["brightness"] = message.brightness;
        device["palette"] = message.palette;
        device["preset"] = message.preset;
        
        if (!_socket->getConnectedClients()) return;  // No WebSocket clients
        
        // Copy for sorting (outside critical section)
        for (JsonObject dev : devices) {
            if (time(nullptr) - dev["time"].as<time_t>() < 86400) {
                devicesVector.push_back(dev);
            }
        }
    }
    
    // ✅ Sort OUTSIDE the mutex-protected update() call
    std::sort(devicesVector.begin(), devicesVector.end(), 
              [](JsonObject a, JsonObject b) { return a["name"] < b["name"]; });
    
    // Build sorted JSON
    doc2["devices"].to<JsonArray>();
    for (JsonObject device : devicesVector) {
        doc2["devices"].add(device);
    }
    
    // ✅ Now update with pre-built, sorted data (fast operation)
    JsonObject controls = doc2.as<JsonObject>();
    update(controls, ModuleState::update, String(_moduleName) + "server");
}

Solution 4: Add originId to UDPMessage

Prevent update loops by tracking who triggered the update:

struct UDPMessage {
    uint8_t rommel[6];
    Char<32> name;
    Char<32> originDevice;  // ✅ Add this
    Char<32> version;
    uint32_t uptime;
    uint16_t packageSize;
    bool lightsOn;
    uint8_t brightness;
    uint8_t palette;
    uint8_t preset;
    bool isControlCommand;
} __attribute__((packed));

Then use it to prevent circular updates in onUpdate (line 77).

📋 Task/Thread Verification

Confirmed: All ModuleDevices operations run in SvelteKit task (from main.cpp lines 371, 403):

  • loop20ms() → receiveUDP()
  • loop10s() → sendUDP(true)
  • Module::loop() → processes updates

No cross-task conflicts, but the while loop in receiveUDP() creates deep recursion when processing multiple queued packets.

🎯 Recommended Action Plan

  1. IMMEDIATE: Apply Solution 1 (fix StatefulService bug) - this is critical!
  2. HIGH PRIORITY: Apply Solution 2 (skip own broadcasts) - prevents the main loop
  3. MEDIUM PRIORITY: Apply Solution 3 (optimize updateDevices) - reduces mutex hold time
  4. OPTIONAL: Apply Solution 4 (add originDevice) - extra safety

After applying solutions 1 and 2, the deadlock should disappear completely.

Would you like me to generate the complete corrected code for ModuleDevices.h?


✏️ Learnings added
Learnt from: ewowi
URL: https://github.com/MoonModules/MoonLight/pull/114

Timestamp: 2026-01-26T14:59:22.069Z
Learning: In MoonModules/MoonLight PR `#114`: ModuleDevices deadlock caused by: (1) StatefulService.h line 204-205 bug - after 500ms mutex timeout, continues WITHOUT acquiring mutex, causing race conditions and JSON corruption. (2) receiveUDP() while-loop processes own UDP broadcasts via loopback, creating circular call chain: sendUDP→broadcast→receiveUDP→updateDevices→update→handlers→sendUDP. (3) updateDevices() holds _accessMutex during expensive sort/JSON operations with 15 devices, can exceed 500ms with nested calls. Solution: Fix StatefulService to wait forever or reboot on timeout; skip own broadcasts by checking message.name == ownHostname; optimize updateDevices to sort outside mutex.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-15T12:38:33.598Z
Learning: In MoonLight GameOfLifeEffect (src/MoonLight/Nodes/Effects/E_MoonModules.h): The effect's nested loops (especially lines 257-323) iterate through all cells (e.g., 128×128 = 16,384) with nested neighbor checks (18 per cell = 294,912 operations) without yielding. This starves effectTask (priority 3) when higher-priority tasks like HTTP server (priority 5) run, causing the effect to disappear. UI actions (save, refresh) temporarily show one frame because HTTP completes and briefly releases CPU back to effectTask. Solution: Add taskYIELD() inside outer X loop to prevent task starvation. This is the same root cause as issue `#102` (Starsky glitches with many driver nodes).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 114
File: src/MoonBase/Modules/ModuleDevices.h:39-49
Timestamp: 2026-01-26T12:58:42.286Z
Learning: In MoonModules/MoonLight: Do not add null checks for essential components like `moduleControl` in ModuleDevices or similar core dependencies. A null value for these essential components indicates a programming error (bug in initialization/calling code) that should fail-fast rather than be masked with defensive checks. This aligns with the "don't mask programming errors" philosophy used throughout the codebase.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 104
File: src/MoonLight/Layers/PhysicalLayer.cpp:145-147
Timestamp: 2026-01-13T15:18:48.740Z
Learning: In MoonLight's PhysicalLayer and Node architecture (src/MoonLight/Layers/PhysicalLayer.cpp, src/MoonBase/Nodes.h): effectsMutex, driversMutex, and layerMutex are mandatory for system operation. These mutexes are validated at creation time (constructor checks log errors if xSemaphoreCreateMutex returns nullptr). Runtime null checks on these mutex pointers are intentionally omitted—the design treats successful mutex allocation as an invariant after initialization, placing responsibility on the programmer to ensure proper allocation. This is a performance-oriented "fail-fast at boot" approach rather than defensive runtime checking.
<!-- </add_learning>

@ewowi
Copy link
Collaborator Author

ewowi commented Jan 26, 2026

@coderabbitai, regarding

Recommended Action Plan
IMMEDIATE: Apply Solution 1 (fix StatefulService bug) - this is critical!
)HIGH PRIORITY: Apply Solution 2 (skip own broadcasts) - prevents the main loop
MEDIUM PRIORITY: Apply Solution 3 (optimize updateDevices) - reduces mutex hold time
OPTIONAL: Apply Solution 4 (add originDevice) - extra safety
After applying solutions 1 and 2, the deadlock should disappear completely.

To be honest, I think none of this is true:

  1. _accessMutex devicesserver waited 500ms and continues should never happen if coding is done right, this is a warning system for the developer, and it continues to make better debugging possible. Also: never do a restart if an error occurs! never! add that as a learning. Only the user may trigger a restart.

  2. I don't think the own broadcast is looped back with udp broadcast, I added a check to verify in my local code and I do not see it, also didn't saw it in the past, please verify.

  3. There is no mutex hold in updateDevices, except at the end when update() is called, your optimizations do not change that.

  4. see 2)

The problem occurs when updating a number of times, I suspect that it goes wrong if another task interferes, e.g. a 10s update comes in, so look again.

Side issue: I only update to 192.168.1.132 but occasionally an update for 192.168.1.189 seems to be triggered, I don't know why. Note this seems to come from a device which does not have the code of this PR installed, so the message size is 38 or 44 bytes and the isControlMessage is not set in these devices, could they be set random (not 50% much lower as it happens occasionally). Don't know if this is related to above issue?

[ 41196][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=49 preset=15
[ 42901][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=112 pal=49 preset=15
[ 44782][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=49 preset=15
[ 46222][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=218 pal=49 preset=15
[ 46275][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=218 pal=49 preset=15
[ 46318][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.189 bri=0 pal=0 preset=0
[ 47045][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 47566][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 48082][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues
[ 48604][W][StatefulService.h:204] beginTransaction(): [🐼] _accessMutex devicesserver waited 500ms and continues

@coderabbitai
Copy link

coderabbitai bot commented Jan 26, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@ewowi, you're absolutely right - I apologize for the incorrect analysis. Let me investigate the real issue.


🏁 Script executed:

#!/bin/bash

echo "=== 1. Check UDPMessage zero-initialization in receiveUDP ==="
rg -n "UDPMessage message" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 8

echo -e "\n=== 2. Check if loop10s (sendUDP broadcast) runs on different task ==="
rg -n "loop10s|loop20ms" src/main.cpp -B 10 -A 3

echo -e "\n=== 3. Find all places that call update() on ModuleDevices state ==="
rg -n "moduleDevices.*update\(|_state.*update\(" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 2

echo -e "\n=== 4. Check for concurrent read() and update() patterns ==="
cat src/MoonBase/Modules/ModuleDevices.h | grep -n "void onUpdate\|void updateDevices\|void receiveUDP\|void sendUDP" -A 30 | head -150

echo -e "\n=== 5. Investigate the 192.168.1.189 spurious update ==="
# Check the onUpdate logic for device iteration
rg -n "for.*devices\|devices\[.*\]" src/MoonBase/Modules/ModuleDevices.h -B 2 -A 5

echo -e "\n=== 6. Check StatefulService update() for recursive mutex behavior ==="
fd StatefulService.h lib/ --exec cat {} | grep -n "StateUpdateResult update\|beginTransaction\|endTransaction" -B 2 -A 15 | head -100

echo -e "\n=== 7. Check if ModuleDevices read() is called during update() ==="
rg -n "_moduleLightsControl->read\(" src/MoonBase/Modules/ModuleDevices.h -B 5 -A 5

Length of output: 13554


🏁 Script executed:

#!/bin/bash

echo "=== 8. Check the exact flow when isControlCommand is true ==="
rg -n "isControlCommand" src/MoonBase/Modules/ModuleDevices.h -B 8 -A 8

echo -e "\n=== 9. Look at Module update propagation and handler calling ==="
cat src/MoonBase/Module.cpp | grep -n "processUpdatedItem\|onUpdate\|addUpdateHandler" -B 3 -A 10 | head -120

echo -e "\n=== 10. Check if updateDevices can be called while holding another mutex ==="
rg -n "void updateDevices" src/MoonBase/Modules/ModuleDevices.h -A 30 | grep -E "mutex|lock|Mutex|beginTransaction|endTransaction"

echo -e "\n=== 11. Trace the exact scenario: user changes brightness ==="
# What happens when user changes brightness on UI?
# 1. ModuleLightsControl->update() called
# 2. This triggers addUpdateHandler callbacks
# 3. ModuleDevices onUpdate() fires
# 4. Reads from ModuleLightsControl and sends UDP
# Let's verify this chain

rg -n "addUpdateHandler" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 15

Length of output: 6229


🏁 Script executed:

#!/bin/bash

echo "=== 12. Critical: Check if onUpdate calls read() which might conflict ==="
# onUpdate line 77-106 reads from _moduleLightsControl
# Does _moduleLightsControl->read() internally call update on ModuleDevices?

cat src/MoonBase/Modules/ModuleDevices.h | sed -n '77,106p'

echo -e "\n=== 13. Check Module::read() implementation ==="
rg -n "void read\(|template.*read" src/MoonBase/Module.h -B 2 -A 15 | head -60

echo -e "\n=== 14. Check for any WebSocket emission during updateDevices ==="
rg -n "_socket|emit|broadcast" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 3

echo -e "\n=== 15. Verify packet size validation logic ==="
rg -n "packetSize.*38\|packetSize.*sizeof" src/MoonBase/Modules/ModuleDevices.h -B 3 -A 10

Length of output: 3855


@ewowi, you're absolutely right to challenge my analysis! I apologize for the incorrect conclusions. Let me investigate the real issue.

After carefully tracing through the code and task execution, I found the actual root cause: a critical race condition between the HTTP server task and the SvelteKit task.

🔴 Real Root Cause: Unprotected State Access

The bug is in updateDevices() lines 132-165 - these lines modify _state.data without holding the mutex:

void updateDevices(const UDPMessage& message, IPAddress ip) {
    // ❌ Lines 132-165: Modifying _state.data WITHOUT mutex protection!
    if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
    
    JsonArray devices;
    JsonDocument doc;
    if (_socket->getConnectedClients()) {
        doc.set(_state.data);  // copy
        devices = doc["devices"];
    } else
        devices = _state.data["devices"];  // ❌ Direct reference!
    
    JsonObject device = JsonObject();
    for (JsonObject dev : devices) {  // ❌ Iteration without mutex
        if (dev["ip"] == ip.toString()) {
            device = dev;
        }
    }
    if (device.isNull()) {
        device = devices.add<JsonObject>();  // ❌ Modification without mutex
        device["ip"] = ip.toString();
    }
    
    device["time"] = time(nullptr);  // ❌ All these modifications unprotected
    device["name"] = message.name.c_str();
    device["version"] = message.version.c_str();
    // ... etc
    
    // ✅ Only here does it acquire mutex (line 177)
    update(controls, ModuleState::update, String(_moduleName) + "server");
}

📊 The Race Condition

sequenceDiagram
    participant HTTP Task (Priority 5)
    participant SvelteKit Task (Priority 3)
    participant ModuleDevices State
    participant ModuleLightsControl

    Note over HTTP Task: User changes brightness in UI
    HTTP Task->>ModuleLightsControl: update() via REST API
    Note over ModuleLightsControl: Acquires ModuleLightsControl mutex
    ModuleLightsControl->>ModuleLightsControl: Updates state
    Note over ModuleLightsControl: Releases mutex
    ModuleLightsControl->>HTTP Task: Calls updateHandler (line 45-49)
    
    par Concurrent execution
        HTTP Task->>HTTP Task: onUpdate() starts (line 76)
        HTTP Task->>ModuleDevices State: Read devices[index] (line 80)<br/>❌ NO MUTEX!
        
        and
        
        SvelteKit Task->>SvelteKit Task: loop20ms() -> receiveUDP()
        SvelteKit Task->>SvelteKit Task: Processes UDP packet
        SvelteKit Task->>ModuleDevices State: updateDevices() MODIFIES state<br/>(lines 132-165)<br/>❌ NO MUTEX!
    end
    
    Note over ModuleDevices State: JSON CORRUPTION!<br/>HTTP task reads during SvelteKit modification
    
    HTTP Task->>ModuleDevices State: Gets corrupted device data<br/>(wrong index or default values)
    HTTP Task->>HTTP Task: Sends UDP with bri=0 pal=0 preset=0<br/>to wrong IP (192.168.1.189)
    
    SvelteKit Task->>ModuleDevices State: Now calls update() (line 177)<br/>Tries to acquire mutex
    
    Note over SvelteKit Task: Holds mutex for JSON operations
    
    HTTP Task->>ModuleDevices State: Another update arrives<br/>Tries to acquire mutex
    Note over HTTP Task: WAITS 500ms+<br/>Mutex held by SvelteKit task
Loading

🎯 Why This Explains Everything

1. The Spurious 192.168.1.189 Update

// Line 80 in onUpdate (HTTP task, NO MUTEX):
JsonObject device = _state.data["devices"][updatedItem.index[0]];

// If SvelteKit task is simultaneously modifying the array in updateDevices:
// - Array is being resorted (line 169-177)
// - Indices change mid-read
// - HTTP task reads wrong device or gets empty JsonObject
// - Empty JsonObject returns 0 for integers, false for bools
// - Result: bri=0 pal=0 preset=0 sent to wrong IP!

2. The 500ms+ Mutex Hold

When updateDevices (SvelteKit task) finally acquires the mutex at line 177:

  • It's already modified the JSON (lines 132-165)
  • Now it calls update() which triggers compareRecursive
  • With 15 devices and nested JSON comparisons, this can take hundreds of milliseconds
  • Meanwhile, HTTP task's next update attempt waits for the mutex → 500ms timeout!

3. Why It Happens With Rapid Updates

  • Rapid brightness changes → multiple HTTP task update() calls
  • Each triggers onUpdate() reading from _state.data (no mutex)
  • SvelteKit task simultaneously processing UDP packets → modifying _state.data (no mutex)
  • More updates = more chances for collision

✅ Solution: Protect updateDevices with Mutex

Wrap the state modifications in a read-modify-update pattern:

void updateDevices(const UDPMessage& message, IPAddress ip) {
    // Build the update OUTSIDE any mutex
    JsonDocument doc;
    JsonObject updates = doc.to<JsonObject>();
    
    // ✅ Use update() to safely modify state
    update(
        [&](ModuleState& state) {
            if (state.data["devices"].isNull()) state.data["devices"].to<JsonArray>();
            JsonArray devices = state.data["devices"];
            
            JsonObject device = JsonObject();
            for (JsonObject dev : devices) {
                if (dev["ip"] == ip.toString()) {
                    device = dev;
                    break;
                }
            }
            if (device.isNull()) {
                device = devices.add<JsonObject>();
                EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name);
                device["ip"] = ip.toString();
            }
            
            device["time"] = time(nullptr);
            device["name"] = message.name.c_str();
            device["version"] = message.version.c_str();
            device["uptime"] = message.uptime;
            device["packageSize"] = message.packageSize;
            device["lightsOn"] = message.lightsOn;
            device["brightness"] = message.brightness;
            device["palette"] = message.palette;
            device["preset"] = message.preset;
            
            return StateUpdateResult::CHANGED;
        },
        String(_moduleName) + "server"
    );
    
    // Note: sorting happens automatically via the update mechanism
    // or move sorting logic here if needed
}

This ensures all state access is mutex-protected, preventing the race condition.

📝 Learning Added


Would you like me to help implement the corrected updateDevices() function with proper mutex protection?

🐰
(_/)
(o.o) taking notes
(> <)


✏️ Learnings added
Learnt from: ewowi
URL: https://github.com/MoonModules/MoonLight/pull/114

Timestamp: 2026-01-26T15:27:55.361Z
Learning: In MoonModules/MoonLight: Never implement auto-restart (esp_restart()) on mutex timeout or errors. The 500ms mutex timeout warning in StatefulService is an intentional debugging aid that should continue execution to enable better debugging. Only users should trigger restarts, never automatic error handling. This prevents masking bugs and allows developers to diagnose issues.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-25T10:45:58.911Z
Learning: In MoonLight ModuleDevices: Variable Length Arrays (VLAs) like `char buffer[packetSize]` on ESP32 task stack are unsafe because tasks have limited stack space (~8KB). When packetSize exceeds buffer size or stack limit, causes stack overflow and crashes (including WebSocket failures). Solution: use `char buffer[sizeof(UDPMessage)]` or static/heap allocation, and always validate packetSize before reading: `if (packetSize > sizeof(UDPMessage)) { deviceUDP.flush(); continue; }`.
</learning]>

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-26T14:59:22.069Z
Learning: In MoonModules/MoonLight PR `#114`: ModuleDevices deadlock caused by: (1) StatefulService.h line 204-205 bug - after 500ms mutex timeout, continues WITHOUT acquiring mutex, causing race conditions and JSON corruption. (2) receiveUDP() while-loop processes own UDP broadcasts via loopback, creating circular call chain: sendUDP→broadcast→receiveUDP→updateDevices→update→handlers→sendUDP. (3) updateDevices() holds _accessMutex during expensive sort/JSON operations with 15 devices, can exceed 500ms with nested calls. Solution: Fix StatefulService to wait forever or reboot on timeout; skip own broadcasts by checking message.name == ownHostname; optimize updateDevices to sort outside mutex.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-08T11:39:52.878Z
Learning: In MoonLight WebSocket monitor: When browser tabs are hidden, browsers throttle JavaScript (including WebSocket onmessage handlers) but keep TCP connections open. Monitor data (48KB frames every 20-300ms) continues arriving at TCP level but isn't processed. This fills ESP32's httpd_ws_send_frame_async() buffers, causing ESP_FAIL. When tab becomes visible, hundreds of queued frames overwhelm the JavaScript main thread, freezing the UI. Solution: use Page Visibility API (client_visibility event) to stop emitting monitor data to hidden clients, preventing buffer overflow and backlog buildup.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-06T13:45:03.067Z
Learning: For MoonLight on ESP32 at extreme loads (e.g., 49K channels @ 100 fps): Task watchdog can fire when AppDrivers/AppEffects process many nodes without yielding. Even with vTaskDelay(1) at loop end, the node iteration for-loops (PhysicalLayer::loopDrivers line 108-111, PhysicalLayer::loop line 73-75) can take >5 seconds without yielding, starving IDLE task (priority 0). Solution: add taskYIELD() or vTaskDelay(1) inside the node/layer iteration loops to give IDLE task chances to run and reset watchdog. Lowering task priority won't help since IDLE is already at 0 (lowest).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-18T12:49:07.297Z
Learning: In MoonLight channel reordering architecture: Keep rgbwBufferMapping in driver task (not effect task) because: (1) Driver has spare CPU cycles waiting for DMA/I2S/Art-Net timing; effect task is CPU-bound and starvation-prone. (2) Contiguous RGB writes in effects are cache-friendly; scattered writes at final offsets are cache-hostile for non-RGB orders (GRB, BGR, etc.). (3) forEachLightIndex already handles sparse mapping iteration; moving reordering to effects adds complexity without benefit since drivers must send full frames anyway (can't skip iteration). (4) rgbwBufferMapping does brightness/gamma LUT + white extraction + reordering—moving just reordering creates code duplication. (5) Effects should express color intent (CRGB), not hardware details (offsetRed/Green/Blue). Driver-side transformation maintains separation of concerns and task load balance for scaling to 98K+ pixels.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-15T12:38:33.598Z
Learning: In MoonLight GameOfLifeEffect (src/MoonLight/Nodes/Effects/E_MoonModules.h): The effect's nested loops (especially lines 257-323) iterate through all cells (e.g., 128×128 = 16,384) with nested neighbor checks (18 per cell = 294,912 operations) without yielding. This starves effectTask (priority 3) when higher-priority tasks like HTTP server (priority 5) run, causing the effect to disappear. UI actions (save, refresh) temporarily show one frame because HTTP completes and briefly releases CPU back to effectTask. Solution: Add taskYIELD() inside outer X loop to prevent task starvation. This is the same root cause as issue `#102` (Starsky glitches with many driver nodes).

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 84
File: src/MoonLight/Modules/ModuleLightsControl.h:367-379
Timestamp: 2025-12-28T10:28:24.157Z
Learning: In MoonLight PR `#84`: wrapping effectTask's layerP.loop() with monitorMutex while reading channelsE for monitor emission is thread-safe and prevents partial frames, but introduces serialization between effects rendering and monitor emission. Reading from channelsD (driver/front buffer) instead—with monitorMutex held only during the brief pointer swap—is superior because it maintains full parallelism (effects ║ monitor ║ driver), shows actual LED output, and minimizes mutex hold time.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 0
File: :0-0
Timestamp: 2026-01-15T15:13:55.631Z
Learning: In MoonLight effectTask (src/main.cpp): The memcpy(channelsE, channelsD) at line 137 happens AFTER releasing swapMutex, allowing driverTask to swap the channelsD/channelsE pointers mid-copy, causing buffer corruption. This specifically breaks effects like GameOfLife that use getRGB() to read previous-frame data for blending decisions. Other effects work because they only read current-frame data they just wrote. Fix: hold swapMutex during the entire memcpy operation to prevent pointer swap during copy.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 114
File: src/MoonBase/Modules/ModuleDevices.h:39-49
Timestamp: 2026-01-26T12:58:42.286Z
Learning: In MoonModules/MoonLight: Do not add null checks for essential components like `moduleControl` in ModuleDevices or similar core dependencies. A null value for these essential components indicates a programming error (bug in initialization/calling code) that should fail-fast rather than be masked with defensive checks. This aligns with the "don't mask programming errors" philosophy used throughout the codebase.

Learnt from: ewowi
Repo: MoonModules/MoonLight PR: 104
File: src/MoonLight/Layers/PhysicalLayer.cpp:145-147
Timestamp: 2026-01-13T15:18:48.740Z
Learning: In MoonLight's PhysicalLayer and Node architecture (src/MoonLight/Layers/PhysicalLayer.cpp, src/MoonBase/Nodes.h): effectsMutex, driversMutex, and layerMutex are mandatory for system operation. These mutexes are validated at creation time (constructor checks log errors if xSemaphoreCreateMutex returns nullptr). Runtime null checks on these mutex pointers are intentionally omitted—the design treats successful mutex allocation as an invariant after initialization, placing responsibility on the programmer to ensure proper allocation. This is a performance-oriented "fail-fast at boot" approach rather than defensive runtime checking.
<!-- </add_learning>

backend
=======
- module: mutexedOriginId not pointer but value
- devices: refactor updateDevices: only sort if new device
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@lib/framework/StatefulService.h`:
- Around line 201-205: The current code warns when
xSemaphoreTakeRecursive(_accessMutex,...) times out twice but then continues
without owning _accessMutex, risking concurrent mutation of _state and an
unowned mutex in endTransaction(); update the failure path so after logging the
ESP_LOGW (keeping the warning) you either block until the mutex is acquired by
calling xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY) or explicitly abort
the transaction by returning an error/setting a failure state and ensuring
callers of the surrounding function (and endTransaction()) handle this abort;
refer to _accessMutex, xSemaphoreTakeRecursive, endTransaction(), _state,
originId, pcTaskGetName and xTaskGetCurrentTaskHandle() to locate and modify the
code.

In `@src/MoonBase/Module.h`:
- Around line 47-51: The static vs non-static mix causes cross-instance races;
make the queue per-module by converting mutexedUpdatedItem, mutexedOriginId, and
updateMutex to non-static instance members (so they match the existing
non-static mutexedUpdatePending), remove their static definitions from
Module.cpp, and initialize/create the updateMutex in the ModuleState constructor
and free it in the destructor; also update postUpdate() and getUpdate() usages
to refer to the instance members (mutexedUpdatedItem, mutexedOriginId,
updateMutex, mutexedUpdatePending) so each Module instance has its own queue and
semaphore.

In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 137-176: The code mutates _state.data via the doc reference when
!_socket->getActiveClients(), causing a race; protect these writes by using the
provided update() API or acquiring _accessMutex for the entire block that
reads/creates devices, finds or adds device (the loop over devices), and then
sets device["time"], device["name"], device["version"], device["uptime"],
device["packageSize"], device["lightsOn"], device["brightness"],
device["palette"], and device["preset"]; specifically, replace the doc =
_state.data path with a copy via doc.set(_state.data) or wrap from the doc
creation through the final device[...] assignments in a scoped lock of
_accessMutex (or call update(...) so the update is mutex-protected) and leave
the early return if (!_socket->getActiveClients()) only after the protected
section.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)

100-100: Uptime calculation still returns boot timestamp, not uptime.

time(nullptr) - pal::millis() / 1000 computes the boot epoch, not the uptime. If the field is meant to hold uptime seconds, use pal::millis() / 1000 unconditionally.

🔧 Suggested fix
-        message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
+        message.uptime = pal::millis() / 1000;

Also applies to line 258.

Comment on lines 201 to 205
if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(100))==pdFALSE) {
ESP_LOGI("🐼", "_accessMutex %s wait 100ms", originId.c_str());
// ESP_LOGI("🐼", "_accessMutex %s wait 100ms", originId.c_str());
if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
ESP_LOGW("🐼", "_accessMutex %s waited 500ms and continues", originId.c_str());
ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms and continues", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str());
// xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Block or abort on mutex timeout — continuing without the lock is unsafe.

If both waits fail, the function proceeds without owning _accessMutex, so callers mutate _state concurrently and endTransaction() gives an unowned mutex. This matches the corruption issues observed in this repo. Please keep the warning (no auto‑restart), but do ensure the lock is acquired or fail the transaction.

✅ Safer fix (block after warning)
-            if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
-                ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms and continues", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str());
-                // xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);
-            }
+            if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
+                ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms; blocking until acquired", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str());
+                xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);
+            }

Based on learnings, the warning should remain but execution must not proceed without the mutex.

🤖 Prompt for AI Agents
In `@lib/framework/StatefulService.h` around lines 201 - 205, The current code
warns when xSemaphoreTakeRecursive(_accessMutex,...) times out twice but then
continues without owning _accessMutex, risking concurrent mutation of _state and
an unowned mutex in endTransaction(); update the failure path so after logging
the ESP_LOGW (keeping the warning) you either block until the mutex is acquired
by calling xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY) or explicitly
abort the transaction by returning an error/setting a failure state and ensuring
callers of the surrounding function (and endTransaction()) handle this abort;
refer to _accessMutex, xSemaphoreTakeRecursive, endTransaction(), _state,
originId, pcTaskGetName and xTaskGetCurrentTaskHandle() to locate and modify the
code.

Comment on lines +47 to +51
// for post/getUpdate
static UpdatedItem mutexedUpdatedItem;
static String mutexedOriginId;
static SemaphoreHandle_t updateMutex;
bool updatePending = false; // should not be static as each module needs to keep track of it's own status

static Char<20> updateOriginId; // static, written by ModuleState::update, no mutex needed as written by one process at a time (http mostly, sveltekit sometimes recursively)
bool mutexedUpdatePending = false; // should not be static as each module needs to keep track of it's own status
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the Module.h file around lines 47-51 to see the actual code
head -60 src/MoonBase/Module.h | tail -20

Repository: MoonModules/MoonLight

Length of output: 745


🏁 Script executed:

# Get more context from Module.h - check the class definition and static members
sed -n '40,160p' src/MoonBase/Module.h

Repository: MoonModules/MoonLight

Length of output: 4742


🏁 Script executed:

# Check Module.cpp for static member definitions
fd -e cpp -x grep -l "ModuleState::" {} \; | head -5

Repository: MoonModules/MoonLight

Length of output: 90


🏁 Script executed:

# Search for ModuleState instantiations to understand usage patterns
rg "ModuleState" --type cpp --type h -C3 | head -100

Repository: MoonModules/MoonLight

Length of output: 5174


🏁 Script executed:

# Search for Module/ModuleState instantiation to understand usage patterns
rg "new Module|Module(" --type cpp --type h -B2 -A2 | head -60

Repository: MoonModules/MoonLight

Length of output: 151


🏁 Script executed:

# Check for any guard or synchronization that might prevent multiple concurrent instances
rg "ModuleState" --type cpp -A5 -B5 | grep -E "(static|singleton|instance)" | head -20

Repository: MoonModules/MoonLight

Length of output: 668


🏁 Script executed:

# Look at the template usage - how many StatefulService<ModuleState> instances are created
rg "StatefulService<ModuleState>" --type cpp --type h -C3

Repository: MoonModules/MoonLight

Length of output: 352


🏁 Script executed:

# Search for Module instantiation in main and other files
rg "Module\(" --type cpp -B2 -A2 | head -80

Repository: MoonModules/MoonLight

Length of output: 5646


🏁 Script executed:

# Count the number of module types/instances being created
fd -e cpp | xargs grep -l "Module(" | head -10

Repository: MoonModules/MoonLight

Length of output: 101


🏁 Script executed:

# Look at main.cpp to see how many modules are instantiated
grep -n "Module" src/main.cpp | head -30

Repository: MoonModules/MoonLight

Length of output: 2360


🏁 Script executed:

# Check the Module constructor to understand instantiation
sed -n '1,50p' src/MoonBase/Module.cpp

Repository: MoonModules/MoonLight

Length of output: 2036


Race condition confirmed: mixed static/non-static queue variables with multiple Module instances.

With 9+ different Module instances in the system (ModuleLightsControl, ModuleDevices, ModuleEffects, ModuleDrivers, ModuleLiveScripts, ModuleChannels, etc.), the current design creates a race condition:

  • mutexedUpdatedItem and mutexedOriginId are static (shared across all modules)
  • mutexedUpdatePending is non-static (per-instance)

When Module A calls postUpdate(), it sets the shared static queue and its own instance flag. When Module B calls postUpdate() before Module A's getUpdate() runs, it overwrites the static queue data. Module A's getUpdate() then reads Module B's data from the corrupted queue.

Either make all three static (single shared queue) or all three non-static (per-module queues).

🔧 Option 1: Make mutexedUpdatePending static for a single shared queue
-  bool mutexedUpdatePending = false;  // should not be static as each module needs to keep track of it's own status
+  static bool mutexedUpdatePending;

And in Module.cpp:

bool ModuleState::mutexedUpdatePending = false;
🔧 Option 2: Make all three non-static for per-module queues
-  static UpdatedItem mutexedUpdatedItem;
-  static String mutexedOriginId;
+  UpdatedItem mutexedUpdatedItem;
+  String mutexedOriginId;

Remove the corresponding static definitions from Module.cpp.

🤖 Prompt for AI Agents
In `@src/MoonBase/Module.h` around lines 47 - 51, The static vs non-static mix
causes cross-instance races; make the queue per-module by converting
mutexedUpdatedItem, mutexedOriginId, and updateMutex to non-static instance
members (so they match the existing non-static mutexedUpdatePending), remove
their static definitions from Module.cpp, and initialize/create the updateMutex
in the ModuleState constructor and free it in the destructor; also update
postUpdate() and getUpdate() usages to refer to the instance members
(mutexedUpdatedItem, mutexedOriginId, updateMutex, mutexedUpdatePending) so each
Module instance has its own queue and semaphore.

Comment on lines +137 to +176
if (_socket->getActiveClients()) { // rebuild the devices array
doc.set(_state.data); // copy
} else {
doc = _state.data; // reference
}

JsonObject device = JsonObject();
// set the devices array
JsonArray devices;
devices = doc["devices"];

// find out if we have a new device
JsonObject device = JsonObject();
bool newDevice = true;
for (JsonObject dev : devices) {
if (dev["ip"] == ip.toString()) {
device = dev;
newDevice = false;
break; // found so leave for loop
// EXT_LOGD(ML_TAG, "updated ...%d %s", ip[3], name);
}
}
if (device.isNull()) {

// set the device object
if (newDevice) {
device = devices.add<JsonObject>();
EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], name);
EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name);
device["ip"] = ip.toString();
}

device["name"] = name; // name can change
device["time"] = time(nullptr); // time will change, triggering update
device["name"] = message.name.c_str();
device["version"] = message.version.c_str();
device["uptime"] = message.uptime;
device["packageSize"] = message.packageSize;
device["lightsOn"] = message.lightsOn;
device["brightness"] = message.brightness;
device["palette"] = message.palette;
device["preset"] = message.preset;

if (!_socket->getConnectedClients()) return; // no need to update if no clients
if (!_socket->getActiveClients()) return; // no need to update if no clients
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Race condition: _state.data modified without mutex protection.

When !_socket->getActiveClients(), the assignment doc = _state.data creates a reference (per comment). Subsequent writes to device["time"], device["name"], etc. (lines 166–174) then mutate _state.data directly—without holding _accessMutex. Concurrent HTTP/WebSocket tasks can read corrupted JSON.

Use the update() API to ensure mutex-protected writes, or protect this entire section with the mutex. Based on learnings, this is a known root cause of JSON corruption and spurious updates.

🔧 Suggested refactor
  void updateDevices(const UDPMessage& message, IPAddress ip) {
-    // EXT_LOGD(ML_TAG, "updateDevices ...%d %s", ip[3], name);
-    if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
-
-    // set the doc
-    JsonDocument doc;
-    if (_socket->getActiveClients()) {  // rebuild the devices array
-      doc.set(_state.data);             // copy
-    } else {
-      doc = _state.data;  // reference
-    }
+    // Always work on a copy, then push via update() for thread-safe mutation
+    JsonDocument doc;
+    doc.set(_state.data);
🤖 Prompt for AI Agents
In `@src/MoonBase/Modules/ModuleDevices.h` around lines 137 - 176, The code
mutates _state.data via the doc reference when !_socket->getActiveClients(),
causing a race; protect these writes by using the provided update() API or
acquiring _accessMutex for the entire block that reads/creates devices, finds or
adds device (the loop over devices), and then sets device["time"],
device["name"], device["version"], device["uptime"], device["packageSize"],
device["lightsOn"], device["brightness"], device["palette"], and
device["preset"]; specifically, replace the doc = _state.data path with a copy
via doc.set(_state.data) or wrap from the doc creation through the final
device[...] assignments in a scoped lock of _accessMutex (or call update(...) so
the update is mutex-protected) and leave the early return if
(!_socket->getActiveClients()) only after the protected section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants