-
-
Notifications
You must be signed in to change notification settings - Fork 11
Super sync part 1 #114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Super sync part 1 #114
Conversation
backend ======= - main: moduleLightsControl added as argument to ModuleDevices - Devices: extend UDPMessage with version, uptime, brightness, palette, preset, add LightsControl->addUpdateHandler sending UDP message when something changes, updateDevices: send new vars
WalkthroughModuleDevices gains richer UDPMessage fields and a Module* control dependency; UDP receive/send now route control vs status and update devices/UI. ModuleState/Module update APIs were made origin-aware (originId) and multiple modules’ onUpdate signatures updated; ModuleDevices integrated into main with 20ms/10s loops. Changes
Sequence DiagramsequenceDiagram
participant UI as UI/REST
participant MC as ModuleLightsControl
participant MD as ModuleDevices
participant NET as UDP Network
MC->>MD: local state update (brightness,palette,preset,lightsOn) with originId
activate MD
MD->>MD: build UDPMessage (hostname,version,uptime,packageSize,lightsOn,brightness,palette,preset,isControlCommand=false)
MD->>NET: sendUDP(broadcast/target)
MD->>UI: optionally emit device state update
deactivate MD
NET->>MD: incoming UDPMessage (remote IP)
activate MD
MD->>MD: receiveUDP -> parse & validate UDPMessage (skip own)
alt message.isControlCommand == true
MD->>MC: apply control update (use originId "udp")
else
MD->>MD: updateDevices(message, remoteIP)
MD->>UI: update device entry (name,version,uptime,packageSize,lightsOn,brightness,palette,preset)
end
deactivate MD
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 47-54: The code is assigning the whole JSON object
state.data["preset"] to message.preset (and logging it), which when cast to
uint8_t becomes 0; instead extract the selected preset field (e.g., the numeric
property like "selected" or "value") from the JSON object and assign that to
message.preset, and log that numeric value rather than serializing the whole
object; update the lambda that handles ModuleState (references: ModuleState,
state.data["preset"], message.preset, serializeJson, EXT_LOGD, MB_TAG) to read
the correct subfield and convert it to uint8_t before assignment and logging
(apply the same change to the other occurrences around lines 188-191).
- Line 58: The assignment to message.uptime currently computes time(nullptr) -
millis()/1000 which yields the boot timestamp and can underflow; update the two
locations where message.uptime is set (the message.uptime assignments in
ModuleDevices.h) to assign uptime seconds directly as millis()/1000 if you
intend to store uptime, or if you intended to store the boot epoch rename the
field (e.g., bootTime) and then compute boot time as time(nullptr) -
(millis()/1000); also ensure any consumer code uses the new semantics/field
name.
- Around line 169-175: The UDPMessage struct is partially uninitialized when
packetSize < sizeof(UDPMessage); before memcpy into UDPMessage (symbol: message)
zero-initialize it (e.g., memset(&message, 0, sizeof(message)) or
value-initialize UDPMessage message) so any trailing fields like
version/brightness/palette/preset are deterministic, then perform deviceUDP.read
into buffer and memcpy(&message, buffer, min(packetSize, sizeof(message))).
Ensure you still use ::min(packetSize, sizeof(message)) for the copy length.
src/MoonBase/Modules/ModuleDevices.h
Outdated
| _moduleName); | ||
|
|
||
| EXT_LOGD(MB_TAG, "b: %d pa %d pr %d", message.brightness, message.palette, message.preset); | ||
| message.uptime = time(nullptr) - millis() / 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uptime calculation returns boot time (and can underflow).
time(nullptr) - millis()/1000 yields the boot timestamp, not uptime, and can underflow before time is set. If you want uptime seconds, use millis()/1000; if you want boot time, rename the field.
🔧 Suggested fix
- message.uptime = time(nullptr) - millis() / 1000;
+ message.uptime = millis() / 1000;- message.uptime = time(nullptr) - millis() / 1000;
+ message.uptime = millis() / 1000;Also applies to: 194-194
🤖 Prompt for AI Agents
In `@src/MoonBase/Modules/ModuleDevices.h` at line 58, The assignment to
message.uptime currently computes time(nullptr) - millis()/1000 which yields the
boot timestamp and can underflow; update the two locations where message.uptime
is set (the message.uptime assignments in ModuleDevices.h) to assign uptime
seconds directly as millis()/1000 if you intend to store uptime, or if you
intended to store the boot epoch rename the field (e.g., bootTime) and then
compute boot time as time(nullptr) - (millis()/1000); also ensure any consumer
code uses the new semantics/field name.
|
@coderabbitai, for some reason the package send in _moduleLightsControl->addUpdateHandler arrives very late (in the next 10s update?) in another device, which calls readUDP(), calling updateDevices. The 10s message (mostly... packages missed?) |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 24698 🏁 Script executed: Length of output: 4693 🏁 Script executed: Length of output: 23214 🏁 Script executed: Length of output: 766 🏁 Script executed: Length of output: 4475 🏁 Script executed: Length of output: 2683
Root CauseWhen However, void readUDP() {
size_t packetSize = deviceUDP.parsePacket(); // ← Only called ONCE
if (packetSize >= 38) {
// ... process one packet
updateDevices(message, deviceUDP.remoteIP());
}
}If 10 packets are queued, it takes 10 seconds to drain them all at 1 packet/second, which explains the "next 10s update" delay you're seeing. SolutionDrain all available packets in void readUDP() {
// Process all available UDP packets in the buffer
while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize >= 38) { // WLED has 44, MM had ! 38
char buffer[packetSize];
UDPMessage message;
deviceUDP.read(buffer, packetSize);
memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
// EXT_LOGD(ML_TAG, "UDP packet read from %d: %s (%d)", deviceUDP.remoteIP()[3], buffer + 6, packetSize);
updateDevices(message, deviceUDP.remoteIP());
}
}
}This will process all queued packets immediately instead of one per second, eliminating the delay. |
backend ======= - main: add last20ms and use for devices - devices: loop20 for direct response, readudp loops over received packets, use preset.selected
|
@coderabbitai, can you review latest commit: ced9a76 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 7649
✅ Core Fix: readUDP() While LoopPerfect implementation of the suggested fix: while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize >= 38) {
// ... process packet
}
}This drains all queued packets immediately instead of one per call, eliminating the backlog delay. 🚀 Major Improvement: 20ms Response TimeRenaming
📝 Other Improvements
|
|
@coderabbitai, in ModuleDevices, I extended the struct UDPMessage as follows: struct UDPMessage { (packageSize and lightsOn added). If I flash this to 2 devices, an esp32-p4 and an esp32-s3, the p4 keeps running fine, but the s3 is not updating the UI showing the devices anymore (but it receives UDP packages). There are about 15 devices running in the network, which are updated on refreshing the screen, but not after that via its web socket connection. It works fine without packageSize and lightsOn added. On the browser I get the following error repeatedly - which is only 'occassionally' on the p4. I am not sure if it is because the p4 is faster and bigger ... so do not focus only on that - it might just be a coincidence it is a P4. In general, I see this BUFFER_SHORTAGE message in the browser log for a long time, so if you can get to the bottom why this is and maybe we can avoid it knowing what causes this. [Error] Error: BUFFER_SHORTAGE — bundle.js:3:9744 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 3063 🏁 Script executed: Length of output: 3727 🏁 Script executed: Length of output: 2156 🏁 Script executed: Length of output: 10886
🔴 Root Cause: WebSocket FloodingEvery UDP packet triggers a full WebSocket broadcast of all 15 devices (line 166 in void updateDevices(const UDPMessage& message, IPAddress ip) {
// ... modify device in _state.data ...
update(controls, ModuleState::update, String(_moduleName) + "server"); // ← Triggers WS broadcast
}With 15 devices in your network and
🟡 Secondary Issue: Struct Size MismatchYou have 13 old devices (sending 77-80 byte packets) and 2 new devices (expecting 80-84 bytes): # Old struct: 6+32+32+4+1+1+1 = 77 bytes → padded to ~80
# New struct: 6+32+32+4+2+1+1+1+1 = 80 bytes → padded to ~84When a new device reads an old packet: memcpy(&message, buffer, ::min(packetSize, sizeof(message))); // ← 77 vs 84 bytesThe
✅ Solutions1️⃣ Add Rate Limiting to updateDevices (Essential)Only send WebSocket updates at most once per second per device: void updateDevices(const UDPMessage& message, IPAddress ip) {
// ... existing code to update device ...
device["time"] = time(nullptr);
device["name"] = message.name;
device["version"] = message.version;
device["uptime"] = message.uptime;
device["brightness"] = message.brightness;
device["palette"] = message.palette;
device["preset"] = message.preset;
if (!_socket->getConnectedClients()) return;
// ✅ NEW: Rate limit WebSocket updates
static time_t lastWsUpdate = 0;
if (time(nullptr) - lastWsUpdate < 1) { // Max 1 update/second
return; // Skip WebSocket emission
}
lastWsUpdate = time(nullptr);
// ... existing sort and WebSocket emission code ...
update(controls, ModuleState::update, String(_moduleName) + "server");
}This reduces WebSocket updates from 750/second → 1/second, eliminating buffer overflow. 2️⃣ Fix Struct Alignment (Critical for Mixed Firmware)Add packing directive to ensure consistent size across all devices: struct UDPMessage {
uint8_t rommel[6];
Char<32> name;
Char<32> version;
uint32_t uptime;
uint16_t packageSize;
bool lightsOn;
uint8_t brightness;
uint8_t palette;
uint8_t preset;
} __attribute__((packed)); // ✅ Force no paddingAnd zero-initialize to prevent garbage: void readUDP() {
while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize >= 38) {
char buffer[packetSize];
UDPMessage message{}; // ✅ Zero-initialize
deviceUDP.read(buffer, packetSize);
memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
updateDevices(message, deviceUDP.remoteIP());
}
}
}3️⃣ Validate Packet Size (Backward Compatibility)Check for old vs new firmware: if (packetSize >= 77 && packetSize <= sizeof(UDPMessage)) { // Accept old and new
// ... process packet ...
}🔍 Why S3 Fails but P4 Works
The P4's extra RAM and speed mask the flooding issue, but both devices are being overwhelmed by 750 WebSocket messages/second. 📊 BUFFER_SHORTAGE ExplanationThis error is long-standing because even without your struct changes, the system was close to the threshold. The 3 extra bytes per device pushed it over the edge:
The solution is rate limiting, not increasing buffer sizes. TL;DR: Add the rate limiting code (solution 1) and struct packing (solution 2), then reflash all devices. The UI freeze will disappear. 🎯 🧠 Learnings used |
|
@coderabbitai, I increased the buffer length to 200 and now also the P4 protests, I use chrome to monitor the P4 (S3 on safari). Chrome shows now: Regarding:
it's not 50 updates per second the updateDevices is called, 50 times per second parsepacket is checked but only when new packages arrive they will be processed, that's every 10s per device, unless a device changes something in lights control (brightness, presets etc). Can you verify this statement and redo your math and conclusions |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 17993
✅ Corrected Update FrequencyYou're correct that
So WebSocket flooding is not the primary issue here. The actual problem is different. 🔴 Real Root Cause: Buffer OverflowWhen you changed line 172 from: char buffer[packetSize]; // VLA (Variable Length Array)to: char buffer[200]; // Fixed sizeIf deviceUDP.read(buffer, packetSize); // ← Reads packetSize bytes into 200-byte buffer!This can allow "arbitrary code execution or result in stack exhaustion". Stack-allocated buffers on ESP32 can cause "nasty crashes" when too large, and tasks "only have a limited amount of stack available". The stack corruption from buffer overflow crashes the WebSocket handler, causing: 🟡 Why BUFFER_SHORTAGE Still OccursThe
✅ Solutions1️⃣ Fix Buffer Overflow (Critical)Validate packet size before reading: void readUDP() {
while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
EXT_LOGW(MB_TAG, "Invalid UDP packet size: %d (expected %d-%d)",
packetSize, 38, sizeof(UDPMessage));
deviceUDP.flush(); // Discard invalid packet
continue;
}
char buffer[sizeof(UDPMessage)]; // Use struct size, not arbitrary 200
UDPMessage message{}; // Zero-initialize
deviceUDP.read(buffer, packetSize);
memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
updateDevices(message, deviceUDP.remoteIP());
}
}2️⃣ Add Struct Packing (For Mixed Firmware)Ensure consistent size across all devices: struct UDPMessage {
uint8_t rommel[6];
Char<32> name;
Char<32> version;
uint32_t uptime;
uint16_t packageSize;
bool lightsOn;
uint8_t brightness;
uint8_t palette;
uint8_t preset;
} __attribute__((packed)); // ✅ Force no padding3️⃣ Investigate BUFFER_SHORTAGE (Long-term)Since this is a pre-existing issue, investigate:
Would you like me to help investigate the BUFFER_SHORTAGE issue further once the buffer overflow is fixed? ✏️ Learnings added
🧠 Learnings used |
backend ======= - main: move devices from MoonBase to MoonLight (for the time being, because it sends lightscontrol) - devices: pack struct UDPMessage, write UDP in lights control updateHandler, add packageSize and lightsOn, add onUpdate sending updates to devices- WIP, writeUDP with including device update argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 188-197: Replace the VLA and unsafe reads by using a fixed-size
buffer constant (e.g., constexpr size_t MAX_UDP_PACKET = 512) and validate
packetSize against expected bounds before copying into a UDPMessage: if
packetSize < sizeof(UDPMessage) or packetSize > MAX_UDP_PACKET then read and
discard the packet into the fixed buffer (deviceUDP.read(discardBuf,
min(packetSize, MAX_UDP_PACKET))) and continue; otherwise read into the fixed
buffer (uint8_t buf[MAX_UDP_PACKET]) then memcpy(&message, buf,
sizeof(UDPMessage)) and call updateDevices(message, deviceUDP.remoteIP()).
Ensure all references to deviceUDP.parsePacket, deviceUDP.read, UDPMessage,
packetSize, buffer, and updateDevices are updated accordingly so no VLA is used
and out-of-range packets are drained.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)
201-226:uptimestill encodes boot epoch when NTP is set.
If this field is meant to be uptime seconds, usepal::millis()/1000consistently. If you want boot time, rename the field/label accordingly.🔧 Suggested fix (uptime semantics)
- message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000; + message.uptime = pal::millis() / 1000;
🧹 Nitpick comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)
75-107: Consider a short HTTP timeout to avoid blocking the UI loop if a device is offline.
A small timeout (or async) keeps the SvelteKit task responsive when POSTs can’t connect.💡 Suggested tweak (if supported by HTTPClient)
HTTPClient http; + http.setTimeout(1000); http.begin(url.c_str()); http.addHeader("Content-Type", "application/json");
|
@coderabbitai, I consider replacing the httpclient call by an udp message send to the receiver, I guess it is more leightweight and faster, is that correct ? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 6523
📊 Comparison: HTTP vs UDP
✅ Why UDP is Better Here
🔧 Implementation SuggestionInstead of the HTTP call in void onUpdate(const UpdatedItem& updatedItem) override {
if (_state.updateOriginId == (String(_moduleName) + "server").c_str()) return;
if (updatedItem.parent[0] == "devices") {
IPAddress targetIP;
if (!targetIP.fromString(_state.data["devices"][updatedItem.index[0]]["ip"].as<const char*>())) {
return; // Invalid IP
}
// Build UDP message with updated field
UDPMessage message{}; // Zero-initialize
message.name = esp32sveltekit.getWiFiSettingsService()->getHostname().c_str();
message.version = APP_VERSION;
message.packageSize = sizeof(message);
_moduleLightsControl->read([&](ModuleState& state) {
message.lightsOn = state.data["lightsOn"];
message.brightness = state.data["brightness"];
message.palette = state.data["palette"];
message.preset = state.data["preset"]["selected"];
}, _moduleName);
message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000;
// Send unicast UDP to specific device (not broadcast)
if (deviceUDP.beginPacket(targetIP, deviceUDPPort)) {
deviceUDP.write((uint8_t*)&message, sizeof(message));
deviceUDP.endPacket();
EXT_LOGD(ML_TAG, "UDP update sent to %s: %s=%d", targetIP.toString().c_str(),
updatedItem.name.c_str(), updatedItem.value.as<int>());
}
}
}
|
|
@coderabbitai, regarding
Sounds good, but the receiving device should update their lights when receiving this, in my rest api solution that is already implemented. How to do this with the UDP solution? And can web sockets also be a solution? Can web sockets also broadcast to multiple devices? And what about overhead and speed? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 13098
🔧 UDP Solution: Updating Lights on ReceiverCurrently, void updateDevices(const UDPMessage& message, IPAddress ip) {
// ... existing device list update code (lines 131-165) ...
device["time"] = time(nullptr);
device["name"] = message.name.c_str();
device["version"] = message.version.c_str();
device["uptime"] = message.uptime;
device["packageSize"] = message.packageSize;
device["lightsOn"] = message.lightsOn;
device["brightness"] = message.brightness;
device["palette"] = message.palette;
device["preset"] = message.preset;
// ✅ NEW: If this packet is intended for THIS device (not just broadcast status),
// update local lights control state to trigger hardware changes
IPAddress activeIP = WiFi.isConnected() ? WiFi.localIP() : ETH.localIP();
// Check if sender explicitly targeted this device (not broadcast)
// OR check a flag in UDPMessage to distinguish control commands from status broadcasts
if (/* add flag to UDPMessage: message.isControlCommand */) {
_moduleLightsControl->update(
[&](ModuleState& state) {
state.data["lightsOn"] = message.lightsOn;
state.data["brightness"] = message.brightness;
state.data["palette"] = message.palette;
state.data["preset"]["selected"] = message.preset;
return ModuleState::update;
},
String(_moduleName) + "udp" // originId to track update source
);
EXT_LOGD(ML_TAG, "Applied UDP control: bri=%d pal=%d preset=%d",
message.brightness, message.palette, message.preset);
}
// ... existing WebSocket update code (lines 166-185) ...
}Key insight: The Important: You'll need to distinguish between:
Add a 🌐 WebSockets: Not Suitable for Device-to-DeviceWebSockets are client-server architecture, not peer-to-peer:
Why WebSockets don't work:
📊 Protocol Comparison
Verdict: UDP is the correct choice for device-to-device light control synchronization. It's what you're already using for status broadcasts—just extend it to handle control commands too. 🎯 Recommended ArchitectureThis gives you the fastest possible sync (<20ms) with minimal overhead (80 bytes per command). Would you like me to help implement the 🧠 Learnings used |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 7285 🏁 Script executed: Length of output: 7532
🔧 UDP Solution: Making Receiver Update LightsThe key is that void readUDP() {
while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
deviceUDP.flush();
continue;
}
char buffer[sizeof(UDPMessage)];
UDPMessage message{};
deviceUDP.read(buffer, packetSize);
memcpy(&message, buffer, ::min(packetSize, sizeof(message)));
// ✅ Update device list (existing)
updateDevices(message, deviceUDP.remoteIP());
// ✅ NEW: Apply light changes to THIS device if fields are present
if (message.packageSize >= 80) { // New firmware with light control fields
_moduleLightsControl->update(
[&](ModuleState& state) {
bool changed = false;
if (state.data["lightsOn"].as<bool>() != message.lightsOn) {
state.data["lightsOn"] = message.lightsOn;
changed = true;
}
if (state.data["brightness"].as<uint8_t>() != message.brightness) {
state.data["brightness"] = message.brightness;
changed = true;
}
if (state.data["palette"].as<uint8_t>() != message.palette) {
state.data["palette"] = message.palette;
changed = true;
}
if (state.data["preset"]["selected"].as<uint8_t>() != message.preset) {
state.data["preset"]["selected"] = message.preset;
changed = true;
}
return changed ? StateUpdateResult::CHANGED : StateUpdateResult::UNCHANGED;
},
"udp-sync" // Origin ID to prevent loops
);
}
}
}Important: Add a check in _moduleLightsControl->addUpdateHandler(
[this](const String& originId) {
if (originId == "udp-sync") return; // ✅ Don't broadcast UDP changes back over UDP
writeUDP(false); // send update over the network
},
false);🌐 WebSocket Alternative AnalysisWebSockets cannot replace UDP for device-to-device sync because: Architecture Comparison
Performance Comparison
WebSocket Code Example (for reference only)If you wanted WebSockets (not recommended): // On sender device (would need to be connected as WebSocket client to other devices)
void sendWebSocketUpdate(IPAddress targetIP) {
// ❌ Problem: ESP32 would need WebSocket CLIENT library to connect to other ESP32s
// ❌ Problem: Each device needs to maintain persistent connections to all 15 devices
// ❌ Problem: 15 × 8KB = 120KB RAM just for connection buffers
// ❌ Problem: Who is server? Who is client? Need master election logic
}✅ RecommendationStick with UDP for device-to-device sync because:
The only advantage of WebSocket would be guaranteed delivery, but:
TL;DR: Use the UDP solution code above. It's faster, simpler, and fits your existing architecture perfectly. WebSockets would add complexity without meaningful benefits for this use case. 🎯 🧠 Learnings used |
backend ======= Devices: - add UDPMessage.isControlCommand - rename writeUDP to sendUDP and readUDP to receiveUDP - onUpdate: call sendUDP(targetIP) - receiveUDP: packageSize, device.clear, bufferSize, if isControlCommand then processCommand - sendUDP: targetIP argument, set isControl if targetIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 76-86: In onUpdate (method onUpdate / symbol updatedItem) add
guards to ensure updatedItem.parent and updatedItem.index are non-empty before
accessing [0]; specifically check updatedItem.parent.empty() and
updatedItem.index.empty() (or sizes) and return early if either is empty, then
proceed to read _state.data["devices"][updatedItem.index[0]]["ip"] and call
sendUDP(false, targetIP) as before; this prevents UB when indexing parent/index
and keeps the existing logic in onUpdate intact.
- Around line 39-49: The constructor currently accepts Module* moduleControl and
dereferences _moduleControl (e.g., in ModuleDevices constructor and when calling
_moduleControl->addUpdateHandler and later sendUDP callbacks) without null
checks; change the API to enforce non-null by taking Module& moduleControl
(update the ModuleDevices constructor signature and assignment to _moduleControl
to a reference) or, if pointer API must remain, validate moduleControl at start
of the ModuleDevices constructor and fail-fast (log error/throw/assert) when
null before assigning to _moduleControl and before calling addUpdateHandler;
ensure all uses reference the validated symbol (ModuleDevices constructor,
_moduleControl, addUpdateHandler, sendUDP) so no null dereference can occur.
- Around line 166-195: receiveUDP currently calls updateDevices() for every UDP
packet which can flood WebSocket sends; modify receiveUDP (and related
updateDevices usage) to coalesce or debounce broadcasts: inside receiveUDP
accumulate incoming UDPMessage changes into a single aggregated state (or track
a “dirty” flag and last-sent values) while draining the socket, then after the
receive loop call updateDevices(...) once (or call it at most once per X ms)
only if the aggregated state differs from the last-emitted state; also ensure
control-path using _moduleControl->update(...) similarly only triggers when
state actually changed to avoid redundant WebSocket pushes.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)
216-217:uptimecurrently stores boot timestamp, not uptime.With NTP set,
time(nullptr) - pal::millis()/1000yields boot time; the field name says uptime. Either rename the field or store uptime directly.🔧 Suggested fix (uptime seconds)
- message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000; + message.uptime = pal::millis() / 1000;
backend ======= Devices: - onUpdate: send UDP message using updatedItem - receiveUDP: set preset action and select - sendUDP: remove targetIP
|
@coderabbitai, can you review latest commit 2fdf976 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 5003
✅ What's Working Great1. Control Command Flow (Lines 79-106)Perfect implementation of unicast UDP control messages: message.isControlCommand = true;
deviceUDP.write((uint8_t*)&message, sizeof(message));
2. Loop Prevention (Lines 77 & 212)Smart originID-based loop prevention:
This ensures Device B doesn't re-broadcast when applying a control command from Device A ✓ 3. Broadcast Simplification (Lines 220-242)Clean separation of concerns:
|
backend ======= - Module: add mutexedOriginId, add originId to checkReOrderSwap, compareRecursive, postUpdate, processUpdatedItem, onUpdate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/MoonBase/Module.h`:
- Around line 47-52: mutexedOriginId currently stores a pointer to a String that
may be a temporary and thus can dangle; change the update queue to store a safe
copy of the origin id (e.g., replace static const String* mutexedOriginId with a
String mutexedOriginId or add a String member on UpdatedItem) and ensure
postUpdate() writes the copied String into mutexedUpdatedItem/UpdatedItem and
getUpdate() reads that copy; update any uses in postUpdate(), getUpdate(), and
the UpdatedItem structure so no call sites pass/persist raw pointers to
temporaries.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)
99-100: Uptime calculation still yields boot time, not uptime.If the field is meant to be uptime seconds,
time(nullptr) - pal::millis()/1000reports the boot epoch (and can underflow before time sync). Prefer monotonic uptime (or rename the field if boot time is intended).🔧 Suggested fix (if uptime is intended)
- message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000; + message.uptime = pal::millis() / 1000;- message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000; + message.uptime = pal::millis() / 1000;Also applies to: 237-238
🧹 Nitpick comments (1)
src/MoonBase/Modules/ModuleIO.h (1)
576-597: Factor the origin check into a single booleanRepeated
originId.endsWith("server")calls add noise and redo string scanning. Cache once for clarity and consistency.♻️ Suggested refactor
void onUpdate(const UpdatedItem& updatedItem, const String& originId) override { JsonDocument doc; JsonObject object = doc.to<JsonObject>(); - if (updatedItem.name == "boardPreset" && !originId.endsWith("server")) { // not done by this module: done by UI + const bool isServerOrigin = originId.endsWith("server"); + if (updatedItem.name == "boardPreset" && !isServerOrigin) { // not done by this module: done by UI // if booting and modded is false or ! booting if ((updatedItem.oldValue == "" && _state.data["modded"] == false) || updatedItem.oldValue != "") { // only update unmodded EXT_LOGD(MB_TAG, "%s %s[%d]%s[%d].%s = %s -> %s", originId.c_str(), updatedItem.parent[0].c_str(), updatedItem.index[0], updatedItem.parent[1].c_str(), updatedItem.index[1], updatedItem.name.c_str(), updatedItem.oldValue.c_str(), updatedItem.value.as<String>().c_str()); newBoardID = updatedItem.value; // run in sveltekit task } - } else if (updatedItem.name == "modded" && !originId.endsWith("server")) { // not done by this module: done by UI + } else if (updatedItem.name == "modded" && !isServerOrigin) { // not done by this module: done by UI // set pins to default if modded is turned off if (updatedItem.value == false) { EXT_LOGD(MB_TAG, "%s[%d]%s[%d].%s = %s -> %s", updatedItem.parent[0].c_str(), updatedItem.index[0], updatedItem.parent[1].c_str(), updatedItem.index[1], updatedItem.name.c_str(), updatedItem.oldValue.c_str(), updatedItem.value.as<String>().c_str()); newBoardID = _state.data["boardPreset"]; // run in sveltekit task } - } else if ((updatedItem.name == "switch1" || updatedItem.name == "switch2") && !originId.endsWith("server")) { + } else if ((updatedItem.name == "switch1" || updatedItem.name == "switch2") && !isServerOrigin) { // rebuild with new switch setting newBoardID = _state.data["boardPreset"]; // run in sveltekit task - } else if (updatedItem.name == "maxPower" && !originId.endsWith("server")) { + } else if (updatedItem.name == "maxPower" && !isServerOrigin) { object["modded"] = true; - } else if (updatedItem.name == "usage" && !originId.endsWith("server")) { // not done by this module: done by UI + } else if (updatedItem.name == "usage" && !isServerOrigin) { // not done by this module: done by UI object["modded"] = true; }
|
@coderabbitai, occasionally the system gots stuck on _accessMutex when I update brightness, preset, palette or lightsOn. _accessMutex is used in the ModuleState update() and read() function. Can you do a thorough review on the devices module how that can happen? Are there situations that a read() or update() is in progress and another one from another process is starting? onUpdate() and loop() are both done by the sveltekit task (please verify). or do other tasks kick in on the _accessMutex of devices? In general, please do a review of moduleDevices with all the commits added in this PR and provide a flow diagram showing the different scenarios and processes / tasks: receive a device message (every 10s + updates) per device, send the own device messages (every 10s + updates), also verify / proof no racing conditions can occur: eg. update a device -> send -> receive device update -> update UI (updateDevices) etc. Logging when _accessMutex is taken - this goes on repeatedly and the device becomes non responding requiring a restart: [ 24131][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=24 preset=15 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 27998 🏁 Script executed: Length of output: 5316 🏁 Script executed: Length of output: 248
🔴 Root Cause: Critical Bug in StatefulService.hLine 204-205 in StatefulService.h contains a DANGEROUS bug: ESP_LOGW("🐼", "_accessMutex %s waited 500ms and continues", originId.c_str());
// xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY); // ← This line is COMMENTED OUT!After the 500ms timeout, the code proceeds WITHOUT acquiring the mutex! This causes:
📊 Flow DiagramsNormal Flow (Working)sequenceDiagram
participant User
participant SvelteKit Task
participant ModuleDevices
participant ModuleLightsControl
participant UDP
User->>SvelteKit Task: Change brightness
SvelteKit Task->>ModuleLightsControl: update()
Note over ModuleLightsControl: Acquires mutex
ModuleLightsControl->>ModuleLightsControl: Update state
Note over ModuleLightsControl: Releases mutex
ModuleLightsControl->>ModuleDevices: Call update handler: sendUDP(false)
ModuleDevices->>ModuleLightsControl: read() to get state
Note over ModuleLightsControl: Acquires mutex (recursive OK)
ModuleLightsControl-->>ModuleDevices: Return state
Note over ModuleLightsControl: Releases mutex
ModuleDevices->>UDP: Broadcast packet
Deadlock Scenario (Bug Trigger)sequenceDiagram
participant loop20ms
participant receiveUDP
participant updateDevices
participant ModuleDevices State
participant UDP Buffer
Note over UDP Buffer: 5+ packets queued<br/>(own broadcasts + other devices)
loop20ms->>receiveUDP: Process packets
receiveUDP->>receiveUDP: while(parsePacket())
Note over receiveUDP: Packet 1: Control command
receiveUDP->>ModuleLightsControl: update() for control
ModuleLightsControl->>ModuleDevices: Trigger sendUDP(false)
ModuleDevices->>UDP Buffer: Broadcast (adds to queue)
Note over receiveUDP: Packet 2: Status from Device B
receiveUDP->>updateDevices: Process status
updateDevices->>ModuleDevices State: update() - Acquires mutex
Note over ModuleDevices State: Sort 15 devices<br/>Build JSON<br/>Update WebSocket
Note over receiveUDP: Packet 3: Own broadcast (loopback)
receiveUDP->>updateDevices: Process own packet AGAIN
updateDevices->>ModuleDevices State: update() - Try acquire mutex (RECURSIVE)
Note over ModuleDevices State: Nested update!
Note over receiveUDP: Packets 4, 5, 6... keep nesting
receiveUDP->>updateDevices: Keep calling updateDevices
updateDevices->>ModuleDevices State: Deeper recursion
Note over ModuleDevices State: After several nested calls:<br/>- JSON corruption from unprotected access<br/>- Stack near limit<br/>- Mutex held for >500ms
par Concurrent access attempts
loop20ms->>receiveUDP: Next 20ms tick
receiveUDP->>updateDevices: New packet
updateDevices->>ModuleDevices State: Try acquire mutex
Note over ModuleDevices State: TIMEOUT 500ms
Note over ModuleDevices State: BUG: Continues without mutex!
Note over ModuleDevices State: Corrupts JSON while other<br/>thread still processing
end
Note over ModuleDevices State: System deadlocked<br/>Repeated timeouts<br/>Data corruption
🔍 Specific Issues IdentifiedIssue 1: Self-Broadcast LoopLine 236 in sendUDP(true): deviceUDP.beginPacket(IPAddress(255, 255, 255, 255), deviceUDPPort); // BroadcastThis broadcast is received by the same device via loopback! The while loop in Issue 2: No Loop Prevention in receiveUDPLine 186-219: The while (size_t packetSize = deviceUDP.parsePacket()) {
// ... no check if message.name == own hostname!
if (message.isControlCommand) {
_moduleControl->update(newState, ModuleState::update, String(_moduleName) + "server");
} else
updateDevices(message, deviceUDP.remoteIP()); // ← Processes own broadcasts!
}Issue 3: updateDevices Holds Mutex Too LongLines 168-183: Sorting and JSON operations while holding mutex: void updateDevices(const UDPMessage& message, IPAddress ip) {
// ... build devices array ...
std::vector<JsonObject> devicesVector;
for (JsonObject dev : devices) { // ← Iteration
if (time(nullptr) - dev["time"].as<time_t>() < 86400) devicesVector.push_back(dev);
}
std::sort(devicesVector.begin(), devicesVector.end(), ...); // ← Sorting!
// ... build sorted JSON ...
update(controls, ModuleState::update, String(_moduleName) + "server"); // ← Acquires mutex
// All the above work is done INSIDE the update() call's mutex hold time!
}With 15 devices and nested calls, this can take >500ms! ✅ SolutionsSolution 1: FIX THE CRITICAL BUG (Mandatory!)In Option A: Wait forever (recommended for debugging): if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(100))==pdFALSE) {
if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
ESP_LOGW("🐼", "_accessMutex %s waited 500ms, waiting indefinitely...", originId.c_str());
xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY); // ✅ UNCOMMENT THIS!
}
}Option B: Fail-fast with error (better for production): if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(500))==pdFALSE) {
ESP_LOGE("🐼", "_accessMutex %s DEADLOCK DETECTED! Rebooting...", originId.c_str());
esp_restart(); // Better than continuing with corrupted state
}Solution 2: Prevent Self-Broadcast ProcessingIn ModuleDevices.h line 186, add origin check: void receiveUDP() {
String ownHostname = esp32sveltekit.getWiFiSettingsService()->getHostname();
while (size_t packetSize = deviceUDP.parsePacket()) {
if (packetSize < 38 || packetSize > sizeof(UDPMessage)) {
EXT_LOGW(MB_TAG, "Invalid UDP packet size: %d (expected %d-%d)", packetSize, 38, sizeof(UDPMessage));
deviceUDP.clear();
continue;
}
char buffer[sizeof(UDPMessage)];
UDPMessage message{};
deviceUDP.read(buffer, packetSize);
memcpy(&message, buffer, packetSize);
// ✅ Skip own broadcasts to prevent loops
if (message.name == ownHostname) {
EXT_LOGV(ML_TAG, "Skipping own broadcast");
continue;
}
if (message.isControlCommand) {
// ... existing code ...
} else {
updateDevices(message, deviceUDP.remoteIP());
}
}
}Solution 3: Optimize updateDevices to Reduce Mutex Hold TimeMove expensive operations outside the mutex: void updateDevices(const UDPMessage& message, IPAddress ip) {
// Build and sort BEFORE calling update()
std::vector<JsonObject> devicesVector;
JsonDocument doc2;
// Read current state
{
if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
JsonArray devices = _state.data["devices"];
// Find or create device entry
JsonObject device = JsonObject();
for (JsonObject dev : devices) {
if (dev["ip"] == ip.toString()) {
device = dev;
break;
}
}
if (device.isNull()) {
device = devices.add<JsonObject>();
EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name);
device["ip"] = ip.toString();
}
// Update fields
device["time"] = time(nullptr);
device["name"] = message.name.c_str();
device["version"] = message.version.c_str();
device["uptime"] = message.uptime;
device["packageSize"] = message.packageSize;
device["lightsOn"] = message.lightsOn;
device["brightness"] = message.brightness;
device["palette"] = message.palette;
device["preset"] = message.preset;
if (!_socket->getConnectedClients()) return; // No WebSocket clients
// Copy for sorting (outside critical section)
for (JsonObject dev : devices) {
if (time(nullptr) - dev["time"].as<time_t>() < 86400) {
devicesVector.push_back(dev);
}
}
}
// ✅ Sort OUTSIDE the mutex-protected update() call
std::sort(devicesVector.begin(), devicesVector.end(),
[](JsonObject a, JsonObject b) { return a["name"] < b["name"]; });
// Build sorted JSON
doc2["devices"].to<JsonArray>();
for (JsonObject device : devicesVector) {
doc2["devices"].add(device);
}
// ✅ Now update with pre-built, sorted data (fast operation)
JsonObject controls = doc2.as<JsonObject>();
update(controls, ModuleState::update, String(_moduleName) + "server");
}Solution 4: Add originId to UDPMessagePrevent update loops by tracking who triggered the update: struct UDPMessage {
uint8_t rommel[6];
Char<32> name;
Char<32> originDevice; // ✅ Add this
Char<32> version;
uint32_t uptime;
uint16_t packageSize;
bool lightsOn;
uint8_t brightness;
uint8_t palette;
uint8_t preset;
bool isControlCommand;
} __attribute__((packed));Then use it to prevent circular updates in onUpdate (line 77). 📋 Task/Thread VerificationConfirmed: All ModuleDevices operations run in SvelteKit task (from main.cpp lines 371, 403):
No cross-task conflicts, but the while loop in receiveUDP() creates deep recursion when processing multiple queued packets. 🎯 Recommended Action Plan
After applying solutions 1 and 2, the deadlock should disappear completely. Would you like me to generate the complete corrected code for ModuleDevices.h? ✏️ Learnings added
🧠 Learnings used |
|
@coderabbitai, regarding
To be honest, I think none of this is true:
The problem occurs when updating a number of times, I suspect that it goes wrong if another task interferes, e.g. a 10s update comes in, so look again. Side issue: I only update to 192.168.1.132 but occasionally an update for 192.168.1.189 seems to be triggered, I don't know why. Note this seems to come from a device which does not have the code of this PR installed, so the message size is 38 or 44 bytes and the isControlMessage is not set in these devices, could they be set random (not 50% much lower as it happens occasionally). Don't know if this is related to above issue? [ 41196][D][ModuleDevices.h:105] onUpdate(): [💫] ESP32 SvelteKit (3668) UDP update sent to 192.168.1.132 bri=255 pal=49 preset=15 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain
🏁 Script executed: Length of output: 13554 🏁 Script executed: Length of output: 6229 🏁 Script executed: Length of output: 3855
After carefully tracing through the code and task execution, I found the actual root cause: a critical race condition between the HTTP server task and the SvelteKit task. 🔴 Real Root Cause: Unprotected State AccessThe bug is in void updateDevices(const UDPMessage& message, IPAddress ip) {
// ❌ Lines 132-165: Modifying _state.data WITHOUT mutex protection!
if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
JsonArray devices;
JsonDocument doc;
if (_socket->getConnectedClients()) {
doc.set(_state.data); // copy
devices = doc["devices"];
} else
devices = _state.data["devices"]; // ❌ Direct reference!
JsonObject device = JsonObject();
for (JsonObject dev : devices) { // ❌ Iteration without mutex
if (dev["ip"] == ip.toString()) {
device = dev;
}
}
if (device.isNull()) {
device = devices.add<JsonObject>(); // ❌ Modification without mutex
device["ip"] = ip.toString();
}
device["time"] = time(nullptr); // ❌ All these modifications unprotected
device["name"] = message.name.c_str();
device["version"] = message.version.c_str();
// ... etc
// ✅ Only here does it acquire mutex (line 177)
update(controls, ModuleState::update, String(_moduleName) + "server");
}📊 The Race ConditionsequenceDiagram
participant HTTP Task (Priority 5)
participant SvelteKit Task (Priority 3)
participant ModuleDevices State
participant ModuleLightsControl
Note over HTTP Task: User changes brightness in UI
HTTP Task->>ModuleLightsControl: update() via REST API
Note over ModuleLightsControl: Acquires ModuleLightsControl mutex
ModuleLightsControl->>ModuleLightsControl: Updates state
Note over ModuleLightsControl: Releases mutex
ModuleLightsControl->>HTTP Task: Calls updateHandler (line 45-49)
par Concurrent execution
HTTP Task->>HTTP Task: onUpdate() starts (line 76)
HTTP Task->>ModuleDevices State: Read devices[index] (line 80)<br/>❌ NO MUTEX!
and
SvelteKit Task->>SvelteKit Task: loop20ms() -> receiveUDP()
SvelteKit Task->>SvelteKit Task: Processes UDP packet
SvelteKit Task->>ModuleDevices State: updateDevices() MODIFIES state<br/>(lines 132-165)<br/>❌ NO MUTEX!
end
Note over ModuleDevices State: JSON CORRUPTION!<br/>HTTP task reads during SvelteKit modification
HTTP Task->>ModuleDevices State: Gets corrupted device data<br/>(wrong index or default values)
HTTP Task->>HTTP Task: Sends UDP with bri=0 pal=0 preset=0<br/>to wrong IP (192.168.1.189)
SvelteKit Task->>ModuleDevices State: Now calls update() (line 177)<br/>Tries to acquire mutex
Note over SvelteKit Task: Holds mutex for JSON operations
HTTP Task->>ModuleDevices State: Another update arrives<br/>Tries to acquire mutex
Note over HTTP Task: WAITS 500ms+<br/>Mutex held by SvelteKit task
🎯 Why This Explains Everything1. The Spurious 192.168.1.189 Update// Line 80 in onUpdate (HTTP task, NO MUTEX):
JsonObject device = _state.data["devices"][updatedItem.index[0]];
// If SvelteKit task is simultaneously modifying the array in updateDevices:
// - Array is being resorted (line 169-177)
// - Indices change mid-read
// - HTTP task reads wrong device or gets empty JsonObject
// - Empty JsonObject returns 0 for integers, false for bools
// - Result: bri=0 pal=0 preset=0 sent to wrong IP!2. The 500ms+ Mutex HoldWhen updateDevices (SvelteKit task) finally acquires the mutex at line 177:
3. Why It Happens With Rapid Updates
✅ Solution: Protect updateDevices with MutexWrap the state modifications in a read-modify-update pattern: void updateDevices(const UDPMessage& message, IPAddress ip) {
// Build the update OUTSIDE any mutex
JsonDocument doc;
JsonObject updates = doc.to<JsonObject>();
// ✅ Use update() to safely modify state
update(
[&](ModuleState& state) {
if (state.data["devices"].isNull()) state.data["devices"].to<JsonArray>();
JsonArray devices = state.data["devices"];
JsonObject device = JsonObject();
for (JsonObject dev : devices) {
if (dev["ip"] == ip.toString()) {
device = dev;
break;
}
}
if (device.isNull()) {
device = devices.add<JsonObject>();
EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name);
device["ip"] = ip.toString();
}
device["time"] = time(nullptr);
device["name"] = message.name.c_str();
device["version"] = message.version.c_str();
device["uptime"] = message.uptime;
device["packageSize"] = message.packageSize;
device["lightsOn"] = message.lightsOn;
device["brightness"] = message.brightness;
device["palette"] = message.palette;
device["preset"] = message.preset;
return StateUpdateResult::CHANGED;
},
String(_moduleName) + "server"
);
// Note: sorting happens automatically via the update mechanism
// or move sorting logic here if needed
}This ensures all state access is mutex-protected, preventing the race condition. 📝 Learning AddedWould you like me to help implement the corrected 🐰 ✏️ Learnings added
🧠 Learnings used |
backend ======= - module: mutexedOriginId not pointer but value - devices: refactor updateDevices: only sort if new device
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@lib/framework/StatefulService.h`:
- Around line 201-205: The current code warns when
xSemaphoreTakeRecursive(_accessMutex,...) times out twice but then continues
without owning _accessMutex, risking concurrent mutation of _state and an
unowned mutex in endTransaction(); update the failure path so after logging the
ESP_LOGW (keeping the warning) you either block until the mutex is acquired by
calling xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY) or explicitly abort
the transaction by returning an error/setting a failure state and ensuring
callers of the surrounding function (and endTransaction()) handle this abort;
refer to _accessMutex, xSemaphoreTakeRecursive, endTransaction(), _state,
originId, pcTaskGetName and xTaskGetCurrentTaskHandle() to locate and modify the
code.
In `@src/MoonBase/Module.h`:
- Around line 47-51: The static vs non-static mix causes cross-instance races;
make the queue per-module by converting mutexedUpdatedItem, mutexedOriginId, and
updateMutex to non-static instance members (so they match the existing
non-static mutexedUpdatePending), remove their static definitions from
Module.cpp, and initialize/create the updateMutex in the ModuleState constructor
and free it in the destructor; also update postUpdate() and getUpdate() usages
to refer to the instance members (mutexedUpdatedItem, mutexedOriginId,
updateMutex, mutexedUpdatePending) so each Module instance has its own queue and
semaphore.
In `@src/MoonBase/Modules/ModuleDevices.h`:
- Around line 137-176: The code mutates _state.data via the doc reference when
!_socket->getActiveClients(), causing a race; protect these writes by using the
provided update() API or acquiring _accessMutex for the entire block that
reads/creates devices, finds or adds device (the loop over devices), and then
sets device["time"], device["name"], device["version"], device["uptime"],
device["packageSize"], device["lightsOn"], device["brightness"],
device["palette"], and device["preset"]; specifically, replace the doc =
_state.data path with a copy via doc.set(_state.data) or wrap from the doc
creation through the final device[...] assignments in a scoped lock of
_accessMutex (or call update(...) so the update is mutex-protected) and leave
the early return if (!_socket->getActiveClients()) only after the protected
section.
♻️ Duplicate comments (1)
src/MoonBase/Modules/ModuleDevices.h (1)
100-100: Uptime calculation still returns boot timestamp, not uptime.
time(nullptr) - pal::millis() / 1000computes the boot epoch, not the uptime. If the field is meant to hold uptime seconds, usepal::millis() / 1000unconditionally.🔧 Suggested fix
- message.uptime = time(nullptr) ? time(nullptr) - pal::millis() / 1000 : pal::millis() / 1000; + message.uptime = pal::millis() / 1000;Also applies to line 258.
| if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(100))==pdFALSE) { | ||
| ESP_LOGI("🐼", "_accessMutex %s wait 100ms", originId.c_str()); | ||
| // ESP_LOGI("🐼", "_accessMutex %s wait 100ms", originId.c_str()); | ||
| if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) { | ||
| ESP_LOGW("🐼", "_accessMutex %s waited 500ms and continues", originId.c_str()); | ||
| ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms and continues", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str()); | ||
| // xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block or abort on mutex timeout — continuing without the lock is unsafe.
If both waits fail, the function proceeds without owning _accessMutex, so callers mutate _state concurrently and endTransaction() gives an unowned mutex. This matches the corruption issues observed in this repo. Please keep the warning (no auto‑restart), but do ensure the lock is acquired or fail the transaction.
✅ Safer fix (block after warning)
- if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
- ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms and continues", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str());
- // xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);
- }
+ if (xSemaphoreTakeRecursive(_accessMutex, pdMS_TO_TICKS(400))==pdFALSE) {
+ ESP_LOGW("🐼", "%s _accessMutex %s waited 500ms; blocking until acquired", pcTaskGetName(xTaskGetCurrentTaskHandle()), originId.c_str());
+ xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY);
+ }Based on learnings, the warning should remain but execution must not proceed without the mutex.
🤖 Prompt for AI Agents
In `@lib/framework/StatefulService.h` around lines 201 - 205, The current code
warns when xSemaphoreTakeRecursive(_accessMutex,...) times out twice but then
continues without owning _accessMutex, risking concurrent mutation of _state and
an unowned mutex in endTransaction(); update the failure path so after logging
the ESP_LOGW (keeping the warning) you either block until the mutex is acquired
by calling xSemaphoreTakeRecursive(_accessMutex, portMAX_DELAY) or explicitly
abort the transaction by returning an error/setting a failure state and ensuring
callers of the surrounding function (and endTransaction()) handle this abort;
refer to _accessMutex, xSemaphoreTakeRecursive, endTransaction(), _state,
originId, pcTaskGetName and xTaskGetCurrentTaskHandle() to locate and modify the
code.
| // for post/getUpdate | ||
| static UpdatedItem mutexedUpdatedItem; | ||
| static String mutexedOriginId; | ||
| static SemaphoreHandle_t updateMutex; | ||
| bool updatePending = false; // should not be static as each module needs to keep track of it's own status | ||
|
|
||
| static Char<20> updateOriginId; // static, written by ModuleState::update, no mutex needed as written by one process at a time (http mostly, sveltekit sometimes recursively) | ||
| bool mutexedUpdatePending = false; // should not be static as each module needs to keep track of it's own status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the Module.h file around lines 47-51 to see the actual code
head -60 src/MoonBase/Module.h | tail -20Repository: MoonModules/MoonLight
Length of output: 745
🏁 Script executed:
# Get more context from Module.h - check the class definition and static members
sed -n '40,160p' src/MoonBase/Module.hRepository: MoonModules/MoonLight
Length of output: 4742
🏁 Script executed:
# Check Module.cpp for static member definitions
fd -e cpp -x grep -l "ModuleState::" {} \; | head -5Repository: MoonModules/MoonLight
Length of output: 90
🏁 Script executed:
# Search for ModuleState instantiations to understand usage patterns
rg "ModuleState" --type cpp --type h -C3 | head -100Repository: MoonModules/MoonLight
Length of output: 5174
🏁 Script executed:
# Search for Module/ModuleState instantiation to understand usage patterns
rg "new Module|Module(" --type cpp --type h -B2 -A2 | head -60Repository: MoonModules/MoonLight
Length of output: 151
🏁 Script executed:
# Check for any guard or synchronization that might prevent multiple concurrent instances
rg "ModuleState" --type cpp -A5 -B5 | grep -E "(static|singleton|instance)" | head -20Repository: MoonModules/MoonLight
Length of output: 668
🏁 Script executed:
# Look at the template usage - how many StatefulService<ModuleState> instances are created
rg "StatefulService<ModuleState>" --type cpp --type h -C3Repository: MoonModules/MoonLight
Length of output: 352
🏁 Script executed:
# Search for Module instantiation in main and other files
rg "Module\(" --type cpp -B2 -A2 | head -80Repository: MoonModules/MoonLight
Length of output: 5646
🏁 Script executed:
# Count the number of module types/instances being created
fd -e cpp | xargs grep -l "Module(" | head -10Repository: MoonModules/MoonLight
Length of output: 101
🏁 Script executed:
# Look at main.cpp to see how many modules are instantiated
grep -n "Module" src/main.cpp | head -30Repository: MoonModules/MoonLight
Length of output: 2360
🏁 Script executed:
# Check the Module constructor to understand instantiation
sed -n '1,50p' src/MoonBase/Module.cppRepository: MoonModules/MoonLight
Length of output: 2036
Race condition confirmed: mixed static/non-static queue variables with multiple Module instances.
With 9+ different Module instances in the system (ModuleLightsControl, ModuleDevices, ModuleEffects, ModuleDrivers, ModuleLiveScripts, ModuleChannels, etc.), the current design creates a race condition:
mutexedUpdatedItemandmutexedOriginIdare static (shared across all modules)mutexedUpdatePendingis non-static (per-instance)
When Module A calls postUpdate(), it sets the shared static queue and its own instance flag. When Module B calls postUpdate() before Module A's getUpdate() runs, it overwrites the static queue data. Module A's getUpdate() then reads Module B's data from the corrupted queue.
Either make all three static (single shared queue) or all three non-static (per-module queues).
🔧 Option 1: Make mutexedUpdatePending static for a single shared queue
- bool mutexedUpdatePending = false; // should not be static as each module needs to keep track of it's own status
+ static bool mutexedUpdatePending;And in Module.cpp:
bool ModuleState::mutexedUpdatePending = false;🔧 Option 2: Make all three non-static for per-module queues
- static UpdatedItem mutexedUpdatedItem;
- static String mutexedOriginId;
+ UpdatedItem mutexedUpdatedItem;
+ String mutexedOriginId;Remove the corresponding static definitions from Module.cpp.
🤖 Prompt for AI Agents
In `@src/MoonBase/Module.h` around lines 47 - 51, The static vs non-static mix
causes cross-instance races; make the queue per-module by converting
mutexedUpdatedItem, mutexedOriginId, and updateMutex to non-static instance
members (so they match the existing non-static mutexedUpdatePending), remove
their static definitions from Module.cpp, and initialize/create the updateMutex
in the ModuleState constructor and free it in the destructor; also update
postUpdate() and getUpdate() usages to refer to the instance members
(mutexedUpdatedItem, mutexedOriginId, updateMutex, mutexedUpdatePending) so each
Module instance has its own queue and semaphore.
| if (_socket->getActiveClients()) { // rebuild the devices array | ||
| doc.set(_state.data); // copy | ||
| } else { | ||
| doc = _state.data; // reference | ||
| } | ||
|
|
||
| JsonObject device = JsonObject(); | ||
| // set the devices array | ||
| JsonArray devices; | ||
| devices = doc["devices"]; | ||
|
|
||
| // find out if we have a new device | ||
| JsonObject device = JsonObject(); | ||
| bool newDevice = true; | ||
| for (JsonObject dev : devices) { | ||
| if (dev["ip"] == ip.toString()) { | ||
| device = dev; | ||
| newDevice = false; | ||
| break; // found so leave for loop | ||
| // EXT_LOGD(ML_TAG, "updated ...%d %s", ip[3], name); | ||
| } | ||
| } | ||
| if (device.isNull()) { | ||
|
|
||
| // set the device object | ||
| if (newDevice) { | ||
| device = devices.add<JsonObject>(); | ||
| EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], name); | ||
| EXT_LOGD(ML_TAG, "added ...%d %s", ip[3], message.name); | ||
| device["ip"] = ip.toString(); | ||
| } | ||
|
|
||
| device["name"] = name; // name can change | ||
| device["time"] = time(nullptr); // time will change, triggering update | ||
| device["name"] = message.name.c_str(); | ||
| device["version"] = message.version.c_str(); | ||
| device["uptime"] = message.uptime; | ||
| device["packageSize"] = message.packageSize; | ||
| device["lightsOn"] = message.lightsOn; | ||
| device["brightness"] = message.brightness; | ||
| device["palette"] = message.palette; | ||
| device["preset"] = message.preset; | ||
|
|
||
| if (!_socket->getConnectedClients()) return; // no need to update if no clients | ||
| if (!_socket->getActiveClients()) return; // no need to update if no clients |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition: _state.data modified without mutex protection.
When !_socket->getActiveClients(), the assignment doc = _state.data creates a reference (per comment). Subsequent writes to device["time"], device["name"], etc. (lines 166–174) then mutate _state.data directly—without holding _accessMutex. Concurrent HTTP/WebSocket tasks can read corrupted JSON.
Use the update() API to ensure mutex-protected writes, or protect this entire section with the mutex. Based on learnings, this is a known root cause of JSON corruption and spurious updates.
🔧 Suggested refactor
void updateDevices(const UDPMessage& message, IPAddress ip) {
- // EXT_LOGD(ML_TAG, "updateDevices ...%d %s", ip[3], name);
- if (_state.data["devices"].isNull()) _state.data["devices"].to<JsonArray>();
-
- // set the doc
- JsonDocument doc;
- if (_socket->getActiveClients()) { // rebuild the devices array
- doc.set(_state.data); // copy
- } else {
- doc = _state.data; // reference
- }
+ // Always work on a copy, then push via update() for thread-safe mutation
+ JsonDocument doc;
+ doc.set(_state.data);🤖 Prompt for AI Agents
In `@src/MoonBase/Modules/ModuleDevices.h` around lines 137 - 176, The code
mutates _state.data via the doc reference when !_socket->getActiveClients(),
causing a race; protect these writes by using the provided update() API or
acquiring _accessMutex for the entire block that reads/creates devices, finds or
adds device (the loop over devices), and then sets device["time"],
device["name"], device["version"], device["uptime"], device["packageSize"],
device["lightsOn"], device["brightness"], device["palette"], and
device["preset"]; specifically, replace the doc = _state.data path with a copy
via doc.set(_state.data) or wrap from the doc creation through the final
device[...] assignments in a scoped lock of _accessMutex (or call update(...) so
the update is mutex-protected) and leave the early return if
(!_socket->getActiveClients()) only after the protected section.
Summary by CodeRabbit
New Features
Refactor
UI
Chores
✏️ Tip: You can customize this high-level summary in your review settings.