Skip to content

Fixed silent loss of reliable data on send failure and resume replay#921

Open
adrian-niculescu wants to merge 1 commit into
livekit:mainfrom
adrian-niculescu:fix-senddata-ignored-result
Open

Fixed silent loss of reliable data on send failure and resume replay#921
adrian-niculescu wants to merge 1 commit into
livekit:mainfrom
adrian-niculescu:fix-senddata-ignored-result

Conversation

@adrian-niculescu
Copy link
Copy Markdown
Contributor

@adrian-niculescu adrian-niculescu commented Apr 30, 2026

Summary

We rely on reliable LocalParticipant.publishData(...) for application-level signaling where every message must land, which is what drew my attention here.

DataChannel.send(Buffer) returns a Boolean that is false when the SCTP send queue is full, the channel is not OPEN, or the native layer rejects the buffer. RTCEngine.sendData discarded that result and always returned Result.success(Unit), so callers using LocalParticipant.publishData(...).getOrThrow() to drive retry logic could not see locally rejected packets and would treat them as delivered.

While there I also looked at the resume replay path. The Android WebRTC wrapper drains a Buffer via ByteBuffer.get(byte[]), which advances the buffer's position to its limit. reliableMessageBuffer stores each item as a ByteBuffer, and resendReliableMessagesForResume passed the same ByteBuffer instance into DataChannel.send on every replay. If a buffered item survived more than one resume attempt without server progress on lastMessageSeq (back-to-back reconnects), the second and subsequent replays sent an empty buffer and DataChannel.send returned true, so no caller could detect the loss.

resendReliableMessagesForResume also discarded DataChannel.send's boolean return value during replay, and the reconnect caller discarded the function's Result. An SCTP-rejected replay would therefore be silently dropped while the reconnect continued to client.onPCConnected(), marking the connection resumed.

Fix

  • Check the return of channel.send(buf) and surface Result.failure(RoomException.ConnectException(...)) on false.

  • For RELIABLE packets, defer reliableMessageBuffer queueing and reliableDataSequence advancement until after a successful send. A failed send no longer burns a sequence number or leaves a stale entry that would replay on resume; a caller retry reuses the same sequence.

  • The RECONNECTING fast-path (queue and return success without sending) is unchanged.

  • Keep DataPacketItem.data as ByteBuffer (the public type is part of the SDK ABI) and pass a fresh ByteBuffer.wrap(packetBytes) to DataChannel.send while queueing a separate ByteBuffer.wrap(packetBytes) in reliableMessageBuffer. In resendReliableMessagesForResume, send item.data.duplicate() per replay so the buffered item's position stays at 0 and survives multiple resume attempts.

  • In resendReliableMessagesForResume, check channel.send(...)'s boolean per replay and return Result.failure(RoomException.ConnectException(...)) on the first rejected item. The reconnect caller now logs the failure via LKLog.w instead of discarding the Result. Buffered items remain queued for the next legitimate soft reconnect.

    Whether the reconnect path should also react to a failed replay (e.g., await bufferedAmount low and retry in place, escalate to onEngineDisconnected, or have the full-reconnect path attempt reliable replay too) is a policy/behavior decision worth its own discussion. Note: simply continue-ing the reconnect loop on failure is not equivalent to "retry the resume" — lastMessageSeq is only set in the soft-reconnect branch (line 614), so the next iteration would upgrade to a full reconnect (line 574, retries != 0) and silently skip the replay (line 673), leaving buffered items orphaned. Open to direction from maintainers on what they'd prefer here; happy to do it in a follow-up.

Test plan

  • publishDataReturnsFailureWhenDataChannelSendFails in LocalParticipantMockE2ETest asserts:
    • failed send() produces Result.failure with RoomException.ConnectException
    • on retry after recovery, sentBuffers.size == 2 and the retried packet's sequence is 1 (proving the failed attempt did not consume a sequence).
  • resendReliableMessagesReplaysFullPayloadAcrossMultipleResumes in RTCEngineMockE2ETest publishes a reliable message and calls resendReliableMessagesForResume(0) twice with consumeSentBuffer = true (mirrors WebRTC's drain). It parses both replayed payloads back into DataPacket and asserts the original bytes survive repeated replay.
  • resendReliableMessagesReturnsFailureWhenReplaySendFails in RTCEngineMockE2ETest flips MockDataChannel.sendResult = false after a publish and asserts resendReliableMessagesForResume(0) returns Result.failure with RoomException.ConnectException.
  • MockDataChannel gained a sentPayloads list (snapshot of bytes visible at send time) and a consumeSentBuffer flag.
  • ./gradlew livekit-android-test:testDebugUnitTest --tests "io.livekit.android.room.RTCEngineMockE2ETest" --tests "io.livekit.android.room.participant.LocalParticipantMockE2ETest" --tests "io.livekit.android.room.RoomDataMockE2ETest" --tests "io.livekit.android.room.RoomReconnectionMockE2ETest" --rerun-tasks passes.
  • ./gradlew spotlessApply clean.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 0552bcc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
client-sdk-android Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@adrian-niculescu adrian-niculescu changed the title Fixed RTCEngine.sendData ignoring DataChannel.send return value Fixed silent loss of reliable data on send failure and resume replay Apr 30, 2026
@adrian-niculescu adrian-niculescu marked this pull request as draft April 30, 2026 23:10
@davidliu davidliu self-requested a review May 5, 2026 09:23
@adrian-niculescu adrian-niculescu force-pushed the fix-senddata-ignored-result branch 3 times, most recently from f186d3d to 730bce7 Compare May 5, 2026 14:36
@adrian-niculescu adrian-niculescu marked this pull request as ready for review May 5, 2026 14:38
@adrian-niculescu adrian-niculescu force-pushed the fix-senddata-ignored-result branch 2 times, most recently from 2e134fe to a547ce8 Compare May 5, 2026 16:29
@adrian-niculescu
Copy link
Copy Markdown
Contributor Author

@davidliu I saw your changes in #925 that wired up livekit-android-test as a friend module of livekit-android-sdk. That made testing internal APIs much easier. I leaned on that here as well: dropped the @VisibleForTesting(otherwise = NONE) shim I'd added and kept resendReliableMessagesForResume as internal, since the test module can now resolve it directly. PR description updated to match; regression tests still pass.

@adrian-niculescu adrian-niculescu force-pushed the fix-senddata-ignored-result branch from a547ce8 to 0552bcc Compare May 5, 2026 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant