# Bug: Stale namespace persists in gossip mesh after publisher disconnect

## Summary

In a full-mesh cluster (dev branch), a namespace announced via gossip survives publisher disconnects and relay restarts. Subscribers receive SUBSCRIBE_OK followed by PUBLISH_DONE status=500 because the relay routes to a gossip peer that has no actual data.

## Environment

- Branch: `dev`
- Cluster: 6 relays in full mesh (`[cluster] connect` lists all peers)
- Publisher: moq-lite-02 client connecting to one relay (the "origin")
- Subscriber: browser via moq-transport-14 / WebTransport
- Namespace note: WebTransport sessions prepend URL path prefixes (for example `/anon`), so the browser-visible namespace may be `anon/test` even when the publisher announces `test`. When publishing via raw QUIC/iroh, include the prefix manually.

## Steps to Reproduce

1. Start all 6 relays with full-mesh config
2. Start publisher → connects to relay A, announces namespace `test`
3. Verify: subscriber connects to any relay, plays successfully
4. Stop the publisher
5. Stop relay A (the origin)
6. Restart relay A
7. Subscriber connects to relay A → gets SUBSCRIBE_OK then PUBLISH_DONE status=500

The namespace `test` is stuck. Restarting relay A alone does not fix it. Only restarting ALL relays within ~10 seconds clears the stale entry.

## Root Cause Analysis

### How gossip propagates the namespace

When relay A's publisher announces `test`:
1. Relay A stores it locally via `publish_broadcast()` (`rs/moq-lite/src/model/origin.rs:430`)
2. Relay A's gossip connections publish it to peers B-F via `run_remote_once()` (`rs/moq-relay/src/cluster.rs:108`): `.with_publish(self.origin.consume())`
3. Peers B-F each store `test` in their own origin and re-publish to their peers
4. All 6 relays now have `test` in their origin
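The propagation above can be modeled as a simple announce flood. This is a toy sketch, not the moq-relay API: `Relay`, `flood`, and `relays_knowing` are illustrative names, and the real relays exchange announces over live gossip sessions rather than a shared map.

```rust
use std::collections::{HashMap, HashSet};

struct Relay {
    origins: HashSet<String>,
}

// Flood an announce through the mesh: every relay that learns a new
// namespace re-publishes it to all of its peers.
fn flood(mesh: &mut HashMap<char, Relay>, start: char, ns: &str) {
    let mut queue = vec![start];
    while let Some(id) = queue.pop() {
        let relay = mesh.get_mut(&id).unwrap();
        if !relay.origins.insert(ns.to_string()) {
            continue; // already known: stop, so the flood terminates
        }
        // Full mesh: re-publish to every other relay.
        queue.extend(mesh.keys().copied().filter(|&p| p != id));
    }
}

fn relays_knowing(mesh: &HashMap<char, Relay>, ns: &str) -> usize {
    mesh.values().filter(|r| r.origins.contains(ns)).count()
}

fn main() {
    let mut mesh: HashMap<char, Relay> = ('A'..='F')
        .map(|id| (id, Relay { origins: HashSet::new() }))
        .collect();
    flood(&mut mesh, 'A', "test");
    println!("{} relays know `test`", relays_knowing(&mesh, "test"));
}
```

The `continue` on an already-known namespace is what terminates the flood; it is also why a stale entry is sticky — once every relay "knows" `test`, any single relay that forgets it can be re-taught by the other five.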

### How cleanup should work

When the publisher disconnects:
1. The `BroadcastConsumer` on relay A closes
2. The spawned task in `publish_broadcast()` (`origin.rs:444`) awaits `broadcast.closed()`, then calls `remove()` → `unannounce()`
3. The unannounce should propagate to peers

### Why it fails

When relay A restarts:
1. Relay A's local broadcast is gone (process died)
2. Peers B-F still have `test` in their origins (received via gossip)
3. Relay A reconnects to peers via `run_remote_once()` (`cluster.rs:108`)
4. Each peer publishes `test` back to relay A via `.with_publish(self.origin.consume())`
5. Relay A now has `test` again — but it came from gossip, not a real publisher
6. Subscriber connects, relay A finds `test` via gossip, tries to fetch from a peer, peer has no data → PUBLISH_DONE 500

The `broadcast.closed()` cleanup on peers should fire when the gossip connection to relay A drops. But the reconnection happens immediately (backoff starts at 1s in `cluster.rs:94`), and the new connection re-publishes before the old consumer's close task runs.

The `REANNOUNCE_HOLD_DOWN` of 250ms (`origin.rs:15`) is designed for cascading closures, but the reconnect+re-publish race is faster.

### Key code paths

| File | Line | Description |
|------|------|-------------|
| `rs/moq-lite/src/model/origin.rs` | 430 | `publish_broadcast()` — stores broadcast, spawns close watcher |
| `rs/moq-lite/src/model/origin.rs` | 444 | `broadcast.closed().await` — cleanup on consumer close |
| `rs/moq-lite/src/model/origin.rs` | 445 | `root.lock().remove()` → triggers unannounce |
| `rs/moq-lite/src/model/origin.rs` | 15 | `REANNOUNCE_HOLD_DOWN = 250ms` |
| `rs/moq-relay/src/cluster.rs` | 108-122 | `run_remote_once()` — connects to peer, publishes+consumes full origin |
| `rs/moq-relay/src/cluster.rs` | 94 | Backoff starts at 1s, doubles on failure |

### The race condition

```text
Time 0: Relay A dies. Connection to peer B drops.
Time 0+ε: Peer B's BroadcastConsumer for "test" should start closing
Time 1s: Peer B reconnects to relay A (backoff=1s)
Time 1s+ε: Peer B publishes "test" to relay A via .with_publish(self.origin.consume())
Time ???: Peer B's old consumer close task finally runs → unannounce
But "test" was already re-published via the new connection
```

The gossip consumer on peer B is a **different** `BroadcastConsumer` than the original publisher's. When peer B's connection to relay A drops, peer B's consumer for the relay A session closes. But `origin.rs` has backup logic — the same namespace from other gossip paths (via peers C-F) keeps the broadcast alive on peer B.
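The timing dependency can be reproduced in miniature with two threads. This is a deliberately simplified sketch: it treats the re-publish as a one-shot snapshot of peer B's origin (the real `.with_publish(self.origin.consume())` is a live subscription), and the millisecond delays stand in for the 1s backoff and the close-task latency.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Returns true if relay A ends up with the stale `test` entry.
fn race(backoff_ms: u64, cleanup_ms: u64) -> bool {
    // Peer B's origin, seeded with the namespace it learned via gossip.
    let b_origin = Arc::new(Mutex::new(HashSet::from(["test".to_string()])));

    // Close task on B: unannounce `test` after cleanup_ms.
    let o1 = Arc::clone(&b_origin);
    let cleanup = thread::spawn(move || {
        thread::sleep(Duration::from_millis(cleanup_ms));
        o1.lock().unwrap().remove("test");
    });

    // Reconnect after backoff_ms: B snapshots its origin and
    // re-publishes whatever is still in it to the restarted relay A.
    let o2 = Arc::clone(&b_origin);
    let republish = thread::spawn(move || {
        thread::sleep(Duration::from_millis(backoff_ms));
        o2.lock().unwrap().clone() // what relay A receives
    });

    cleanup.join().unwrap();
    let a_origin: HashSet<String> = republish.join().unwrap();
    a_origin.contains("test")
}

fn main() {
    // Reconnect (100ms) beats cleanup (200ms): A gets `test` back, stale.
    assert!(race(100, 200));
    // If the re-publish waited past the cleanup, A would stay clean.
    assert!(!race(300, 100));
    println!("fast reconnect resurrects the namespace");
}
```

The second call is the intuition behind fix 1 below: ordering the re-publish after the cleanup, rather than racing it, is sufficient in this toy model.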

## Possible Fixes

1. **Reconnection delay**: Wait for `REANNOUNCE_HOLD_DOWN` (or longer) after a peer disconnects before re-publishing its namespaces on a new connection. This gives the close task time to propagate.

2. **Hop-aware cleanup**: When a gossip connection drops, immediately unannounce all namespaces that were learned exclusively through that connection (no other path with equal or fewer hops).

3. **Publisher heartbeat**: Require the original publisher to periodically refresh its announce. If no refresh within N seconds, the namespace is unannounced globally.

4. **Admin purge**: Add a relay CLI command or HTTP endpoint to purge a namespace from the mesh.
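Fix 1 could be as small as a guard before the re-publish step of `run_remote_once()`. A minimal sketch, assuming a recorded disconnect timestamp; `REPUBLISH_HOLD_DOWN` and `wait_for_hold_down` are hypothetical names, and the real implementation would presumably use an async timer rather than a blocking sleep:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hold-down before re-publishing namespaces on a fresh gossip
// connection; would likely need to exceed REANNOUNCE_HOLD_DOWN (250ms).
const REPUBLISH_HOLD_DOWN: Duration = Duration::from_millis(250);

// Block until at least REPUBLISH_HOLD_DOWN has elapsed since the peer
// disconnected, giving close/unannounce tasks time to run first.
fn wait_for_hold_down(disconnected_at: Instant) {
    let elapsed = disconnected_at.elapsed();
    if elapsed < REPUBLISH_HOLD_DOWN {
        thread::sleep(REPUBLISH_HOLD_DOWN - elapsed);
    }
}

fn main() {
    let disconnected_at = Instant::now();
    // ... reconnect may happen quickly; the hold-down absorbs it ...
    wait_for_hold_down(disconnected_at);
    // Only now re-publish the origin to the reconnected peer: stale
    // entries have had the full hold-down to be unannounced first.
    assert!(disconnected_at.elapsed() >= REPUBLISH_HOLD_DOWN);
    println!("re-publish delayed by hold-down");
}
```

Note this only narrows the window rather than closing it: if a close task takes longer than the hold-down (for example under load), the race can still occur, which is why fix 2 (hop-aware cleanup) is the more robust option.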

## Logs

### Subscriber sees repeated PUBLISH_NAMESPACE from every gossip peer, then 500:

```text
[MoQT] SUBSCRIBE id=0 track="catalog"
[MoQT] PUBLISH_NAMESPACE id=1 ns="test" — sending OK
[MoQT] SUBSCRIBE_OK id=0 alias=0 track="catalog"
[MoQT] PUBLISH_DONE id=0 status=500
[MoQT] PUBLISH_NAMESPACE id=3 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=5 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=7 ns="test" — sending OK
[MoQT] PUBLISH_NAMESPACE id=9 ns="test" — sending OK
...
```

### Edge relay during origin restart — disconnect, reconnect, timeout cycle:

```text
WARN conn{id=56}: transport error err=session error: connection error: closed
INFO remote{remote=origin}: connecting to remote url=https://origin/
WARN remote{remote=origin}: QUIC connection failed err=timed out
WARN remote{remote=peer}: transport error err=session error: connection error: closed
INFO remote{remote=peer}: connecting to remote url=https://peer/
WARN remote{remote=peer}: QUIC connection failed err=timed out
```