fix: retry incomplete blob downloads after sync#88

Open
cbenhagen wants to merge 1 commit into n0-computer:main from cbenhagen:fix/retry-incomplete-blob-downloads

Conversation

@cbenhagen (Contributor)

Description

After a successful sync, blob downloads could silently fail to retry in several scenarios:

  1. Peer lacked the content: A sync provides entry metadata but the remote peer doesn't have the blob data. The hash is added to the in-memory `missing_hashes` set, but no `RemoteInsert` event fires on subsequent syncs with other peers who do have the blob.
  2. Restart loses state: The `missing_hashes` set is in-memory only. After a node restart, hashes that were pending download are forgotten and never retried.
  3. Download policy change: A node that initially excluded certain keys via `DownloadPolicy` would never download their blobs after switching to a more permissive policy, because no new `RemoteInsert` is emitted for entries already in the store.
  4. Download failure from flaky peer: A peer with poor connectivity writes a key. Other peers receive the entry metadata via gossip and attempt to download the blob, but the transfer fails due to the flaky connection. The hash ends up in `missing_hashes`, but nothing triggers a retry — no new `RemoteInsert` fires (the entry is already known), and no `ContentReady` is broadcast (no one else has the blob). The download is effectively stuck until either the flaky peer comes back and a new sync happens, or someone else acquires and advertises the blob.

This PR adds `check_incomplete_blobs`, called at the end of every successful sync. It scans the namespace for entries whose blobs are not yet complete (respecting the current download policy) and queues them for download with the sync peer as a candidate provider. Since `start_download` already short-circuits for complete blobs and deduplicates via `queued_hashes`, this is safe to call unconditionally.

Four integration tests cover the fix:

  • `sync_retries_blob_from_second_peer` — B syncs with C (metadata only), then A (has blob)
  • `sync_retries_blob_after_policy_change` — B excludes a key, changes policy, re-syncs
  • `sync_retries_missing_blob_propagates_content_ready` — blob propagates from A → B → C
  • `sync_retries_blob_after_restart` — persistent node restarts and re-downloads

Scenario 4 (download failure from a flaky peer) is not directly tested because reliably simulating a transport-level download failure in an integration test would require either mocking the downloader or precise timing control over a node going offline mid-transfer — neither is practical with the current test infrastructure. However, the code path is the same: regardless of why a blob is incomplete (policy, missing remote content, failed download, or lost state), `check_incomplete_blobs` finds it by checking the blob store directly and queues a fresh download.

Closes #82

/cc @gusinacio @rustonbsd

Breaking Changes

None.

Notes & open questions

  • `check_incomplete_blobs` queries all entries in the namespace on every sync. For documents with a very large number of entries this adds overhead, though `start_download` returns immediately for already-complete blobs. A future optimization could track incomplete hashes more precisely to avoid the full scan.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@cbenhagen cbenhagen force-pushed the fix/retry-incomplete-blob-downloads branch from d77bcd0 to fcd48d3 Compare March 31, 2026 10:25
@n0bot n0bot Bot added this to iroh Mar 31, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh Mar 31, 2026
@dignifiedquire dignifiedquire moved this from 🚑 Needs Triage to 👀 In review in iroh Apr 7, 2026
@Frando (Member)

Frando commented Apr 20, 2026

Thank you for this PR; it is well-motivated, and checking all blobs is likely the best we can do without larger refactors.

However, I'm a bit worried about calling this unconditionally after each sync. In large swarms with large docs this will add considerable overhead. I will need to take a closer look and think a bit about how we can best reduce that cost.

@cbenhagen (Contributor, Author)

Thanks for looking into this @Frando. The goal was indeed to improve reliability without a big refactor. I agree this needs some more love. Maybe we just debounce per namespace per N seconds?

In https://discord.com/channels/1161119546170687619/1469252711663927318 @gusinacio shared his thoughts after testing this PR:

I just tried it out and it doesn't really work for my use-case.
I have one big doc that is being used by 20+ authors and there are different users with different download policies.

That means that the missing hash should be downloaded from any peer, and it seems this is not solving that issue.
I currently have what I call a "Resyncer", which is an actor that keeps track of a list of blobs to try to download and a list of peers it could try to download from, maintaining backoffs per peer and per blob hash (when applicable), cleaning up when a download completes, and emitting events to subscribers.
This also controls the concurrency of blob downloads.
This was the most reliable way that I had to download blobs as fast as possible.

I think the hardest part is the initial sync, when a new peer tries to load our 600+ keys docs.

Key sync is the beast: it works amazingly well. But after that, it just halts trying to download the blob hashes (on the Docs side). Even this PR didn't solve it.



Development

Successfully merging this pull request may close these issues: "Sync with other peers never re-downloads blobs".