fix: retry incomplete blob downloads after sync #88
Conversation
Thank you for this PR, this is well-motivated and checking all blobs is likely the best we can do without larger refactors. However, I'm a bit worried about calling this unconditionally after each sync. In large swarms with large docs this will have a considerable overhead. I will need to take a closer look and think a bit how we can best reduce that cost.
Thanks for looking into this @Frando. The goal was indeed to improve reliability without a big refactor. I agree this needs some more love. Maybe we just debounce per namespace per N seconds? In https://discord.com/channels/1161119546170687619/1469252711663927318 @gusinacio shared his thoughts after testing this PR:
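A minimal sketch of that per-namespace debounce idea (illustrative names only, not the iroh-docs API; the real key would be a `NamespaceId` rather than a `u64`):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when the incomplete-blob check last ran for each namespace,
/// so repeated syncs within `min_interval` skip the scan.
struct CheckDebouncer {
    min_interval: Duration,
    last_run: HashMap<u64, Instant>, // key stands in for NamespaceId
}

impl CheckDebouncer {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_run: HashMap::new() }
    }

    /// Returns true if the check should run now for this namespace,
    /// recording the run time when it does.
    fn should_run(&mut self, namespace: u64) -> bool {
        let now = Instant::now();
        match self.last_run.get(&namespace) {
            Some(last) if now.duration_since(*last) < self.min_interval => false,
            _ => {
                self.last_run.insert(namespace, now);
                true
            }
        }
    }
}

fn main() {
    let mut debouncer = CheckDebouncer::new(Duration::from_secs(5));
    assert!(debouncer.should_run(1)); // first sync: run the check
    assert!(!debouncer.should_run(1)); // second sync right after: skipped
    assert!(debouncer.should_run(2)); // different namespace: runs
    println!("debounce ok");
}
```

This trades a bounded amount of staleness (at most N seconds per namespace) for fewer full scans in busy swarms.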
Description
After a successful sync, blob downloads could silently fail to retry in several scenarios:

1. The hash is recorded in the `missing_hashes` set, but no `RemoteInsert` event fires on subsequent syncs with other peers who do have the blob.
2. The `missing_hashes` set is in-memory only. After a node restart, hashes that were pending download are forgotten and never retried.
3. Entries excluded by a restrictive `DownloadPolicy` would never download their blobs after switching to a more permissive policy, because no new `RemoteInsert` is emitted for entries already in the store.
4. After a failed download from a flaky peer, the hash stays in `missing_hashes`, but nothing triggers a retry — no new `RemoteInsert` fires (the entry is already known), and no `ContentReady` is broadcast (no one else has the blob). The download is effectively stuck until either the flaky peer comes back and a new sync happens, or someone else acquires and advertises the blob.

This PR adds
`check_incomplete_blobs`, called at the end of every successful sync. It scans the namespace for entries whose blobs are not yet complete (respecting the current download policy) and queues them for download with the sync peer as a candidate provider. Since `start_download` already short-circuits for complete blobs and deduplicates via `queued_hashes`, this is safe to call unconditionally.

Four integration tests cover the fix:
- `sync_retries_blob_from_second_peer` — B syncs with C (metadata only), then A (has blob)
- `sync_retries_blob_after_policy_change` — B excludes a key, changes policy, re-syncs
- `sync_retries_missing_blob_propagates_content_ready` — blob propagates from A → B → C
- `sync_retries_blob_after_restart` — persistent node restarts and re-downloads

Scenario 4 (download failure from a flaky peer) is not directly tested, because reliably simulating a transport-level download failure in an integration test would require either mocking the downloader or precise timing control over a node going offline mid-transfer — neither is practical with the current test infrastructure. However, the code path is the same: regardless of why a blob is incomplete (policy, missing remote content, failed download, or lost state),
`check_incomplete_blobs` finds it by checking the blob store directly and queues a fresh download.

Closes #82
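The scan-and-queue behavior described above can be sketched with toy stand-in types (an illustration of the approach, not the actual iroh-docs implementation; real code works against the replica store, blob store, and downloader):

```rust
use std::collections::HashSet;

// Toy stand-ins for the real types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BlobHash(u64);
struct Entry { hash: BlobHash }

/// Sketch of the post-sync scan: walk all entries in the namespace,
/// skip blobs that are complete, policy-excluded, or already queued,
/// and queue the rest for download (with the sync peer as candidate).
fn check_incomplete_blobs(
    entries: &[Entry],
    is_complete: impl Fn(BlobHash) -> bool,
    matches_policy: impl Fn(&Entry) -> bool,
    queued_hashes: &mut HashSet<BlobHash>,
    queue_download: &mut impl FnMut(BlobHash),
) {
    for entry in entries {
        if !matches_policy(entry) {
            continue; // respect the current download policy
        }
        if is_complete(entry.hash) {
            continue; // start_download would short-circuit anyway
        }
        if queued_hashes.insert(entry.hash) {
            queue_download(entry.hash); // dedupe via queued_hashes
        }
    }
}

fn main() {
    let entries = vec![
        Entry { hash: BlobHash(1) },
        Entry { hash: BlobHash(2) },
        Entry { hash: BlobHash(3) },
    ];
    let complete: HashSet<BlobHash> = [BlobHash(1)].into_iter().collect();
    let mut queued = HashSet::new();
    let mut downloads = Vec::new();
    check_incomplete_blobs(
        &entries,
        |h| complete.contains(&h),
        |_| true,
        &mut queued,
        &mut |h| downloads.push(h),
    );
    // Only the two incomplete blobs are queued.
    assert_eq!(downloads, vec![BlobHash(2), BlobHash(3)]);
    println!("queued {} downloads", downloads.len());
}
```

The key property is that the scan is idempotent: complete and already-queued blobs are skipped, so calling it after every sync cannot queue duplicate work.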
/cc @gusinacio @rustonbsd
Breaking Changes
None.
Notes & open questions
`check_incomplete_blobs` queries all entries in the namespace on every sync. For documents with a very large number of entries this adds overhead, though `start_download` returns immediately for already-complete blobs. A future optimization could track incomplete hashes more precisely to avoid the full scan.

Change checklist
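One possible shape for that future optimization, sketched with illustrative names (a real version would also need to persist the set across restarts to cover the restart scenario):

```rust
use std::collections::HashSet;

/// Sketch: instead of scanning every entry after each sync, maintain a
/// set of known-incomplete hashes, updated when entries are inserted
/// and when downloads finish. The post-sync check then only retries
/// this set. Hashes stand in for the real blob hash type.
#[derive(Default)]
struct IncompleteTracker {
    incomplete: HashSet<u64>,
}

impl IncompleteTracker {
    /// Called when an entry is inserted whose blob is not yet complete.
    fn mark_incomplete(&mut self, hash: u64) {
        self.incomplete.insert(hash);
    }

    /// Called when a blob download completes.
    fn mark_complete(&mut self, hash: u64) {
        self.incomplete.remove(&hash);
    }

    /// What the post-sync check would retry, without a full scan.
    fn pending(&self) -> impl Iterator<Item = &u64> + '_ {
        self.incomplete.iter()
    }
}

fn main() {
    let mut tracker = IncompleteTracker::default();
    tracker.mark_incomplete(1);
    tracker.mark_incomplete(2);
    tracker.mark_complete(1);
    assert_eq!(tracker.pending().count(), 1);
    println!("pending: {}", tracker.pending().count());
}
```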