
fix(pushsync): prevent silent chunk loss on shallow receipts #5390

Open

gacevicljubisa wants to merge 7 commits into master from fix/pushsync-out-of-depth-chunk-loss

Conversation

gacevicljubisa (Member) commented Mar 9, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Chunks could silently disappear from the network after a successful upload due to three bugs in the pushsync/pusher pipeline.

Root causes & fixes

1. Storer stored chunks it was about to evict (root cause)

  • When a forwarder had no closer peer (ErrWantSelf) but the chunk was outside its AOR, it stored the chunk and sent a receipt anyway
  • The chunk landed in a low proximity bin and was evicted almost immediately by unreserve(), leaving the network with no copy
  • Fix: return the pre-existing but unused ErrOutOfDepthStoring when ErrWantSelf fires but proximity(chunk, self) < radius; the origin treats it as a hard failure and retries with the next peer

2. Shallow receipt abandoned inflight parallel pushes

  • On any shallow receipt, pushToClosest returned immediately, discarding results from other parallel multiplex pushes that could have been valid
  • Only one attempt of the 32-attempt error budget was consumed per call
  • Fix: treat shallow receipt as a regular peer failure — burn the budget, skip the peer, retry; surface ErrShallowReceipt only after the full budget is exhausted

3. False ChunkSynced after retry exhaustion

  • After 6 pusher-level retries, chunks were marked ChunkSynced even though no node in the correct neighbourhood had stored them
  • Fix: report ChunkCouldNotSync instead

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

gacevicljubisa and others added 2 commits March 9, 2026 12:48
Updated comments to correct formatting and clarity.
@gacevicljubisa gacevicljubisa marked this pull request as ready for review March 10, 2026 14:33
@gacevicljubisa gacevicljubisa requested review from acud and janos March 10, 2026 14:33
// error budget and wait for any other inflight parallel pushes
// (e.g. multiplex forwards) before giving up. Only return
// ErrShallowReceipt once the entire budget is spent.
shallowReceiptResult = result.receipt
Contributor:
you've removed the skiplist addition line that was on the left side of the diff (L493). any particular reason?

gacevicljubisa (Member, Author) commented Mar 10, 2026:

Actually, it is not removed; it is on the common failure path at pushsync.go:524.

// (e.g. multiplex forwards) before giving up. Only return
// ErrShallowReceipt once the entire budget is spent.
shallowReceiptResult = result.receipt
result.err = err
acud (Contributor) commented Mar 13, 2026:

Not sure I understand this fully, but this statement and the next one in the else block seem to be assigned but never used. Where is result.err used?

// Retry budget exhausted: no peer in the correct neighborhood stored the
// chunk. Report CouldNotSync rather than Synced to avoid falsely marking
// the chunk as delivered.
if err := s.storer.Report(ctx, op.Chunk, storage.ChunkCouldNotSync); err != nil {
Contributor:

Changing to ChunkCouldNotSync will cause some false reporting, because the ChunkCouldNotSync state is not checked and updated in the uploadstore.go Report function. You may need to also include ChunkCouldNotSync in that switch statement.

// is outside our AOR we would store it in a low bin and unreserve()
// would evict it almost immediately, leaving the chunk nowhere on the
// network while the origin believes it was delivered.
if swarm.Proximity(ps.address.Bytes(), chunkAddress.Bytes()) < rad {
Contributor:

What happens if you have doubling set on, and the chunk is saved into a sister neighborhood? Should we have here a check based on the committed depth?

@gacevicljubisa gacevicljubisa added the on hold Temporarily halted by other development label Mar 16, 2026

Labels

on hold (Temporarily halted by other development); testing

3 participants