fix(pushsync): prevent silent chunk loss on shallow receipts#5390
fix(pushsync): prevent silent chunk loss on shallow receipts#5390gacevicljubisa wants to merge 7 commits intomasterfrom
Conversation
Updated comments to correct formatting and clarity.
| // error budget and wait for any other inflight parallel pushes | ||
| // (e.g. multiplex forwards) before giving up. Only return | ||
| // ErrShallowReceipt once the entire budget is spent. | ||
| shallowReceiptResult = result.receipt |
There was a problem hiding this comment.
you've removed the skiplist addition line that was on the left side of the diff (L493). any particular reason?
There was a problem hiding this comment.
actually it is not removed, it is on the common failure path at pushsync.go:524
| // (e.g. multiplex forwards) before giving up. Only return | ||
| // ErrShallowReceipt once the entire budget is spent. | ||
| shallowReceiptResult = result.receipt | ||
| result.err = err |
There was a problem hiding this comment.
not sure if I understand this fully, but this statement and the next one in the else block seem like assigned but never used where is result.err used?
| // Retry budget exhausted: no peer in the correct neighborhood stored the | ||
| // chunk. Report CouldNotSync rather than Synced to avoid falsely marking | ||
| // the chunk as delivered. | ||
| if err := s.storer.Report(ctx, op.Chunk, storage.ChunkCouldNotSync); err != nil { |
There was a problem hiding this comment.
Changing to ChunkCouldNotSync, will cause some false reporting because the ChunkCouldNotSync state is not checked and updated in uploadstore.go ->Report function. Maybe you would need to alsi include in that switch statement also ChunkCouldNotSync.
| // is outside our AOR we would store it in a low bin and unreserve() | ||
| // would evict it almost immediately, leaving the chunk nowhere on the | ||
| // network while the origin believes it was delivered. | ||
| if swarm.Proximity(ps.address.Bytes(), chunkAddress.Bytes()) < rad { |
There was a problem hiding this comment.
What happens if you have doubling set on, and the chunk is saved into a sister neighborhood? Should we have here a check based on the committed depth?
Checklist
Description
Chunks could silently disappear from the network after a successful upload due to three bugs in the pushsync/pusher pipeline.
Root causes & fixes
1. Storer stored chunks it was about to evict (root cause)
ErrWantSelf) but the chunk was outside its AOR, it stored the chunk and sent a receipt anywayunreserve(), leaving the network with no copyErrOutOfDepthStoringwhenErrWantSelffires butproximity(chunk, self) < radius; the origin treats it as a hard failure and retries with the next peer2. Shallow receipt abandoned inflight parallel pushes
pushToClosestreturned immediately, discarding results from other parallel multiplex pushes that could have been validErrShallowReceiptonly after the full budget is exhausted3. False
ChunkSyncedafter retry exhaustionChunkSyncedeven though no node in the correct neighbourhood had stored themChunkCouldNotSyncinsteadOpen API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):