feat(pqm): opt-in findpeer fallback on dial fail #1156

Draft
lidel wants to merge 1 commit into main from feat/providerquerymanager-findpeer-fallback

@lidel lidel commented May 16, 2026

Problem

In setups where the DHT is partly separate from provider lookup, a provider's routing-record AddrInfo can be stale or thin. When that happened, the single dial in ProviderQueryManager failed and the provider was discarded, even though the DHT already knew reachable addresses for that peer.

Surfaced in ipfs/service-worker-gateway#1067, where rainbow's bitswap host could not connect to peers the DHT-side view could.

How this helps

This PR upstreams the idea from ipfs/rainbow#372 and adds opt-in WithFindPeerFallback(peerRouter) to routing/providerquerymanager. When enabled, on dial failure the manager calls FindPeer once and retries the dial with the returned AddrInfo, but only if FindPeer surfaced at least one address that wasn't in the set just tried. Default behavior is unchanged; wire with a DHT client to enable.
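
The retry rule described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in this PR: `AddrInfo`, `DialFunc`, `FindPeerFunc`, `hasNewAddr` and `dialWithFallback` are hypothetical names, and plain strings stand in for libp2p peer IDs and multiaddrs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AddrInfo is a simplified stand-in for libp2p's peer.AddrInfo
// (assumption: strings instead of peer.ID and multiaddrs).
type AddrInfo struct {
	ID    string
	Addrs []string
}

// DialFunc and FindPeerFunc are hypothetical hooks for the host dial
// and the peer-router lookup.
type DialFunc func(ctx context.Context, p AddrInfo) error
type FindPeerFunc func(ctx context.Context, id string) (AddrInfo, error)

// hasNewAddr reports whether found contains an address not in tried.
func hasNewAddr(tried, found []string) bool {
	seen := make(map[string]struct{}, len(tried))
	for _, a := range tried {
		seen[a] = struct{}{}
	}
	for _, a := range found {
		if _, ok := seen[a]; !ok {
			return true
		}
	}
	return false
}

// dialWithFallback dials p once. If that fails and findPeer is set, it
// looks the peer up once and redials only when the lookup returned an
// address that was not in the set just tried.
func dialWithFallback(ctx context.Context, dial DialFunc, findPeer FindPeerFunc, p AddrInfo) error {
	err := dial(ctx, p)
	if err == nil || findPeer == nil {
		return err // success, or fallback not enabled (default behavior)
	}
	found, ferr := findPeer(ctx, p.ID)
	if ferr != nil || !hasNewAddr(p.Addrs, found.Addrs) {
		return err // nothing new to try; keep the original dial error
	}
	return dial(ctx, found)
}

func main() {
	// Toy dialer: only one address is reachable.
	dial := func(_ context.Context, p AddrInfo) error {
		for _, a := range p.Addrs {
			if a == "/ip4/192.0.2.2/tcp/4001" {
				return nil
			}
		}
		return errors.New("dial failed")
	}
	// Toy router that knows a fresher address for the peer.
	findPeer := func(_ context.Context, id string) (AddrInfo, error) {
		return AddrInfo{ID: id, Addrs: []string{"/ip4/192.0.2.2/tcp/4001"}}, nil
	}
	stale := AddrInfo{ID: "peerA", Addrs: []string{"/ip4/192.0.2.1/tcp/4001"}}

	fmt.Println(dialWithFallback(context.Background(), dial, nil, stale))      // fallback off
	fmt.Println(dialWithFallback(context.Background(), dial, findPeer, stale)) // fallback on
}
```

With the fallback disabled the stale record simply fails to dial; with a router supplied, the second dial uses the freshly found address. Skipping the retry when FindPeer returns nothing new avoids redialing the same broken address set.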

Note

ipfs/rainbow#372 fully fixes the problem from ipfs/service-worker-gateway#1067. Kubo and other users of kad-dht are not impacted by this; it was specific to the special setup Rainbow uses.

This PR is just an idea for how we could make the opt-in fix generic. It is OK to decide not to merge this if it does not feel useful outside of Rainbow.

when the first dial to a provider fails and a peer router is
configured, ask peerrouting.FindPeer once and retry the dial with
the returned addrinfo. only retry if FindPeer surfaced at least one
address that wasn't already in the routing-record set we just
tried; redialing the same broken set would just burn another round.

opt-in via the new WithPeerRouting option; default behavior is
unchanged. originally surfaced by ipfs/service-worker-gateway#1067,
where rainbow's bitswap host could not reach a peer the dht-side
view already knew how to find.
@lidel lidel changed the title feat(providerquerymanager): findpeer fallback on dial fail feat(pqm): findpeer fallback on dial fail May 16, 2026
@lidel lidel changed the title feat(pqm): findpeer fallback on dial fail feat(pqm): opt-in findpeer fallback on dial fail May 16, 2026

codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.23%. Comparing base (4ea1540) to head (442c034).

@@            Coverage Diff             @@
##             main    #1156      +/-   ##
==========================================
+ Coverage   63.15%   63.23%   +0.07%     
==========================================
  Files         267      267              
  Lines       26867    26886      +19     
==========================================
+ Hits        16969    17002      +33     
+ Misses       8173     8162      -11     
+ Partials     1725     1722       -3     
| Files with missing lines | Coverage Δ |
|---|---|
| ...uting/providerquerymanager/providerquerymanager.go | 83.45% <100.00%> (+0.43%) ⬆️ |

... and 9 files with indirect coverage changes
