proxymode tunnel stability — immediate close, interface cleanup, keepalive tolerance#6195

Open: cvl wants to merge 3 commits into mysteriumnetwork:master from shellroute:fix/proxymode-tunnel-stability

@cvl (Contributor) commented May 28, 2026

Problem

WireGuard tunnels in --proxymode die within 60-251 seconds. Every tunnel, every country. The proxy server and connection object stay alive — nothing reports unhealthy until external probes discover the death seconds later.

Root causes:

  1. proxyclient.Close() defers Device.Close() by 2 minutes in a goroutine. Orphaned devices accumulate, each sending rogue UDP keepalives to old providers. With pool cycling, dozens of phantom devices pile up.

  2. ConfigureDevice() overwrites c.Device and c.proxyClose without closing the old ones. On reconfigure (e.g. auto-reconnect), the old netstack, WireGuard device, and HTTP proxy server leak forever. The old proxy holds the port, so the new one silently fails to bind in a background goroutine.

  3. ReleaseInterface() errors on every disconnect in proxymode ("allocated interface not found") because proxymode sets interface name to myst<port> without calling AllocateInterface(). cleanAbandonedInterfaces() is not skipped for proxymode either (only dVPN was exempted).

  4. The P2P keepalive loop auto-disconnects after 3 consecutive ping failures (15-30s). In proxymode, the gateway manages tunnel health via its own sentinel and probe mechanisms. The P2P channel (NATS/UDP signaling) is less stable than the WireGuard tunnel itself in Docker — the keepalive kills perfectly healthy tunnels.

Changes

services/wireguard/endpoint/proxyclient/client.go

  • Close(): close device immediately (was 2-minute deferred goroutine). Nil both Device and proxyClose to prevent double-close.
  • ConfigureDevice(): close old device and proxy before creating new ones. Old resources extracted under lock, closed outside lock.
  • Proxy(): synchronous net.Listen + server.Serve(ln) so port conflicts fail ConfigureDevice immediately instead of silently in a background goroutine. Suppress http.ErrServerClosed on normal shutdown.
  • PeerStats(): nil/lock guard on Device — returns error instead of panic when device is closed.
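
The close-and-swap pattern described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the `device` and `client` types, the `swap` and `release` helpers, and `errNoDevice` are hypothetical stand-ins for the real proxyclient types. The sketch also leaves a small window between detaching the old resources and installing the new ones, which a real implementation would need to account for.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// device is a hypothetical stand-in for the WireGuard device.
type device struct {
	name   string
	closed bool
}

func (d *device) Close() { d.closed = true }

// client is a hypothetical stand-in for the proxyclient type.
type client struct {
	mu         sync.Mutex
	dev        *device
	proxyClose func() error
}

var errNoDevice = errors.New("device is not configured")

// swap installs new resources under the lock and returns the previous ones,
// so the caller can close them after the lock has been released.
func (c *client) swap(dev *device, proxyClose func() error) (*device, func() error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	oldDev, oldClose := c.dev, c.proxyClose
	c.dev, c.proxyClose = dev, proxyClose
	return oldDev, oldClose
}

// release closes the proxy first (freeing its port), then the device.
// Nil arguments are ignored, which makes a repeated Close a no-op.
func release(dev *device, proxyClose func() error) error {
	var err error
	if proxyClose != nil {
		err = proxyClose()
	}
	if dev != nil {
		dev.Close()
	}
	return err
}

// Configure closes any previously configured resources before installing new ones.
func (c *client) Configure(name string, proxyClose func() error) error {
	oldDev, oldClose := c.swap(nil, nil)
	if err := release(oldDev, oldClose); err != nil {
		return fmt.Errorf("closing previous resources: %w", err)
	}
	c.swap(&device{name: name}, proxyClose)
	return nil
}

// Close releases resources immediately and clears both fields.
func (c *client) Close() error {
	oldDev, oldClose := c.swap(nil, nil)
	return release(oldDev, oldClose)
}

// PeerStats returns an error instead of dereferencing a nil device.
func (c *client) PeerStats() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.dev == nil {
		return "", errNoDevice
	}
	return c.dev.name, nil
}

func main() {
	c := &client{}
	closed := 0
	countClose := func() error { closed++; return nil }
	_ = c.Configure("myst8080", countClose)
	_ = c.Configure("myst8080", countClose) // closes the first proxy
	_ = c.Close()
	_ = c.Close() // second Close finds nothing to release
	_, err := c.PeerStats()
	fmt.Println("proxy closes:", closed, "PeerStats error:", err)
}
```

Detaching the old resources while holding the lock and closing them afterwards keeps a slow device shutdown from blocking other callers that need the mutex.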

services/wireguard/endpoint/endpoint.go

  • Stop(): skip ReleaseInterface when ProxyPort > 0 (proxymode never called AllocateInterface).
  • StartConsumerMode(): same skip on configure failure path.
  • cleanAbandonedInterfaces(): skip in proxymode (same reason as dVPN — multiple concurrent connections should not destroy each other).
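
To make the skip concrete, here is a minimal sketch of why releasing a never-allocated interface fails and how a `ProxyPort > 0` guard avoids it. The `allocator` and `endpoint` types and their fields are hypothetical simplifications; only the idea of skipping the release in proxymode and the error text come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotAllocated = errors.New("allocated interface not found")

// allocator is a hypothetical stand-in for the node's interface allocator.
type allocator struct{ ifaces map[string]bool }

func (a *allocator) Allocate(name string) { a.ifaces[name] = true }

// Release fails for names that were never handed out by Allocate.
func (a *allocator) Release(name string) error {
	if !a.ifaces[name] {
		return errNotAllocated
	}
	delete(a.ifaces, name)
	return nil
}

// endpoint is a hypothetical, reduced version of the WireGuard endpoint.
type endpoint struct {
	alloc     *allocator
	iface     string
	proxyPort int
}

// Stop releases the interface only when it came from the allocator.
// In proxymode the name is derived from the port, so there is nothing to release.
func (e *endpoint) Stop() error {
	if e.proxyPort > 0 {
		return nil
	}
	return e.alloc.Release(e.iface)
}

func main() {
	a := &allocator{ifaces: map[string]bool{}}

	proxy := &endpoint{alloc: a, iface: "myst8080", proxyPort: 8080}
	fmt.Println("proxymode stop:", proxy.Stop())

	a.Allocate("myst0")
	vpn := &endpoint{alloc: a, iface: "myst0"}
	fmt.Println("allocated stop:", vpn.Stop())
}
```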

core/connection/manager.go

  • keepAliveLoop(): in proxymode (FlagProxyMode), reset error counter and continue instead of disconnecting/going OnHold. Log at warn level. The gateway's sentinel detects real tunnel death independently.
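
The decision logic can be sketched as a small state machine. This is an assumption-laden illustration rather than the code in manager.go: `keepAliveState`, `onPing`, the `action` values, and `errPing` are hypothetical names, and the threshold of three comes from the "3 consecutive ping failures" mentioned in the problem statement.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// maxKeepAliveFailures mirrors the three consecutive failures described above.
const maxKeepAliveFailures = 3

type action int

const (
	keepGoing action = iota
	disconnect
)

// errPing is a hypothetical ping failure used for the demonstration.
var errPing = errors.New("p2p ping timeout")

// keepAliveState tracks consecutive keepalive failures for one connection.
type keepAliveState struct {
	proxyMode bool
	failures  int
}

// onPing records the result of one ping and decides what the loop should do.
func (s *keepAliveState) onPing(err error) action {
	if err == nil {
		s.failures = 0
		return keepGoing
	}
	s.failures++
	if s.failures < maxKeepAliveFailures {
		return keepGoing
	}
	s.failures = 0
	if s.proxyMode {
		// Tunnel health is judged elsewhere in proxymode, so only warn.
		log.Printf("warn: %d keepalive failures, ignoring in proxymode: %v",
			maxKeepAliveFailures, err)
		return keepGoing
	}
	return disconnect
}

func main() {
	normal := &keepAliveState{}
	proxy := &keepAliveState{proxyMode: true}
	for i := 1; i <= maxKeepAliveFailures; i++ {
		fmt.Printf("failure %d: normal disconnects=%v, proxymode disconnects=%v\n",
			i, normal.onPing(errPing) == disconnect, proxy.onPing(errPing) == disconnect)
	}
}
```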

Tests

  • Test_Close_ReleasesDeviceImmediately — Device nil after Close
  • Test_Close_ProxyCloseCalledAndNilled — proxyClose invoked and cleared
  • Test_Close_Idempotent — double Close safe
  • Test_ConfigureDevice_CleansOldResources — old device/proxy cleaned on reconfigure
  • Test_ConfigureDevice_DoubleConfigureSamePort — two configures produce distinct devices
  • Test_PeerStats_NilDeviceReturnsError — nil guard returns error, no panic
  • Test_PeerStats_NilAfterClose — PeerStats safe after Close
  • Test_Proxy_SyncBindFailsOnPortConflict — port conflict detected synchronously
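
The synchronous bind that the last test exercises might look roughly like the sketch below. It is not the PR's implementation: `startProxy` and its return values are hypothetical, and only the use of `net.Listen`, `server.Serve(ln)`, and the suppression of `http.ErrServerClosed` are taken from the description of the Proxy() change.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"net/http"
)

// startProxy binds the listener in the calling goroutine, so a port conflict
// is returned to the caller, and only then serves in the background.
func startProxy(addr string, h http.Handler) (net.Addr, func() error, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, nil, fmt.Errorf("proxy listen on %s: %w", addr, err)
	}

	srv := &http.Server{Handler: h}
	go func() {
		// Serve returns http.ErrServerClosed after a normal Close; stay quiet then.
		if err := srv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("proxy serve: %v", err)
		}
	}()

	closeFn := func() error {
		err := srv.Close()
		// Also close the listener directly in case Serve has not started yet;
		// the error from a second close of the same listener is ignored.
		_ = ln.Close()
		return err
	}
	return ln.Addr(), closeFn, nil
}

func main() {
	addr, closeProxy, err := startProxy("127.0.0.1:0", http.NotFoundHandler())
	if err != nil {
		log.Fatal(err)
	}
	defer closeProxy()

	// A second proxy on the same address fails immediately instead of in a goroutine.
	_, _, err = startProxy(addr.String(), http.NotFoundHandler())
	fmt.Println("second bind failed:", err != nil)
}
```

Binding before spawning the goroutine is what lets the caller treat a busy port as a configuration error rather than discovering it later from logs.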

Results

| Metric | Before | After |
| --- | --- | --- |
| Tunnel lifespan | 60-251 seconds (all die) | 7-10+ minutes |
| "allocated interface not found" | Every disconnect | Gone |
| P2P keepalive kills | Every 15-30s | Never (proxymode) |
| Orphaned devices after disconnect | 1 per disconnect (2 min each) | 0 |

cvl added 3 commits May 28, 2026 17:53
- Close(): close device immediately instead of 2-minute deferred goroutine.
  Orphaned devices accumulated, sending rogue keepalives.
- ConfigureDevice(): close old device/proxy before creating new ones.
  ReConfigureDevice leaked old netstack, device, and proxy server.
- Proxy(): synchronous net.Listen + server.Serve so port conflicts
  fail ConfigureDevice immediately instead of silently in background.
- PeerStats(): nil/lock guard on Device to prevent panic after Close.
- Suppress http.ErrServerClosed log noise on normal shutdown.

- Stop(): skip ReleaseInterface when ProxyPort > 0. Proxymode names
  interfaces myst<port> without AllocateInterface, so release always
  errored with "allocated interface not found".
- StartConsumerMode(): same skip on configure failure path.
- cleanAbandonedInterfaces(): skip in proxymode (same reason as dVPN —
  multiple concurrent connections should not destroy each other).

In proxymode the gateway manages tunnel health via its own sentinel
and probe mechanisms. The P2P keepalive failure (3 × 5s) was killing
perfectly healthy WireGuard tunnels because the P2P channel (NATS/UDP
signaling) is less stable than the tunnel itself in Docker.
@codecov-commenter

Codecov Report

❌ Patch coverage is 56.75676% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.76%. Comparing base (8166036) to head (b170d30).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| services/wireguard/endpoint/endpoint.go | 0.00% | 6 Missing ⚠️ |
| core/connection/manager.go | 0.00% | 5 Missing ⚠️ |
| services/wireguard/endpoint/proxyclient/client.go | 80.76% | 4 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6195      +/-   ##
==========================================
+ Coverage   25.60%   25.76%   +0.16%     
==========================================
  Files         540      540              
  Lines       31160    31180      +20     
==========================================
+ Hits         7977     8033      +56     
+ Misses      22392    22347      -45     
- Partials      791      800       +9     


2 participants