restart dead tunnels#964

Draft
rasdani wants to merge 2 commits into main from daniel/restart-dead-tunnels

Conversation


@rasdani rasdani commented Feb 27, 2026

Description

Monitors and restarts tunnels that have died in CliAgentEnv and RolloutGatewayEnv.

depends on PrimeIntellect-ai/prime#403

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes add automatic tunnel restarts and new failure paths (TunnelError) during rollouts, which can alter runtime behavior and error handling in production environments. Risk is moderate because it touches long-running background tasks and sandbox rollout completion loops.

Overview
Improves tunnel robustness by detecting dead prime_tunnel.Tunnel processes and recreating/restarting them in both CliAgentEnv.get_tunnel_url() and RolloutGatewayMixin.get_gateway_tunnel_url() (including richer logging with tunnel_id and recent frpc output).

In gateway mode, adds a background _tunnel_health_monitor task that periodically restarts dead tunnels, cancels it during teardown_gateway(), and raises a new vf.TunnelError from poll_job_completion() when a tunnel dies mid-rollout; the interception path now also raises TunnelError when the tunnel dies while waiting for agent requests.

Adds integration tests covering dead-tunnel recreation, health-monitor restarts, teardown cancellation, and the new error behavior, plus introduces TunnelError under verifiers/errors.py.

Written by Cursor Bugbot for commit 3eb0fc5. This will update automatically on new commits.

@rasdani rasdani marked this pull request as draft February 27, 2026 01:19

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


        total = len(self._tunnels)
        logger.debug(f"Health monitor: {alive}/{total} tunnels alive")
    except asyncio.CancelledError:
        return


Health monitor silently dies on tunnel restart failure

Medium Severity

_tunnel_health_monitor only catches asyncio.CancelledError. If tunnel.sync_stop() or new_tunnel.start() raises any other exception (e.g., network failure, permission error), the exception propagates out of the loop and the monitor task dies silently with no logging. Since the task is fire-and-forget, the exception is swallowed. Tunnel monitoring stops entirely until the next get_gateway_tunnel_url call happens to restart it, leaving a potentially long window with no proactive dead-tunnel detection.



 async def get_gateway_tunnel_url(self, local_addr: str | None = None) -> str:
-    """Get gateway tunnel URL, starting the tunnel if needed."""
+    """Get gateway tunnel URL, starting the tunnel if needed. Restarts dead tunnels."""


Dead tunnel not detected when local_addr is None

Low Severity

get_gateway_tunnel_url with local_addr=None and exactly one tunnel returns the tunnel's URL without checking is_running. The dead-tunnel detection and restart logic was only added to the local_addr is not None code path, so this convenience path can return a stale URL from a dead tunnel, contradicting the docstring's claim that it "Restarts dead tunnels."


class TunnelError(InfraError):
    """Raised when a tunnel process dies or becomes unreachable."""

    pass


New TunnelError not added to documented error hierarchy

Low Severity

The new vf.TunnelError class (a subclass of vf.InfraError) is a user-facing error that can be raised during rollouts in both CliAgentEnv and RolloutGatewayMixin, but the documented error hierarchy in docs/environments.md only lists vf.SandboxError as an example under vf.InfraError. Since TunnelError is exported from the package and users may want to handle tunnel failures distinctly (e.g., for retry logic), it warrants mention alongside vf.SandboxError.
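For documentation purposes, the resulting hierarchy might look like the sketch below. The placement of InfraError and SandboxError is inferred from this review comment; only TunnelError's definition is quoted above, so the base-class docstrings are assumptions:

```python
class InfraError(Exception):
    """Base class for infrastructure failures during rollouts (assumed base)."""


class SandboxError(InfraError):
    """Raised when a sandbox fails to start or becomes unhealthy (assumed docstring)."""


class TunnelError(InfraError):
    """Raised when a tunnel process dies or becomes unreachable."""
```

Because TunnelError subclasses InfraError, existing `except vf.InfraError` handlers keep working, while users who want tunnel-specific retry logic can catch `vf.TunnelError` first.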


Triggered by project rule: BugBot Instructions

