restart dead tunnels#964

Draft
rasdani wants to merge 2 commits into main from daniel/restart-dead-tunnels

Conversation


@rasdani rasdani commented Feb 27, 2026

Description

Monitors and restarts tunnels that have died in CliAgentEnv and RolloutGatewayEnv.

depends on PrimeIntellect-ai/prime#403

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes add automatic tunnel restarts and new failure paths (TunnelError) during rollouts, which can alter runtime behavior and error handling in production environments. Risk is moderate because it touches long-running background tasks and sandbox rollout completion loops.

Overview
Improves tunnel robustness by detecting dead prime_tunnel.Tunnel processes and recreating/restarting them in both CliAgentEnv.get_tunnel_url() and RolloutGatewayMixin.get_gateway_tunnel_url() (including richer logging with tunnel_id and recent frpc output).

In gateway mode, adds a background _tunnel_health_monitor task that periodically restarts dead tunnels, cancels it during teardown_gateway(), and raises a new vf.TunnelError from poll_job_completion() when a tunnel dies mid-rollout; the interception path now also raises TunnelError when the tunnel dies while waiting for agent requests.

Adds integration tests covering dead-tunnel recreation, health-monitor restarts, teardown cancellation, and the new error behavior, plus introduces TunnelError under verifiers/errors.py.

Written by Cursor Bugbot for commit 3eb0fc5. This will update automatically on new commits.

@rasdani rasdani marked this pull request as draft February 27, 2026 01:19

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


        total = len(self._tunnels)
        logger.debug(f"Health monitor: {alive}/{total} tunnels alive")
    except asyncio.CancelledError:
        return


Health monitor silently dies on tunnel restart failure

Medium Severity

_tunnel_health_monitor only catches asyncio.CancelledError. If tunnel.sync_stop() or new_tunnel.start() raises any other exception (e.g., network failure, permission error), the exception propagates out of the loop and the monitor task dies silently with no logging. Since the task is fire-and-forget, the exception is swallowed. Tunnel monitoring stops entirely until the next get_gateway_tunnel_url call happens to restart it, leaving a potentially long window with no proactive dead-tunnel detection.



 async def get_gateway_tunnel_url(self, local_addr: str | None = None) -> str:
-    """Get gateway tunnel URL, starting the tunnel if needed."""
+    """Get gateway tunnel URL, starting the tunnel if needed. Restarts dead tunnels."""


Dead tunnel not detected when local_addr is None

Low Severity

get_gateway_tunnel_url with local_addr=None and exactly one tunnel returns the tunnel's URL without checking is_running. The dead-tunnel detection and restart logic was only added to the local_addr is not None code path, so this convenience path can return a stale URL from a dead tunnel, contradicting the docstring's claim that it "Restarts dead tunnels."


class TunnelError(InfraError):
    """Raised when a tunnel process dies or becomes unreachable."""

    pass


New TunnelError not added to documented error hierarchy

Low Severity

The new vf.TunnelError class (a subclass of vf.InfraError) is a user-facing error that can be raised during rollouts in both CliAgentEnv and RolloutGatewayMixin, but the documented error hierarchy in docs/environments.md only lists vf.SandboxError as an example under vf.InfraError. Since TunnelError is exported from the package and users may want to handle tunnel failures distinctly (e.g., for retry logic), it warrants mention alongside vf.SandboxError.
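For documentation purposes, the resulting hierarchy might look like the sketch below. The placement of InfraError and SandboxError is inferred from this review comment; only TunnelError's definition is quoted above, so the base-class docstrings are assumptions:

```python
class InfraError(Exception):
    """Base class for infrastructure failures during rollouts (assumed base)."""


class SandboxError(InfraError):
    """Raised when a sandbox fails to start or becomes unhealthy (assumed docstring)."""


class TunnelError(InfraError):
    """Raised when a tunnel process dies or becomes unreachable."""
```

Because TunnelError subclasses InfraError, existing `except vf.InfraError` handlers keep working, while users who want tunnel-specific retry logic can catch `vf.TunnelError` first.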


Triggered by project rule: BugBot Instructions

