fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980
fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980nvshaxie wants to merge 6 commits into
Conversation
…upied (#3953) `findAvailableDashboardPort` already raises the right error ("All dashboard ports in range 18789-18799 are occupied …") when every port is held, but it only runs late in onboarding — after preflight, gateway start, and inference selection have already had side effects. A user with all 11 dashboard ports held by external processes (python3, etc.) saw onboard proceed all the way to the inference menu before any port-related failure surfaced, contradicting the contract in the docs / test plan. Add a narrow `preflightDashboardPortRangeAvailability()` helper to `src/lib/onboard/dashboard-port.ts` and call it from `preflight()` right before it returns the GPU detection. The helper only checks host bindings (it does not need OpenShell forward state, so it is safe before gateway start). When every port in `DASHBOARD_PORT_RANGE_START..END` is bound, it prints the same canonical message and exits 1. To keep `src/lib/onboard.ts` net-neutral per the `onboard-entrypoint-budget` workflow, the unused `findDashboardForwardOwner` import / re-export is dropped (no callers outside the colocated unit test, which imports it directly from `./onboard/dashboard-port`). Net delta on `onboard.ts`: +2/-2. Test plan - Added 3 unit tests in `src/lib/onboard/dashboard-port.test.ts` with injected `isPortBoundOnHost`/`process.exit` stubs: * exits 1 with the canonical message when every port in the range is bound; stderr lists each port → non-OpenShell host listener. * returns without exiting when 10 of 11 are bound and 1 is free. * returns without exiting when no port is bound. - `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass. - Manual end-to-end on Ubuntu 24.04 x86_64: * Occupy 18789-18799 with `python3 -m http.server` placeholders. * `nemoclaw onboard --non-interactive ...` now exits inside `[1/8] Preflight checks` with `All dashboard ports in range 18789-18799 are occupied:` followed by one line per occupied port. `[2/8] Starting OpenShell gateway` and `[3/8] Configuring inference` are never printed. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds ChangesDashboard Port Exhaustion Preflight
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review is based on supplied trusted metadata and read-only diff inspection; tests, package-manager commands, and PR scripts were not executed.; PR titles, bodies, comments, bot output, and branch names were treated as untrusted evidence unless corroborated by diff/status metadata.; The E2E Advisor comment was found and parsed, but the status rollup did not show the required E2E job names as passed for head SHA a0cc430.; Full runtime behavior, including absence of gateway/sandbox side effects, resume behavior, and explicit dashboard port override behavior, still requires E2E or maintainer-verified equivalent evidence.; Linked issue acceptance mapping includes literal issue body clauses and the supplied issue comment; environment-specific and manual reproduction clauses remain unknown or partial where not corroborated by automated evidence. Full advisor summaryPR Review AdvisorBase: The targeted port-exhaustion fix is small and directionally correct, but mergeability is BLOCKED and E2E Advisor-required onboard jobs are not evidenced as passed for this head SHA. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
(no sandboxes): No supplied automated test evidence runs
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 3844: preflightDashboardPortRangeAvailability() is being called
unconditionally which causes failures even when the user provided an explicit
--control-ui-port; update the logic so the range-exhaustion check runs only when
no explicit control UI port override is present — detect the override (the CLI
option or config value that represents the explicit port, e.g. controlUiPort /
controlUiPortOverride / options.controlUiPort or the corresponding env var used
earlier in the file) and skip calling preflightDashboardPortRangeAvailability()
when that override is set, leaving the existing explicit-port path intact (the
explicit-port handling you preserved earlier) so the onboarding respects
--control-ui-port inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a75a1aed-7ee0-41b2-be45-486dcc29bbfe
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/dashboard-port.test.tssrc/lib/onboard/dashboard-port.ts
…ort is set Addresses CodeRabbit on #3980. The previous commit called preflightDashboardPortRangeAvailability() unconditionally at the end of preflight(), which broke the explicit --control-ui-port path: a user who passes `--control-ui-port 3000` to escape the default range would still hit the "All dashboard ports in range 18789-18799 are occupied" exit, even though that range is no longer the one we'll use. Guard the call on `_preflightDashboardPort === null` so the new fail-fast only applies to the auto-allocation path. The explicit-port path is already validated separately above (lines ~3670-3700, behind the same `dashboardPortToCheck` flag). Verified end-to-end: with 18789-18799 held by python3 placeholders, `nemoclaw onboard --control-ui-port 23000` no longer prints the range-exhaustion error in preflight; without the flag, the existing fail-fast behavior is unchanged. onboard.ts diff stays at +2/-2 net 0, within the entrypoint budget. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
|
Thanks @coderabbitai — applied your suggested guard in the followup commit. Verified end-to-end on Ubuntu 24.04 with 18789-18799 held by ``` Without the flag, the existing fail-fast behavior is unchanged (still prints
|
|
Tip For best results, initiate chat on the files or code changes.
✏️ Learnings added
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/lib/onboard.ts (3)
3545-3551:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t only release the default dashboard port during gateway cleanup.
This cleanup path tears down the gateway but only stops
DASHBOARD_PORT. NemoClaw can be forwarding on an auto-allocated or explicit port, so those SSH forwards survive the cleanup and can later show up as bogus host listeners that block onboarding. The same default-port assumption is repeated in the other new cleanup call sites below.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3545 - 3551, The cleanup currently only stops DASHBOARD_PORT (calling bestEffortForwardStop(runOpenshell, DASHBOARD_PORT)), which leaves other NemoClaw forwards (auto-allocated or explicit ports) active and can create stale host listeners; update the gateway teardown path in the block using destroyGatewayForReuse and bestEffortForwardStop to stop all active forwards instead of just DASHBOARD_PORT — either by retrieving the gateway's list of forwarded ports and calling bestEffortForwardStop for each, or by adding/using a helper (e.g., stopAllForwards or similar) that accepts runOpenshell/gatewayForwardState and stops every forwarded port before calling destroyGatewayForReuse; ensure you update the other new cleanup call sites that mirror this pattern so all forwarded ports are cleaned up, referencing bestEffortForwardStop, runOpenshell, DASHBOARD_PORT, and destroyGatewayForReuse to locate the changes.
8981-8981:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTreat
NEMOCLAW_DASHBOARD_PORTas an explicit preflight override.Line 8981 only seeds
_preflightDashboardPortfromopts.controlUiPort. If the operator setsNEMOCLAW_DASHBOARD_PORT=23000and the default 18789-18799 range is full, this still takes the auto-allocation branch, runs the new range-exhaustion check, and exits even though 23000 is the intended dashboard port. The explicit-port path and the fail-fast skip need to use the resolved env override too.💡 Suggested fix
- _preflightDashboardPort = opts.controlUiPort || null; + const envDashboardPort = + process.env.NEMOCLAW_DASHBOARD_PORT != null ? DASHBOARD_PORT : null; + _preflightDashboardPort = opts.controlUiPort ?? envDashboardPort ?? null;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` at line 8981, The code only initializes _preflightDashboardPort from opts.controlUiPort; treat the environment override NEMOCLAW_DASHBOARD_PORT as an explicit preflight override by reading process.env.NEMOCLAW_DASHBOARD_PORT (if present) when setting _preflightDashboardPort so that the explicit-port path is taken; update the dashboard-preflight logic that runs the range-exhaustion check (the auto-allocation branch) to skip that check when _preflightDashboardPort is resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT so the operator-provided port (23000) is honored instead of triggering range-exhaustion exit.
9328-9350:⚠️ Potential issue | 🟠 Major | ⚡ Quick winResume still bypasses the new dashboard-range fail-fast.
When cached preflight is reused here, the new guard at Line 3827 never runs. If the default range becomes fully occupied after the initial attempt,
--resumecan still progress into later steps and only fail in dashboard allocation after sandbox work has started, which is exactly the late failure this PR is trying to move forward.💡 Suggested fix
assertCdiNvidiaGpuSpecPresent(assessHost(), resumeOptedOutGpuPassthrough); validateSandboxGpuPreflight(resumeSandboxGpuConfig); + if (_preflightDashboardPort === null) { + preflightDashboardPortRangeAvailability(); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9328 - 9350, Resume path skips the new dashboard-range fail-fast guard; ensure the same dashboard allocation check used in the normal preflight path is invoked during resume so we fail fast on range exhaustion. Modify the resumePreflight branch (the if (resumePreflight) block that calls skippedStepMessage, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors, assertCdiNvidiaGpuSpecPresent, and validateSandboxGpuPreflight) to call the dashboard-range fail-fast function used by the non-resume flow (the same guard invoked outside this block) before proceeding to sandbox GPU validation so resume cannot bypass dashboard allocation checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3545-3551: The cleanup currently only stops DASHBOARD_PORT
(calling bestEffortForwardStop(runOpenshell, DASHBOARD_PORT)), which leaves
other NemoClaw forwards (auto-allocated or explicit ports) active and can create
stale host listeners; update the gateway teardown path in the block using
destroyGatewayForReuse and bestEffortForwardStop to stop all active forwards
instead of just DASHBOARD_PORT — either by retrieving the gateway's list of
forwarded ports and calling bestEffortForwardStop for each, or by adding/using a
helper (e.g., stopAllForwards or similar) that accepts
runOpenshell/gatewayForwardState and stops every forwarded port before calling
destroyGatewayForReuse; ensure you update the other new cleanup call sites that
mirror this pattern so all forwarded ports are cleaned up, referencing
bestEffortForwardStop, runOpenshell, DASHBOARD_PORT, and destroyGatewayForReuse
to locate the changes.
- Line 8981: The code only initializes _preflightDashboardPort from
opts.controlUiPort; treat the environment override NEMOCLAW_DASHBOARD_PORT as an
explicit preflight override by reading process.env.NEMOCLAW_DASHBOARD_PORT (if
present) when setting _preflightDashboardPort so that the explicit-port path is
taken; update the dashboard-preflight logic that runs the range-exhaustion check
(the auto-allocation branch) to skip that check when _preflightDashboardPort is
resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT
so the operator-provided port (23000) is honored instead of triggering
range-exhaustion exit.
- Around line 9328-9350: Resume path skips the new dashboard-range fail-fast
guard; ensure the same dashboard allocation check used in the normal preflight
path is invoked during resume so we fail fast on range exhaustion. Modify the
resumePreflight branch (the if (resumePreflight) block that calls
skippedStepMessage, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors,
assertCdiNvidiaGpuSpecPresent, and validateSandboxGpuPreflight) to call the
dashboard-range fail-fast function used by the non-resume flow (the same guard
invoked outside this block) before proceeding to sandbox GPU validation so
resume cannot bypass dashboard allocation checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fefc6d8c-a930-484a-b4e7-033502152492
📒 Files selected for processing (1)
src/lib/onboard.ts
… gate resume path CodeRabbit pointed out two follow-ups on PR #3980's fail-fast preflight: 1. The env override NEMOCLAW_DASHBOARD_PORT pins an explicit dashboard port the same way --control-ui-port does, but only --control-ui-port was disabling the new range-exhaustion fail-fast. Operators pinning 18900 with the default 18789-18799 range full would still exit on preflight even though their port was outside the range. Seed _preflightDashboardPort from the resolved env-override too. 2. The new fail-fast guard ran in the normal preflight path only; the --resume path reuses cached preflight and would walk past an exhausted range, only failing later in dashboard allocation after sandbox work has started — exactly the late failure this PR is trying to prevent. Mirror the guard in the resume branch. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
|
Thanks @coderabbitai — addressed the two findings that scope to this PR's preflight change in commit ec7e315:
Skipping |
|
Tip For best results, initiate chat on the files or code changes.
On lines 3545–3551 — filing it as a separate issue is exactly the right call. Would you like me to open a follow-up issue in this repo to track the cleanup of those 🐇 ✨ ✏️ Learnings added
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8981-8985: The new const envDashboardPort is both wrong (uses
DASHBOARD_PORT instead of the env value) and adds lines; replace the two-line
block with a single expression that uses the env var directly and preserves the
precedence: set _preflightDashboardPort = opts.controlUiPort ??
(process.env.NEMOCLAW_DASHBOARD_PORT ?? null); do the same collapse for the
other occurrence that mirrors this logic (the block around the symbols at the
second location).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a30a6e9f-573d-4e98-a9a2-bf83c2497dc5
📒 Files selected for processing (1)
src/lib/onboard.ts
Replace the verbose 4-line env-override comment block with a single inline ternary on the assignment line, and pack the resume guard onto the existing validateSandboxGpuPreflight call. Semantically identical to ec7e315; just brings src/lib/onboard.ts back under the onboard-entrypoint-budget gate (additions ≤ deletions). Signed-off-by: Shawn Xie <shaxie@nvidia.com>
Summary
findAvailableDashboardPortalready raises the right "All dashboard ports in range 18789-18799 are occupied …" error when every port is held, but it only fires in late onboarding (after preflight + gateway start + inference selection). Reporters expect the failure to be upfront.preflightDashboardPortRangeAvailability()helper insrc/lib/onboard/dashboard-port.tsthat only checks host bindings (safe to run before OpenShell is reachable) and call it at the end ofpreflight(). Exits 1 with the same canonical message when every port inDASHBOARD_PORT_RANGE_START..ENDis bound.findDashboardForwardOwnerimport/re-export fromsrc/lib/onboard.tsto keep the entrypoint net-neutral per theonboard-entrypoint-budgetworkflow (the only remaining caller isdashboard-port.test.ts, which imports it directly from the module).Bug reproduction (pre-fix)
```
$ for p in $(seq 18789 18799); do python3 -m http.server "$p" &>/dev/null & done
$ nemoclaw onboard --non-interactive --yes-i-accept-third-party-software --name overflow-test ...
[1/8] Preflight checks ← passes
[2/8] Starting OpenShell gateway ← runs anyway
[3/8] Configuring inference (NIM) ← runs anyway
Inference options menu … ← reaches the wizard
```
Behavior post-fix (same setup)
```
[1/8] Preflight checks
✓ Docker is running … ✓ Memory OK …
All dashboard ports in range 18789-18799 are occupied:
18789 → non-OpenShell host listener
18790 → non-OpenShell host listener
...
18799 → non-OpenShell host listener
Free a sandbox or use --control-ui-port with a port outside this range.
```
Exit 1. `[2/8] Starting OpenShell gateway` is never printed.
Why the helper is sound at this stage
findAvailableDashboardPortneedsopenshell forward list(so it can distinguish "this port is bound by an OpenShell forward that belongs to this sandbox" from "host listener"). The preflight helper deliberately skips that distinction and treats every bound port as a non-OpenShell listener.--control-ui-port <N>or freeing a sandbox).Test plan
Fixes #3953.
Signed-off-by: Shawn Xie shaxie@nvidia.com
Summary by CodeRabbit
New Features
Tests