Skip to content

fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980

Open
nvshaxie wants to merge 6 commits into
mainfrom
fix/dashboard-port-preflight-3953
Open

fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980
nvshaxie wants to merge 6 commits into
mainfrom
fix/dashboard-port-preflight-3953

Conversation

@nvshaxie
Copy link
Copy Markdown
Contributor

@nvshaxie nvshaxie commented May 21, 2026

Summary

  • findAvailableDashboardPort already raises the right "All dashboard ports in range 18789-18799 are occupied …" error when every port is held, but it only fires in late onboarding (after preflight + gateway start + inference selection). Reporters expect the failure to be upfront.
  • Add a narrow preflightDashboardPortRangeAvailability() helper in src/lib/onboard/dashboard-port.ts that only checks host bindings (safe to run before OpenShell is reachable) and call it at the end of preflight(). Exits 1 with the same canonical message when every port in DASHBOARD_PORT_RANGE_START..END is bound.
  • Drop the unused findDashboardForwardOwner import/re-export from src/lib/onboard.ts to keep the entrypoint net-neutral per the onboard-entrypoint-budget workflow (the only remaining caller is dashboard-port.test.ts, which imports it directly from the module).

Bug reproduction (pre-fix)

```
$ for p in $(seq 18789 18799); do python3 -m http.server "$p" &>/dev/null & done
$ nemoclaw onboard --non-interactive --yes-i-accept-third-party-software --name overflow-test ...
[1/8] Preflight checks ← passes
[2/8] Starting OpenShell gateway ← runs anyway
[3/8] Configuring inference (NIM) ← runs anyway
Inference options menu … ← reaches the wizard
```

Behavior post-fix (same setup)

```
[1/8] Preflight checks
✓ Docker is running … ✓ Memory OK …
All dashboard ports in range 18789-18799 are occupied:
18789 → non-OpenShell host listener
18790 → non-OpenShell host listener
...
18799 → non-OpenShell host listener
Free a sandbox or use --control-ui-port with a port outside this range.
```

Exit 1. `[2/8] Starting OpenShell gateway` is never printed.

Why the helper is sound at this stage

  • findAvailableDashboardPort needs openshell forward list (so it can distinguish "this port is bound by an OpenShell forward that belongs to this sandbox" from "host listener"). The preflight helper deliberately skips that distinction and treats every bound port as a non-OpenShell listener.
  • That is correct here: if every port in the range is bound (either by external processes OR by other sandboxes), the new sandbox cannot use any of them anyway, and operator intervention is required either way (--control-ui-port <N> or freeing a sandbox).

Test plan

  • New `describe("preflightDashboardPortRangeAvailability ([NemoClaw][All Platforms][Onboard] Onboard does not fail fast when all dashboard ports 18789-18799 are occupied by external processes #3953)")` block in `src/lib/onboard/dashboard-port.test.ts` with 3 cases (every port bound → exit 1 with canonical message; 10/11 bound + 1 free → no exit; no ports bound → no exit). Uses injected stubs for `isPortBoundOnHost` and `process.exit`.
  • `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass.
  • Manual on Ubuntu 24.04 x86_64: occupied 18789-18799 with `python3 -m http.server`, ran `node bin/nemoclaw.js onboard --non-interactive ...`. Exits at the end of `[1/8] Preflight checks` with the canonical message and the per-port list; `[2/8] Starting OpenShell gateway` never prints.
  • `src/lib/onboard.ts` diff stat: `+2/-2` (net 0), within the entrypoint budget.

Fixes #3953.

Signed-off-by: Shawn Xie shaxie@nvidia.com

Summary by CodeRabbit

  • New Features

    • Preflight now detects when all dashboard ports are exhausted and blocks startup with a clear error showing per-port details and guidance to specify an alternate control UI port. This runs on normal start and when resuming, but only blocks if no explicit control UI port was provided.
  • Tests

    • Added tests for fully occupied, partially available, and completely free port-range scenarios for the new preflight check.

Review Change Stack

…upied (#3953)

`findAvailableDashboardPort` already raises the right error
("All dashboard ports in range 18789-18799 are occupied …") when every
port is held, but it only runs late in onboarding — after preflight,
gateway start, and inference selection have already had side effects.
A user with all 11 dashboard ports held by external processes
(python3, etc.) saw onboard proceed all the way to the inference menu
before any port-related failure surfaced, contradicting the contract
in the docs / test plan.

Add a narrow `preflightDashboardPortRangeAvailability()` helper to
`src/lib/onboard/dashboard-port.ts` and call it from `preflight()`
right before it returns the GPU detection. The helper only checks
host bindings (it does not need OpenShell forward state, so it is
safe before gateway start). When every port in
`DASHBOARD_PORT_RANGE_START..END` is bound, it prints the same
canonical message and exits 1.

To keep `src/lib/onboard.ts` net-neutral per the
`onboard-entrypoint-budget` workflow, the unused
`findDashboardForwardOwner` import / re-export is dropped (no
callers outside the colocated unit test, which imports it directly
from `./onboard/dashboard-port`). Net delta on `onboard.ts`: +2/-2.

Test plan
- Added 3 unit tests in `src/lib/onboard/dashboard-port.test.ts`
  with injected `isPortBoundOnHost`/`process.exit` stubs:
  * exits 1 with the canonical message when every port in the range
    is bound; stderr lists each port → non-OpenShell host listener.
  * returns without exiting when 10 of 11 are bound and 1 is free.
  * returns without exiting when no port is bound.
- `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass.
- Manual end-to-end on Ubuntu 24.04 x86_64:
  * Occupy 18789-18799 with `python3 -m http.server` placeholders.
  * `nemoclaw onboard --non-interactive ...` now exits inside
    `[1/8] Preflight checks` with
    `All dashboard ports in range 18789-18799 are occupied:`
    followed by one line per occupied port. `[2/8] Starting OpenShell
    gateway` and `[3/8] Configuring inference` are never printed.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@nvshaxie nvshaxie requested review from cv and ericksoa May 21, 2026 07:09
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed8115aa-1a20-43fe-b829-7d2f7b6e5e10

📥 Commits

Reviewing files that changed from the base of the PR and between ec7e315 and a0cc430.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Adds preflightDashboardPortRangeAvailability to scan dashboard ports 18789–18799; it returns early if any port is free, otherwise prints per-port diagnostics and exits(1). Onboard imports and invokes this check only when no explicit control UI port is set (including NEMOCLAW_DASHBOARD_PORT), and the test suite adds cases for full, partial, and empty occupancy.

Changes

Dashboard Port Exhaustion Preflight

Layer / File(s) Summary
Dashboard port exhaustion check implementation and tests
src/lib/onboard/dashboard-port.ts, src/lib/onboard/dashboard-port.test.ts
preflightDashboardPortRangeAvailability iterates the dashboard port range with an injectable bound check, returns early if any port is free, and emits a per-port error + non-zero exit when all ports are occupied. Tests cover full exhaustion (exit+error), partial availability, and no-bound scenarios.
Onboard preflight integration
src/lib/onboard.ts
Imports and conditionally invokes preflightDashboardPortRangeAvailability() during preflight when no explicit --control-ui-port or NEMOCLAW_DASHBOARD_PORT is set, updates _preflightDashboardPort initialization, and removes findDashboardForwardOwner from module.exports.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Networking, Sandbox

Suggested reviewers

  • ericksoa

Poem

🐰
I hopped the ports from nine to nine,
I sniffed and tapped each hollow line.
If every gate returns a lock,
I call the halt — I stop the clock.
Now onboarding hears my sign.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: introducing fail-fast preflight detection when all dashboard ports are occupied.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from issue #3953: adds preflightDashboardPortRangeAvailability() to detect occupied ports, implements fail-fast with canonical error message listing each port, exits with status 1, and includes comprehensive tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the linked issue: new preflight helper function, integration into onboard flow with conditional guards, comprehensive test coverage, and removal of unused import to maintain budget neutrality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dashboard-port-preflight-3953

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

E2E Advisor Recommendation

Required E2E: onboard-negative-paths-e2e, onboard-resume-e2e, double-onboard-e2e
Optional E2E: cloud-onboard-e2e, sandbox-operations-e2e

Dispatch hint: onboard-negative-paths-e2e,onboard-resume-e2e,double-onboard-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • onboard-negative-paths-e2e (high): Exercises non-interactive onboard failure paths and friendly port/preflight conflict handling. This PR changes fail-fast behavior in onboard preflight, so the negative-path suite should block merge.
  • onboard-resume-e2e (medium): The PR explicitly changes the cached-preflight resume path to call dashboard port range availability checks; run the resume suite to ensure interrupted onboard recovery still completes correctly.
  • double-onboard-e2e (high): Provides the strongest existing live coverage for dashboard port allocation across repeated/multiple onboard runs, distinct dashboard forwards, and stored-port recovery, all adjacent to the changed dashboard-port module.

Optional E2E

  • cloud-onboard-e2e (medium): Useful broad smoke for install → cloud onboard → sandbox/dashboard/inference setup after changing core onboard preflight, but less targeted than the required port/resume/lifecycle suites.
  • sandbox-operations-e2e (high): Optional additional confidence for sandbox lifecycle and gateway/dashboard reachability after port-forwarding logic changes.

New E2E recommendations

  • dashboard-port-preflight (high): Existing E2E coverage does not appear to bind every dashboard port in 18789-18799 and assert that nemoclaw onboard --non-interactive exits during preflight with the canonical 'All dashboard ports in range' message before starting the gateway or creating a sandbox.
    • Suggested test: Add an onboard negative-path E2E phase that starts host listeners on 18789-18799, runs non-interactive onboard without --control-ui-port/NEMOCLAW_DASHBOARD_PORT, and verifies exit 1, the canonical remediation text, and no gateway/sandbox side effects.
  • dashboard-port-resume-preflight (medium): The changed resume branch is only indirectly covered by existing resume success E2E; there is no targeted E2E proving cached-preflight resume fails fast when the dashboard range becomes exhausted after the initial interrupted run.
    • Suggested test: Add a resume regression E2E that creates a resumable session with preflight complete, binds 18789-18799, runs nemoclaw onboard --resume --non-interactive, and asserts the same fail-fast dashboard port exhaustion behavior without rerunning gateway/sandbox setup.
  • dashboard-port-explicit-env (medium): The PR changes NEMOCLAW_DASHBOARD_PORT handling to skip default-range preflight gating for an explicit env-configured dashboard port; existing E2E searches show only cleanup/default references, not an explicit env override assertion.
    • Suggested test: Add an E2E assertion that with 18789-18799 occupied and NEMOCLAW_DASHBOARD_PORT set to a free outside-range port, onboard proceeds past the new range preflight and uses the explicit dashboard port.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: onboard-negative-paths-e2e,onboard-resume-e2e,double-onboard-e2e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: a0cc4305b16ccab1c5e1e1e43e8c58d811ebe5b1
Findings: 1 blocker(s), 3 warning(s), 1 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review is based on supplied trusted metadata and read-only diff inspection; tests, package-manager commands, and PR scripts were not executed.; PR titles, bodies, comments, bot output, and branch names were treated as untrusted evidence unless corroborated by diff/status metadata.; The E2E Advisor comment was found and parsed, but the status rollup did not show the required E2E job names as passed for head SHA a0cc430.; Full runtime behavior, including absence of gateway/sandbox side effects, resume behavior, and explicit dashboard port override behavior, still requires E2E or maintainer-verified equivalent evidence.; Linked issue acceptance mapping includes literal issue body clauses and the supplied issue comment; environment-specific and manual reproduction clauses remain unknown or partial where not corroborated by automated evidence.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: a0cc4305b16ccab1c5e1e1e43e8c58d811ebe5b1
Recommendation: blocked
Confidence: high

The targeted port-exhaustion fix is small and directionally correct, but mergeability is BLOCKED and E2E Advisor-required onboard jobs are not evidenced as passed for this head SHA.

Gate status

  • CI: pass — 5 required status context(s) completed with no failures for head SHA a0cc430. GraphQL rollup also shows checks, commit-lint, dco-check, check-hash, changes, onboard-entrypoint-budget, CodeQL, ShellCheck, cli-parity, macos-e2e, test-e2e-ollama-proxy, and E2E recommendation completed successfully; PR review advisor was in progress at collection time.
  • Mergeability: fail — mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for headRefOid a0cc430.
  • Review threads: pass — 2 review thread(s), all resolved. CodeRabbit findings for explicit --control-ui-port handling and onboard-entrypoint-budget were marked addressed through a0cc430.
  • Risky code tested: warning — Risky areas detected: onboarding/host glue. Unit tests were added for the new helper, but runtime/sandbox/infrastructure paths still require E2E evidence for the current head SHA.

🔴 Blockers

  • Mergeability is blocked for the current head SHA: The PR cannot be considered merge-ready while GitHub reports the merge state as BLOCKED and reviewDecision as REVIEW_REQUIRED. This is a hard gate independent of the code-level review.
    • Recommendation: Resolve the repository merge gate requirements, including required human review, and re-check mergeability for a0cc430 before merge.
    • Evidence: Trusted GitHub GraphQL metadata: mergeStateStatus=BLOCKED, reviewDecision=REVIEW_REQUIRED, headRefOid=a0cc4305b16ccab1c5e1e1e43e8c58d811ebe5b1.

🟡 Warnings

  • E2E Advisor-required onboard jobs are not evidenced as passed for this head SHA (src/lib/onboard.ts:3848): This PR changes onboard preflight behavior before gateway/sandbox creation and the resume cached-preflight branch. The E2E Advisor required onboard-negative-paths-e2e, onboard-resume-e2e, and double-onboard-e2e, but the status rollup provided for a0cc430 does not show checks with those exact job names as successful.
    • Recommendation: Verify that onboard-negative-paths-e2e, onboard-resume-e2e, and double-onboard-e2e have run and passed for a0cc430, or document an explicit maintainer-accepted mapping if these jobs appear under different check names.
    • Evidence: E2E Advisor comment required onboard-negative-paths-e2e, onboard-resume-e2e, double-onboard-e2e. GraphQL rollup for a0cc430 lists E2E recommendation, macos-e2e, and test-e2e-ollama-proxy, but not those required job names.
  • Unit tests do not prove full onboard side-effect boundary (src/lib/onboard/dashboard-port.test.ts:104): The added tests validate preflightDashboardPortRangeAvailability in isolation with injected port-bound and exit stubs. They do not run the full nemoclaw onboard path, so they do not prove that step 2/gateway startup and sandbox creation are skipped, that resume cached-preflight behavior fails fast, or that NEMOCLAW_DASHBOARD_PORT / --control-ui-port bypasses the default-range scan in integrated execution.
    • Recommendation: Add or verify E2E/integration coverage that binds 18789-18799, runs onboard without explicit dashboard port and asserts exit 1 before gateway/sandbox side effects; add/verify resume and explicit env/flag override cases recommended by the E2E Advisor.
    • Evidence: dashboard-port.test.ts directly invokes preflightDashboardPortRangeAvailability(allBound, exitFn), partial/free stubs, and console.error capture. The changed full onboard call sites are in src/lib/onboard.ts normal preflight and resume cached-preflight branches.
  • High active overlap on onboard.ts increases drift risk (src/lib/onboard.ts:3848): The patched code still exists and the diff is small, but src/lib/onboard.ts is a large active monolith with many concurrent open PRs touching the same file, including onboard preflight/gateway extraction work. This raises sequencing and rebase risk around where this fail-fast check belongs.
    • Recommendation: Before merge, re-check against latest main and nearby onboard preflight/gateway refactors to ensure the guarded calls remain in the intended pre-gateway locations in both normal and resume paths.
    • Evidence: Trusted overlap metadata lists many open PRs touching src/lib/onboard.ts, including refactor(cli): extract onboard preflight handler #3868 extract onboard preflight handler and refactor(cli): extract onboard gateway handler #3870 extract onboard gateway handler. Current monolith line count is 10271 with net-zero delta in this PR.

🔵 Suggestions

  • Resume branch change is compressed onto an existing statement (src/lib/onboard.ts:9353): The resume cached-preflight branch appends the new dashboard-port preflight call after validateSandboxGpuPreflight on the same line. This stays within the entrypoint budget, but it reduces readability in an already large control-flow block.
    • Recommendation: If line-budget rules allow, prefer splitting validateSandboxGpuPreflight(resumeSandboxGpuConfig); and the guarded dashboard-port check onto separate lines or moving the helper call into a small extracted function in a follow-up refactor.
    • Evidence: Diff shows validateSandboxGpuPreflight(resumeSandboxGpuConfig); if (_preflightDashboardPort === null) preflightDashboardPortRangeAvailability(); // #3953 — resume must mirror non-resume fail-fast on one line.

Acceptance coverage

  • met — When all dashboard ports in the default 18789–18799 range are held by external processes, nemoclaw onboard continues instead of failing fast with a clear "all dashboard ports occupied" error.: src/lib/onboard/dashboard-port.ts adds preflightDashboardPortRangeAvailability(), which scans DASHBOARD_PORT_RANGE_START..END and exits 1 with an all-ports-occupied error if every port is bound; src/lib/onboard.ts calls it during normal preflight when no explicit dashboard port is configured.
  • met — Component area: Onboard (preflight, dashboard port allocation).: Changed files are src/lib/onboard.ts and src/lib/onboard/dashboard-port.ts/test.ts, directly covering onboard preflight integration and dashboard port allocation behavior.
  • unknown — Host: DGX Spark (NVIDIA GB10, FastOS), aarch64: No trusted automated evidence shows validation on DGX Spark/FastOS/aarch64. PR body claims manual Ubuntu 24.04 x86_64 verification, which is not the same environment.
  • unknown — nemoclaw: v0.0.46: The patch targets current source at a0cc430, not a validation run against the historical v0.0.46 package.
  • unknown — openshell: 0.0.39: The new preflight helper deliberately does not call OpenShell. No supplied status evidence validates behavior specifically with OpenShell 0.0.39.
  • unknown — Docker: Docker CE; nvidia user in docker group; docker ps empty before test.: No supplied automated test evidence verifies this Docker precondition; helper tests are stub-based.
  • unknown — nemoclaw list

(no sandboxes): No supplied automated test evidence runs nemoclaw list or verifies an empty sandbox registry before the scenario.

  • unknown — docker ps

(empty): No supplied automated test evidence runs docker ps or verifies an empty Docker state before the scenario.

  • partial — 1. Start placeholder servers on all 11 ports:: Unit tests simulate all ports as bound via an injected allBound stub; they do not start real placeholder servers.
  • partial — for p in $(seq 18789 18799); do
    python3 -m http.server "$p" &> /dev/null &
    done: PR body reports manual verification with python3 http.server placeholders, but the added automated test uses a stubbed bound check instead of spawning real listeners.
  • partial — 2. Verify ports:: Production code uses isPortBoundOnHost via lsof/sudo lsof/bind probe, but added automated tests stub the port-bound check and do not verify real listener state.
  • partial — ss -ltn | grep -E ':1879[0-9]'

LISTEN lines for 18789–18799, all external (python3): The helper treats every bound port as non-OpenShell host listener, but no automated test executes ss or validates python3 process ownership.

  • partial — 3. Run onboarding:: src/lib/onboard.ts integrates the helper into the onboard preflight path, but the added automated tests call the helper directly rather than running the full onboard command.
  • partial — nemoclaw onboard --name overflow-test: No added automated test runs this full command. PR body reports manual CLI verification, which remains untrusted unless corroborated by CI/E2E artifacts.
  • partial — 4. Observe the preflight and subsequent steps.: The normal preflight call is placed before the Step 2 gateway section, and the resume path mirrors it after cached GPU validation; however, added tests do not assert step output ordering or absence of later-step side effects.
  • met — Preflight should detect that all candidate dashboard ports (18789–18799) are occupied by non-OpenShell processes.: preflightDashboardPortRangeAvailability scans 18789 through 18799 and exits only when every isPortBoundCheck call returns true, labelling each bound port as non-OpenShell host listener.
  • partialnemoclaw onboard should abort before the inference step with a fatal error similar to:: The helper is invoked at the end of preflight before gateway/inference setup in the normal path and in resume cached-preflight, but no full onboard automated test proves abort ordering before inference.
  • met — All dashboard ports in range 18789-18799 are occupied:: dashboard-port.test.ts asserts stderr matches All dashboard ports in range 18789-18799 are occupied:.
  • partial — - 18789 → non-OpenShell host listener: Implementation emits 18789 → non-OpenShell host listener without the dash prefix shown in the issue example; test asserts the implemented semantic content for 18789.
  • partial — - 18790 → non-OpenShell host listener: Implementation emits a line for every bound port including 18790, but tests explicitly assert only the first and last port lines.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, tokens, passwords, key files, credential JSON, or secret logging were added. Changes are limited to dashboard port preflight probing, onboard integration, exports, and tests.
  • pass — 2. Input Validation and Data Sanitization: No new untrusted string parsing, deserialization, eval, URL parsing, shell-string construction, SQL/XSS/path traversal surface, or SSRF-sensitive handling is introduced. The scanned ports are fixed numeric constants; the explicit port override path is guarded by existing option/env resolution.
  • pass — 3. Authentication and Authorization: No endpoints, auth checks, authorization decisions, token validation, scopes, or user/resource access controls are modified.
  • pass — 4. Dependencies and Third-Party Libraries: No dependencies, package manager files, registries, or third-party versions are added or changed.
  • pass — 5. Error Handling and Logging: The new error path intentionally reports local dashboard port occupancy and remediation guidance. It does not log credentials, tokens, PII, stack traces, or internal sensitive state. process.exit is injectable for tests.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, hashing for security, encryption, credential storage, or data protection mechanisms are changed.
  • warning — 7. Configuration and Security Headers: The change affects runtime configuration semantics for dashboard port selection. The current diff guards the default-range preflight when --control-ui-port or NEMOCLAW_DASHBOARD_PORT is present, but integrated E2E evidence for explicit env/flag bypass is not shown for this head SHA. No HTTP security headers are involved.
  • warning — 8. Security Testing: Unit tests cover all-bound, partially-bound, and unbound helper behavior. Runtime security boundary behavior still needs E2E evidence: failure before gateway/sandbox side effects, resume cached-preflight behavior, and explicit NEMOCLAW_DASHBOARD_PORT/--control-ui-port bypass.
  • warning — 9. Holistic Security Posture: Fail-fast preflight improves operational safety by avoiding unnecessary gateway/sandbox work when the dashboard range is exhausted. Remaining risk is coverage/gate related: required E2E jobs are missing for the current head SHA, and the onboard.ts monolith has high active overlap that can cause drift.

Test / E2E status

  • Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage because src/lib/onboard.ts and src/lib/onboard/dashboard-port.ts alter preflight behavior before gateway start and in resume cached-preflight. Unit tests validate the helper in isolation but cannot prove full CLI sequencing, absence of gateway/sandbox side effects, or explicit dashboard port bypass behavior.
  • E2E Advisor: missing
  • Required E2E jobs: onboard-negative-paths-e2e, onboard-resume-e2e, double-onboard-e2e
  • Missing for analyzed SHA: onboard-negative-paths-e2e, onboard-resume-e2e, double-onboard-e2e

✅ What looks good

  • The change patches code that still exists on the current head and keeps the large onboard entrypoint net-zero in line count; onboard-entrypoint-budget passed.
  • The prior explicit --control-ui-port concern is addressed by guarding preflightDashboardPortRangeAvailability with _preflightDashboardPort === null, and NEMOCLAW_DASHBOARD_PORT is treated as an explicit dashboard-port signal.
  • The resume cached-preflight path now mirrors the normal preflight fail-fast behavior, closing a likely bypass of the new check.
  • The helper is narrow and dependency-injectable, which enables deterministic unit tests without starting real listeners.
  • No new dependencies, Dockerfiles, installer logic, workflow trusted-code boundary changes, credentials handling changes, SSRF-sensitive URL parsing, or network policy changes were introduced.

Review completeness

  • Review is based on supplied trusted metadata and read-only diff inspection; tests, package-manager commands, and PR scripts were not executed.
  • PR titles, bodies, comments, bot output, and branch names were treated as untrusted evidence unless corroborated by diff/status metadata.
  • The E2E Advisor comment was found and parsed, but the status rollup did not show the required E2E job names as passed for head SHA a0cc430.
  • Full runtime behavior, including absence of gateway/sandbox side effects, resume behavior, and explicit dashboard port override behavior, still requires E2E or maintainer-verified equivalent evidence.
  • Linked issue acceptance mapping includes literal issue body clauses and the supplied issue comment; environment-specific and manual reproduction clauses remain unknown or partial where not corroborated by automated evidence.
  • Human maintainer review required: yes

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 3844: preflightDashboardPortRangeAvailability() is being called
unconditionally which causes failures even when the user provided an explicit
--control-ui-port; update the logic so the range-exhaustion check runs only when
no explicit control UI port override is present — detect the override (the CLI
option or config value that represents the explicit port, e.g. controlUiPort /
controlUiPortOverride / options.controlUiPort or the corresponding env var used
earlier in the file) and skip calling preflightDashboardPortRangeAvailability()
when that override is set, leaving the existing explicit-port path intact (the
explicit-port handling you preserved earlier) so the onboarding respects
--control-ui-port inputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a75a1aed-7ee0-41b2-be45-486dcc29bbfe

📥 Commits

Reviewing files that changed from the base of the PR and between 18c7265 and a916031.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/dashboard-port.test.ts
  • src/lib/onboard/dashboard-port.ts

Comment thread src/lib/onboard.ts Outdated
@wscurran wscurran added bug Something isn't working fix NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). labels May 21, 2026
@wscurran
Copy link
Copy Markdown
Contributor

…ort is set

Addresses CodeRabbit on #3980. The previous commit called
preflightDashboardPortRangeAvailability() unconditionally at the end
of preflight(), which broke the explicit --control-ui-port path: a
user who passes `--control-ui-port 3000` to escape the default range
would still hit the "All dashboard ports in range 18789-18799 are
occupied" exit, even though that range is no longer the one we'll
use.

Guard the call on `_preflightDashboardPort === null` so the new
fail-fast only applies to the auto-allocation path. The explicit-port
path is already validated separately above (lines ~3670-3700, behind
the same `dashboardPortToCheck` flag).

Verified end-to-end: with 18789-18799 held by python3 placeholders,
`nemoclaw onboard --control-ui-port 23000` no longer prints the
range-exhaustion error in preflight; without the flag, the existing
fail-fast behavior is unchanged.

onboard.ts diff stays at +2/-2 net 0, within the entrypoint budget.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@nvshaxie
Copy link
Copy Markdown
Contributor Author

Thanks @coderabbitai — applied your suggested guard in the followup commit. preflightDashboardPortRangeAvailability() is now only called when _preflightDashboardPort === null (i.e. no --control-ui-port override). The explicit-port path on lines ~3670-3700 already validates the chosen port separately, so no double-checking is needed.

Verified end-to-end on Ubuntu 24.04 with 18789-18799 held by python3 -m http.server placeholders:

```
$ nemoclaw onboard --control-ui-port 23000 --non-interactive ...
[1/8] Preflight checks
✓ Memory OK …
[2/8] Starting OpenShell gateway ← proceeds past preflight (would have exited before)

```

Without the flag, the existing fail-fast behavior is unchanged (still prints All dashboard ports in range 18789-18799 are occupied: and exits 1 before step 2).

onboard.ts diff stays at +2/-2, net 0, within the entrypoint budget.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Tip

For best results, initiate chat on the files or code changes.

@nvshaxie: The guard makes perfect sense — skipping the range check when an explicit --control-ui-port is provided avoids redundant validation and keeps the explicit-port path self-contained. The end-to-end verification on Ubuntu 24.04 with all 18789–18799 ports occupied confirms both the fail-fast path and the bypass path behave correctly. Looks good! 🐇 ✨


✏️ Learnings added
Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T01:44:56.363Z
Learning: In `src/lib/onboard.ts`, `preflightDashboardPortRangeAvailability()` (imported from `./onboard/dashboard-port`) is guarded to only be called when `_preflightDashboardPort === null` (i.e., no `--control-ui-port` override was provided). The explicit-port validation path (around lines 3670–3700 of `onboard.ts`) handles its own port availability check separately, so calling `preflightDashboardPortRangeAvailability()` in that branch would be a double-check.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/lib/onboard.ts (3)

3545-3551: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t only release the default dashboard port during gateway cleanup.

This cleanup path tears down the gateway but only stops DASHBOARD_PORT. NemoClaw can be forwarding on an auto-allocated or explicit port, so those SSH forwards survive the cleanup and can later show up as bogus host listeners that block onboarding. The same default-port assumption is repeated in the other new cleanup call sites below.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3545 - 3551, The cleanup currently only
stops DASHBOARD_PORT (calling bestEffortForwardStop(runOpenshell,
DASHBOARD_PORT)), which leaves other NemoClaw forwards (auto-allocated or
explicit ports) active and can create stale host listeners; update the gateway
teardown path in the block using destroyGatewayForReuse and
bestEffortForwardStop to stop all active forwards instead of just DASHBOARD_PORT
— either by retrieving the gateway's list of forwarded ports and calling
bestEffortForwardStop for each, or by adding/using a helper (e.g.,
stopAllForwards or similar) that accepts runOpenshell/gatewayForwardState and
stops every forwarded port before calling destroyGatewayForReuse; ensure you
update the other new cleanup call sites that mirror this pattern so all
forwarded ports are cleaned up, referencing bestEffortForwardStop, runOpenshell,
DASHBOARD_PORT, and destroyGatewayForReuse to locate the changes.

8981-8981: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat NEMOCLAW_DASHBOARD_PORT as an explicit preflight override.

Line 8981 only seeds _preflightDashboardPort from opts.controlUiPort. If the operator sets NEMOCLAW_DASHBOARD_PORT=23000 and the default 18789-18799 range is full, this still takes the auto-allocation branch, runs the new range-exhaustion check, and exits even though 23000 is the intended dashboard port. The explicit-port path and the fail-fast skip need to use the resolved env override too.

💡 Suggested fix
-  _preflightDashboardPort = opts.controlUiPort || null;
+  const envDashboardPort =
+    process.env.NEMOCLAW_DASHBOARD_PORT != null ? DASHBOARD_PORT : null;
+  _preflightDashboardPort = opts.controlUiPort ?? envDashboardPort ?? null;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` at line 8981, The code only initializes
_preflightDashboardPort from opts.controlUiPort; treat the environment override
NEMOCLAW_DASHBOARD_PORT as an explicit preflight override by reading
process.env.NEMOCLAW_DASHBOARD_PORT (if present) when setting
_preflightDashboardPort so that the explicit-port path is taken; update the
dashboard-preflight logic that runs the range-exhaustion check (the
auto-allocation branch) to skip that check when _preflightDashboardPort is
resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT
so the operator-provided port (23000) is honored instead of triggering
range-exhaustion exit.

9328-9350: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resume still bypasses the new dashboard-range fail-fast.

When cached preflight is reused here, the new guard at Line 3827 never runs. If the default range becomes fully occupied after the initial attempt, --resume can still progress into later steps and only fail in dashboard allocation after sandbox work has started, which is exactly the late failure this PR is trying to move forward.

💡 Suggested fix
       assertCdiNvidiaGpuSpecPresent(assessHost(), resumeOptedOutGpuPassthrough);
       validateSandboxGpuPreflight(resumeSandboxGpuConfig);
+      if (_preflightDashboardPort === null) {
+        preflightDashboardPortRangeAvailability();
+      }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9328 - 9350, Resume path skips the new
dashboard-range fail-fast guard; ensure the same dashboard allocation check used
in the normal preflight path is invoked during resume so we fail fast on range
exhaustion. Modify the resumePreflight branch (the if (resumePreflight) block
that calls skippedStepMessage, resolveSandboxGpuConfig,
exitOnSandboxGpuConfigErrors, assertCdiNvidiaGpuSpecPresent, and
validateSandboxGpuPreflight) to call the dashboard-range fail-fast function used
by the non-resume flow (the same guard invoked outside this block) before
proceeding to sandbox GPU validation so resume cannot bypass dashboard
allocation checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3545-3551: The cleanup currently only stops DASHBOARD_PORT
(calling bestEffortForwardStop(runOpenshell, DASHBOARD_PORT)), which leaves
other NemoClaw forwards (auto-allocated or explicit ports) active and can create
stale host listeners; update the gateway teardown path in the block using
destroyGatewayForReuse and bestEffortForwardStop to stop all active forwards
instead of just DASHBOARD_PORT — either by retrieving the gateway's list of
forwarded ports and calling bestEffortForwardStop for each, or by adding/using a
helper (e.g., stopAllForwards or similar) that accepts
runOpenshell/gatewayForwardState and stops every forwarded port before calling
destroyGatewayForReuse; ensure you update the other new cleanup call sites that
mirror this pattern so all forwarded ports are cleaned up, referencing
bestEffortForwardStop, runOpenshell, DASHBOARD_PORT, and destroyGatewayForReuse
to locate the changes.
- Line 8981: The code only initializes _preflightDashboardPort from
opts.controlUiPort; treat the environment override NEMOCLAW_DASHBOARD_PORT as an
explicit preflight override by reading process.env.NEMOCLAW_DASHBOARD_PORT (if
present) when setting _preflightDashboardPort so that the explicit-port path is
taken; update the dashboard-preflight logic that runs the range-exhaustion check
(the auto-allocation branch) to skip that check when _preflightDashboardPort is
resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT
so the operator-provided port (23000) is honored instead of triggering
range-exhaustion exit.
- Around line 9328-9350: Resume path skips the new dashboard-range fail-fast
guard; ensure the same dashboard allocation check used in the normal preflight
path is invoked during resume so we fail fast on range exhaustion. Modify the
resumePreflight branch (the if (resumePreflight) block that calls
skippedStepMessage, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors,
assertCdiNvidiaGpuSpecPresent, and validateSandboxGpuPreflight) to call the
dashboard-range fail-fast function used by the non-resume flow (the same guard
invoked outside this block) before proceeding to sandbox GPU validation so
resume cannot bypass dashboard allocation checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fefc6d8c-a930-484a-b4e7-033502152492

📥 Commits

Reviewing files that changed from the base of the PR and between a916031 and 37d5531.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

… gate resume path

CodeRabbit pointed out two follow-ups on PR #3980's fail-fast preflight:

1. The env override NEMOCLAW_DASHBOARD_PORT pins an explicit dashboard
   port the same way --control-ui-port does, but only --control-ui-port
   was disabling the new range-exhaustion fail-fast. Operators pinning
   18900 with the default 18789-18799 range full would still exit on
   preflight even though their port was outside the range. Seed
   _preflightDashboardPort from the resolved env-override too.

2. The new fail-fast guard ran in the normal preflight path only; the
   --resume path reuses cached preflight and would walk past an
   exhausted range, only failing later in dashboard allocation after
   sandbox work has started — exactly the late failure this PR is
   trying to prevent. Mirror the guard in the resume branch.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvshaxie
Copy link
Copy Markdown
Contributor Author

Thanks @coderabbitai — addressed the two findings that scope to this PR's preflight change in commit ec7e315:

  • 8981NEMOCLAW_DASHBOARD_PORT env override now disables the range-exhaustion fail-fast the same way --control-ui-port does. Implemented as suggested via the resolved DASHBOARD_PORT constant from ./core/ports.
  • 9328-9350 — added the same if (_preflightDashboardPort === null) preflightDashboardPortRangeAvailability() guard in the --resume branch so cached preflight no longer skips it.

Skipping 3545-3551 (and the mirrored cleanup call sites) — only releasing DASHBOARD_PORT in cleanup is pre-existing behavior unrelated to the fail-fast added here. Happy to file it as a separate issue if that's the right move.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Tip

For best results, initiate chat on the files or code changes.

@nvshaxie: All three points make sense — great follow-through on the two in-scope fixes:

  • NEMOCLAW_DASHBOARD_PORT guard (line 8981): Using the resolved DASHBOARD_PORT constant from ./core/ports as the env-override signal is the right approach — it keeps the guard symmetric with the --control-ui-port path and avoids a raw string comparison.
  • --resume branch guard (lines 9328–9350): Adding the same if (_preflightDashboardPort === null) check there ensures cached-preflight runs are protected too. Good catch closing that gap.

On lines 3545–3551 — filing it as a separate issue is exactly the right call. Would you like me to open a follow-up issue in this repo to track the cleanup of those DASHBOARD_PORT release call sites?

🐇 ✨


✏️ Learnings added
Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: In `src/lib/onboard.ts`, the `preflightDashboardPortRangeAvailability()` range-exhaustion fail-fast is disabled not only when `--control-ui-port` is provided (`_preflightDashboardPort !== null`) but also when `NEMOCLAW_DASHBOARD_PORT` env override is set (detected via the resolved `DASHBOARD_PORT` constant from `./core/ports`). Both explicit-port paths skip the default-range check.

Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: In `src/lib/onboard.ts`, the `if (_preflightDashboardPort === null) preflightDashboardPortRangeAvailability()` guard is present in both the standard preflight branch and the `--resume` (cached-preflight) branch (around lines 9328–9350) so that range-exhaustion detection is not bypassed when onboarding resumes from a cached state.

Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: The DASHBOARD_PORT release/cleanup call sites around lines 3545–3551 of `src/lib/onboard.ts` represent pre-existing behavior unrelated to the fail-fast preflight change introduced in PR `#3980`. Any cleanup or refactor of those call sites should be tracked as a separate issue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8981-8985: The new const envDashboardPort is both wrong (uses
DASHBOARD_PORT instead of the env value) and adds lines; replace the two-line
block with a single expression that uses the env var directly and preserves the
precedence: set _preflightDashboardPort = opts.controlUiPort ??
(process.env.NEMOCLAW_DASHBOARD_PORT ?? null); do the same collapse for the
other occurrence that mirrors this logic (the block around the symbols at the
second location).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a30a6e9f-573d-4e98-a9a2-bf83c2497dc5

📥 Commits

Reviewing files that changed from the base of the PR and between 37d5531 and ec7e315.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated
cv and others added 2 commits May 21, 2026 19:09
Replace the verbose 4-line env-override comment block with a single
inline ternary on the assignment line, and pack the resume guard onto
the existing validateSandboxGpuPreflight call. Semantically identical
to ec7e315; just brings src/lib/onboard.ts back under the
onboard-entrypoint-budget gate (additions ≤ deletions).

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][All Platforms][Onboard] Onboard does not fail fast when all dashboard ports 18789-18799 are occupied by external processes

3 participants