Skip to content

ci: Build RL main on Azure#2145

Open
chtruong814 wants to merge 2 commits intomainfrom
chtruong/build-azure-main
Open

ci: Build RL main on Azure#2145
chtruong814 wants to merge 2 commits intomainfrom
chtruong/build-azure-main

Conversation

@chtruong814
Copy link
Contributor

@chtruong814 chtruong814 commented Mar 24, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated CI/CD workflow configuration to adjust build environment settings and container build references.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 24, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the CI:docs Run doctest label Mar 24, 2026
@github-actions github-actions bot added the CI Relating to CI label Mar 24, 2026
@chtruong814
Copy link
Contributor Author

/ok to test

@chtruong814 chtruong814 changed the title debug: Build RL main on Azure ci: Build RL main on Azure Mar 24, 2026
@chtruong814 chtruong814 added CI:L0 Run doctests and unit tests and removed CI:docs Run doctest labels Mar 24, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

Updated GitHub Actions workflow inputs in the CI/CD pipeline to use non-NVIDIA variable values for runner prefix, test data path, and registry. Changed the build-container workflow's build reference from a pre-flight output to a literal main branch reference.

Changes

Cohort / File(s) Summary
GitHub Actions Workflow Configuration
.github/workflows/cicd-main.yml
Modified org-member-pre-flight workflow inputs to reference vars.NON_NVIDIA_* variables instead of vars.DEFAULT_* for runner configuration and registry settings. Changed build-container workflow to use literal main branch instead of dynamically sourced test_sha output.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR contains major infrastructure changes routing CI/CD to Azure with 496 file changes but empty description and no testing documentation or validation evidence. Revert debug changes or provide comprehensive testing documentation including Azure build success rates, performance metrics, and confirmation all NeMo RL tests pass.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title 'ci: Build RL main on Azure' is clearly related to the main change: updating CI/CD workflows to build on Azure infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chtruong/build-azure-main

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/cicd-main.yml (1)

289-302: ⚠️ Potential issue | 🔴 Critical

Hardcoded build-ref: main breaks PR container builds.

Changing build-ref from ${{ needs.pre-flight.outputs.test_sha }} to the literal main means the container will always be built from the main branch, regardless of which PR triggered the workflow.

This creates two problems:

  1. PR changes won't be tested: The container won't include code from the PR branch
  2. Inconsistency with build-arg: Line 302 still uses NEMO_RL_COMMIT=${{ needs.pre-flight.outputs.test_sha }}, so the recorded commit won't match the actual code in the container
Suggested fix
-      build-ref: main
+      build-ref: ${{ needs.pre-flight.outputs.test_sha }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cicd-main.yml around lines 289 - 302, The workflow
hardcodes build-ref: main which prevents PR containers from being built from the
PR commit; update the uses block so build-ref uses the pre-flight output
(replace the literal "main" with ${{ needs.pre-flight.outputs.test_sha }}) to
match the NEMO_RL_COMMIT build-arg and ensure the container is built from the
same commit as the PR; verify the symbols affected are build-ref and
NEMO_RL_COMMIT in the uses: NVIDIA-NeMo/FW-CI-templates invocation so both refer
to ${{ needs.pre-flight.outputs.test_sha }} for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/cicd-main.yml:
- Around line 176-185: The workflow job org-member-pre-flight mistakenly sets
default_runner_prefix, default_test_data_path, and default_registry to the
NON_NVIDIA_* variables, which overrides the default (NVIDIA) infrastructure for
all CI runs; revert those three inputs to their intended default values (remove
assignment to NON_NVIDIA_*), or implement a safe toggle (workflow input or
separate debug workflow) so that only the non_nvidia_* inputs use the
NON_NVIDIA_* vars; ensure org-member-pre-flight continues to pass
non_nvidia_runner_prefix/non_nvidia_test_data_path/non_nvidia_registry as the
non-NVIDIA override while keeping default_* pointed to the original default
values.

---

Outside diff comments:
In @.github/workflows/cicd-main.yml:
- Around line 289-302: The workflow hardcodes build-ref: main which prevents PR
containers from being built from the PR commit; update the uses block so
build-ref uses the pre-flight output (replace the literal "main" with ${{
needs.pre-flight.outputs.test_sha }}) to match the NEMO_RL_COMMIT build-arg and
ensure the container is built from the same commit as the PR; verify the symbols
affected are build-ref and NEMO_RL_COMMIT in the uses:
NVIDIA-NeMo/FW-CI-templates invocation so both refer to ${{
needs.pre-flight.outputs.test_sha }} for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8d5c072f-e2e2-4d6e-8684-3ddfda32532a

📥 Commits

Reviewing files that changed from the base of the PR and between 71e7d71 and 133c6ae.

📒 Files selected for processing (1)
  • .github/workflows/cicd-main.yml

Comment on lines 176 to 185
org-member-pre-flight:
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@v0.80.1
with:
default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }}
default_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
default_test_data_path: ${{ vars.DEFAULT_TEST_DATA_PATH }}
default_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }}
non_nvidia_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }}
default_registry: ${{ vars.DEFAULT_CONTAINER_REGISTRY }}
default_registry: ${{ vars.NON_NVIDIA_CONTAINER_REGISTRY }}
non_nvidia_registry: ${{ vars.NON_NVIDIA_CONTAINER_REGISTRY }}
sso_users_filename: ${{ vars.SSO_USERS_FILENAME }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Debug configuration should not be merged to main.

Setting default_runner_prefix, default_test_data_path, and default_registry to NON_NVIDIA_* values makes them identical to the non-NVIDIA parameters. This forces all CI runs (including NVIDIA org members) to use Azure infrastructure, eliminating the distinction between default and non-NVIDIA paths.

Given the PR title includes "debug:", this appears to be a temporary testing configuration. If merged, this will permanently route all CI to non-NVIDIA infrastructure, which is likely unintended for production.

If this is intentional for debugging, consider:

  1. Using a separate workflow file or branch for Azure testing
  2. Adding a workflow input to toggle infrastructure selection
  3. Reverting these changes before merging
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cicd-main.yml around lines 176 - 185, The workflow job
org-member-pre-flight mistakenly sets default_runner_prefix,
default_test_data_path, and default_registry to the NON_NVIDIA_* variables,
which overrides the default (NVIDIA) infrastructure for all CI runs; revert
those three inputs to their intended default values (remove assignment to
NON_NVIDIA_*), or implement a safe toggle (workflow input or separate debug
workflow) so that only the non_nvidia_* inputs use the NON_NVIDIA_* vars; ensure
org-member-pre-flight continues to pass
non_nvidia_runner_prefix/non_nvidia_test_data_path/non_nvidia_registry as the
non-NVIDIA override while keeping default_* pointed to the original default
values.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 requested a review from a team as a code owner March 24, 2026 02:49
@chtruong814
Copy link
Contributor Author

/ok to test

@chtruong814 chtruong814 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:L0 Run doctests and unit tests labels Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) CI Relating to CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant