Skip to content

ci: Enable AWS runners#1388

Merged
chtruong814 merged 13 commits intomainfrom
chtruong/migrate-aws
Jan 27, 2026
Merged

ci: Enable AWS runners#1388
chtruong814 merged 13 commits intomainfrom
chtruong/migrate-aws

Conversation

@chtruong814
Copy link
Copy Markdown
Contributor

@chtruong814 chtruong814 commented Jan 16, 2026

Description

ci: Enable AWS runners

The install test had to be written a little differently because running the GHA job directly in the container does not work yet on our self-hosted runners. So I had to rewrite to manually start the container and install from there.

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 requested a review from thomasdhc January 16, 2026 23:47
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. .github/actions/test-template/action.yml, line 90 (link)

    logic: references undefined step output steps.azure-fileshare.outputs.mnt_path - the azure-fileshare step was removed in this PR, so this variable will be empty and may cause issues

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment thread .github/workflows/install-test.yml Outdated
@@ -427,9 +409,9 @@ jobs:
all-install-import-test:
needs: [pre-flight]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: missing cuda-pre-flight in needs array - line 414 references needs.cuda-pre-flight.outputs.cuda_base_image but the job doesn't declare it as a dependency

Suggested change
needs: [pre-flight]
needs: [pre-flight, cuda-pre-flight]

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. .github/actions/test-template/action.yml, line 100 (link)

    logic: references deleted step output steps.azure-fileshare.outputs.mnt_path - this step was removed but the variable assignment remains

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. .github/actions/test-template/action.yml, line 100 (link)

    style: Dead code - references removed azure-fileshare step and MNT_PATH is never used

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. .github/actions/test-template/action.yml, line 100 (link)

    style: unused variable assignment referencing removed step

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. .github/actions/test-template/action.yml, line 100 (link)

    style: unused variable references removed step

    MNT_PATH references steps.azure-fileshare.outputs.mnt_path but the azure-fileshare step was removed. This line should be deleted.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +77 to +82
- name: Install uuidgen
shell: bash -x -e -u -o pipefail {0}
if: ${{ contains(inputs.runner, 'aws') }}
run: |
sudo apt update
sudo apt install -y cifs-utils

RESOURCE_GROUP_NAME="azure-gpu-vm-runner_group"
STORAGE_ACCOUNT_NAME="nemocistorageaccount2"
FILE_SHARE_NAME="fileshare"

MNT_ROOT="/media"
MNT_PATH="$MNT_ROOT/$STORAGE_ACCOUNT_NAME/$FILE_SHARE_NAME"

echo "MNT_PATH=$MNT_PATH" | tee -a "$GITHUB_OUTPUT"

sudo mkdir -p $MNT_PATH

# Create a folder to store the credentials for this storage account and
# any other that you might set up.
CREDENTIAL_ROOT="/etc/smbcredentials"
sudo mkdir -p "/etc/smbcredentials"

# Get the storage account key for the indicated storage account.
# You must be logged in with az login and your user identity must have
# permissions to list the storage account keys for this command to work.
STORAGE_ACCOUNT_KEY=$(az storage account keys list \
--resource-group $RESOURCE_GROUP_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--query "[0].value" --output tsv | tr -d '"')

# Create the credential file for this individual storage account
SMB_CREDENTIAL_FILE="$CREDENTIAL_ROOT/$STORAGE_ACCOUNT_NAME.cred"
if [ ! -f $SMB_CREDENTIAL_FILE ]; then
echo "username=$STORAGE_ACCOUNT_NAME" | sudo tee $SMB_CREDENTIAL_FILE > /dev/null
echo "password=$STORAGE_ACCOUNT_KEY" | sudo tee -a $SMB_CREDENTIAL_FILE > /dev/null
else
echo "The credential file $SMB_CREDENTIAL_FILE already exists, and was not modified."
fi

# Change permissions on the credential file so only root can read or modify the password file.
sudo chmod 600 $SMB_CREDENTIAL_FILE

# This command assumes you have logged in with az login
HTTP_ENDPOINT=$(az storage account show --resource-group $RESOURCE_GROUP_NAME --name $STORAGE_ACCOUNT_NAME --query "primaryEndpoints.file" --output tsv | tr -d '"')
SMB_PATH=$(echo $HTTP_ENDPOINT | cut -c7-${#HTTP_ENDPOINT})$FILE_SHARE_NAME

STORAGE_ACCOUNT_KEY=$(az storage account keys list --resource-group $RESOURCE_GROUP_NAME --account-name $STORAGE_ACCOUNT_NAME --query "[0].value" --output tsv | tr -d '"')

sudo mount -t cifs $SMB_PATH $MNT_PATH -o credentials=$SMB_CREDENTIAL_FILE,serverino,nosharesock,actimeo=30,mfsymlinks

ls -al $MNT_PATH/TestData
apt-get update
apt-get install -y uuid-runtime
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running apt-get without sudo while shell doesn't specify root privileges - this may fail on non-root runners

Suggested change
- name: Install uuidgen
shell: bash -x -e -u -o pipefail {0}
if: ${{ contains(inputs.runner, 'aws') }}
run: |
sudo apt update
sudo apt install -y cifs-utils
RESOURCE_GROUP_NAME="azure-gpu-vm-runner_group"
STORAGE_ACCOUNT_NAME="nemocistorageaccount2"
FILE_SHARE_NAME="fileshare"
MNT_ROOT="/media"
MNT_PATH="$MNT_ROOT/$STORAGE_ACCOUNT_NAME/$FILE_SHARE_NAME"
echo "MNT_PATH=$MNT_PATH" | tee -a "$GITHUB_OUTPUT"
sudo mkdir -p $MNT_PATH
# Create a folder to store the credentials for this storage account and
# any other that you might set up.
CREDENTIAL_ROOT="/etc/smbcredentials"
sudo mkdir -p "/etc/smbcredentials"
# Get the storage account key for the indicated storage account.
# You must be logged in with az login and your user identity must have
# permissions to list the storage account keys for this command to work.
STORAGE_ACCOUNT_KEY=$(az storage account keys list \
--resource-group $RESOURCE_GROUP_NAME \
--account-name $STORAGE_ACCOUNT_NAME \
--query "[0].value" --output tsv | tr -d '"')
# Create the credential file for this individual storage account
SMB_CREDENTIAL_FILE="$CREDENTIAL_ROOT/$STORAGE_ACCOUNT_NAME.cred"
if [ ! -f $SMB_CREDENTIAL_FILE ]; then
echo "username=$STORAGE_ACCOUNT_NAME" | sudo tee $SMB_CREDENTIAL_FILE > /dev/null
echo "password=$STORAGE_ACCOUNT_KEY" | sudo tee -a $SMB_CREDENTIAL_FILE > /dev/null
else
echo "The credential file $SMB_CREDENTIAL_FILE already exists, and was not modified."
fi
# Change permissions on the credential file so only root can read or modify the password file.
sudo chmod 600 $SMB_CREDENTIAL_FILE
# This command assumes you have logged in with az login
HTTP_ENDPOINT=$(az storage account show --resource-group $RESOURCE_GROUP_NAME --name $STORAGE_ACCOUNT_NAME --query "primaryEndpoints.file" --output tsv | tr -d '"')
SMB_PATH=$(echo $HTTP_ENDPOINT | cut -c7-${#HTTP_ENDPOINT})$FILE_SHARE_NAME
STORAGE_ACCOUNT_KEY=$(az storage account keys list --resource-group $RESOURCE_GROUP_NAME --account-name $STORAGE_ACCOUNT_NAME --query "[0].value" --output tsv | tr -d '"')
sudo mount -t cifs $SMB_PATH $MNT_PATH -o credentials=$SMB_CREDENTIAL_FILE,serverino,nosharesock,actimeo=30,mfsymlinks
ls -al $MNT_PATH/TestData
apt-get update
apt-get install -y uuid-runtime
- name: Install uuidgen
shell: bash -x -e -u -o pipefail {0}
if: ${{ contains(inputs.runner, 'aws') }}
run: |
sudo apt-get update
sudo apt-get install -y uuid-runtime

Comment on lines +426 to +429
cat > ${{ github.workspace }}/.run_install.sh << 'SCRIPT'
set -e
apt-get update && apt-get install -y --no-install-recommends git curl
curl -LsSf https://astral.sh/uv/install.sh | sh
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

piping remote script directly to shell (curl | sh) creates security risk - if connection is interrupted or URL is compromised, arbitrary code could execute

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 23, 2026

Additional Comments (1)

.github/actions/test-template/action.yml
references undefined variable - steps.azure-fileshare.outputs.mnt_path is no longer available after Azure Fileshare step was removed

        # MNT_PATH is no longer needed as we use the parameterized test-data-path

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +81 to +82
apt-get update
apt-get install -y uuid-runtime
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apt-get commands run without sudo but shell doesn't specify root context - may fail on runners without root privileges

Comment thread .github/workflows/install-test.yml Outdated
uv run python -c "import sys; print('Python:', sys.version);"
uv run python -c "from nemo_curator import package_info; print(f'nemo_curator v{package_info.__version__}')"
uv run python -c "import cudf, cuml; print('cudf, cuml');"
uv run python -c "import torch; print(f'torch {torch.__version__} (CUDA: {assert not torch.cuda.is_available()})');"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid Python syntax - assert is a statement, not an expression; this will cause a SyntaxError

Suggested change
uv run python -c "import torch; print(f'torch {torch.__version__} (CUDA: {assert not torch.cuda.is_available()})');"
uv run python -c "import torch; print(f'torch {torch.__version__} (CUDA: {torch.cuda.is_available()})');"

cat > ${{ github.workspace }}/.run_install.sh << 'SCRIPT'
set -e
apt-get update && apt-get install -y --no-install-recommends git curl
curl -LsSf https://astral.sh/uv/install.sh | sh
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

piping remote script to shell (curl | sh) without verification creates security risk - if connection is interrupted or compromised, arbitrary code could execute

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jan 23, 2026

Additional Comments (1)

.github/actions/test-template/action.yml
referencing removed step output - the azure-fileshare step was removed but this line still tries to access its output, will fail at runtime

        MNT_PATH=""

This reverts commit 09fdc5f.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants