[Issue]: Wrong GPU count when the User is not in render, video groups #123

@itej89

Problem Description

When madengine is run on a machine where the user is not in the render and video groups, it miscounts the number of GPUs because of the warnings that amd-smi prints ahead of its CSV output.

vpolamre@useocpm2m-097-123:~/CMajor-RL/rl/runs/tasks$ amd-smi list --csv | tail -n +3
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id
0,N/A,N/A,32700,2,0
1,N/A,N/A,3884,3,0
2,N/A,N/A,29122,4,0
3,N/A,N/A,35464,5,0
4,N/A,N/A,46166,6,0
5,N/A,N/A,64654,7,0
6,N/A,N/A,4769,8,0
7,N/A,N/A,6315,9,0
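
One possible way to make the count robust to these warnings is to locate the CSV header by its content instead of skipping a fixed number of lines. The sketch below is only an illustration, not madengine's actual code: `count_gpus` and `SAMPLE` are hypothetical names, and it assumes the header line starts with `gpu,` as in the output above.

```python
import csv
import io

# Shortened copy of the output shown above (two of the seven warnings kept).
SAMPLE = """\
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id
0,N/A,N/A,32700,2,0
1,N/A,N/A,3884,3,0
2,N/A,N/A,29122,4,0
3,N/A,N/A,35464,5,0
4,N/A,N/A,46166,6,0
5,N/A,N/A,64654,7,0
6,N/A,N/A,4769,8,0
7,N/A,N/A,6315,9,0
"""


def count_gpus(amd_smi_output: str) -> int:
    """Count GPU rows in `amd-smi list --csv` style output (hypothetical helper)."""
    lines = [line.strip() for line in amd_smi_output.splitlines()]
    # Find the CSV header by content; assumes it begins with "gpu,".
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("gpu,")),
        None,
    )
    if header_idx is None:
        return 0
    reader = csv.DictReader(io.StringIO("\n".join(lines[header_idx:])))
    # Only rows whose "gpu" field is an integer index count as GPUs, so
    # warnings that appear after the header are ignored as well.
    return sum(1 for row in reader if (row.get("gpu") or "").strip().isdigit())


if __name__ == "__main__":
    print(count_gpus(SAMPLE))  # 8
```

With this approach the number of warning lines no longer matters, whereas a fixed offset such as `tail -n +3` depends on how many lines precede the header.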

Operating System

Ubuntu 22.04.5 LTS

CPU

NA

GPU

MI300X

ROCm Version

rocm-7.0.2

ROCm Component

amdsmi

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

` useocpm2m-097-076
MACHINE NAME is useocpm2m-097-076
ℹ️ Inherited 2 environment variables from shell for Docker
ROCm container ROCM_PATH from image OCI config (ci-sglang_sglang-perf_pyt_sglang.ubuntu.amd): /opt/rocm
MAD_DATA_PROVIDER::huggingface: reordered list of data provider types to: {} ...
MAD_DATA_PROVIDER::huggingface: not found.
MAD_DATA_PROVIDER::huggingface: searched for previously. Reusing ...
pre encap post scripts: {'pre_scripts': [{'path': 'scripts/common/pre_scripts/run_rocenv_tool.sh', 'args': 'sglang_sglang-perf_env'}], 'encapsulate_script': '', 'post_scripts': []}
NGPUS requested is ALL (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16).
NGPUS requested is 17 out of 17
⠧ Building and running models...❌ Failed to run sglang/sglang-perf: list index out of range
Created performance CSV file: perf.csv

hostname
useocpm2m-097-076 `
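
Since the wrong count only shows up when the user lacks the render and video groups, a preflight check could report the condition before any GPU counting happens. This is a minimal sketch using only the Python standard library on Linux; `missing_groups` is a hypothetical helper, not part of madengine or amd-smi.

```python
import grp
import os


def missing_groups(required=("render", "video")):
    """Return the names in `required` that the current process does not belong to."""
    # Supplementary groups plus the effective primary group of this process.
    current_gids = set(os.getgroups()) | {os.getegid()}
    missing = []
    for name in required:
        try:
            gid = grp.getgrnam(name).gr_gid
        except KeyError:
            # The group does not exist on this machine, so the user cannot be in it.
            missing.append(name)
            continue
        if gid not in current_gids:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_groups()
    if absent:
        print("User is not in: " + ", ".join(absent))
```

Note that group membership is read from the running process, so a user who was just added to a group needs a new login session before the check reflects it.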
