Description:
The AMD Instinct MI250 accelerator contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring power draw (e.g., via rocm-smi or tools like CodeCarbon), only one GCD reports power usage, while the other reports zero. This prevents accurate energy accounting, especially in HPC/SLURM environments where a job may be allocated a single GCD.
Expected Behavior:
Both GCDs on the same MI250 card should report their individual power consumption, or the total card power should be clearly attributed to the active GCD(s).
Current Behavior:
Only one GCD provides non-zero power readings.
The second GCD always reports 0W, even when under load.
This leads to underestimated energy measurements and complicates per-job accounting.
Steps to Reproduce:
Allocate a single GCD on an MI250 card via SLURM (e.g., --gres=gpu:1).
Run a workload on the allocated GCD.
Use rocm-smi --showpower or a similar tool to monitor power draw.
Observe that the second GCD on the same card reports 0W, even though the card as a whole is drawing power.
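To make the observation in the last step easier to check across many nodes, here is a minimal sketch that flags devices reporting zero power. It assumes `rocm-smi --showpower --json` returns one object per device keyed `card0`, `card1`, and so on. The power field name differs between ROCm versions, so the sketch matches any field whose name contains "power"; the `sample` dictionary and its field name are illustrative, not captured output.

```python
import json
import subprocess


def read_power_report():
    """Run rocm-smi and return its parsed JSON output (needs ROCm installed)."""
    result = subprocess.run(
        ["rocm-smi", "--showpower", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def power_by_device(report):
    """Map each device key (e.g. 'card0') to its reported power in watts.

    Assumption: the power field name varies by ROCm version, so the first
    field whose name contains 'power' is used.
    """
    readings = {}
    for device, fields in report.items():
        if not isinstance(fields, dict):
            continue
        for name, value in fields.items():
            if "power" in name.lower():
                try:
                    readings[device] = float(value)
                except (TypeError, ValueError):
                    readings[device] = None  # non-numeric, e.g. "N/A"
                break
    return readings


def zero_reporting(readings):
    """Return the devices whose reading is zero or missing."""
    return sorted(dev for dev, watts in readings.items() if not watts)


# Hypothetical report for one MI250 card (two GCDs); values are made up.
sample = {
    "card0": {"Average Graphics Package Power (W)": "312.0"},
    "card1": {"Average Graphics Package Power (W)": "0.0"},
}
print(zero_reporting(power_by_device(sample)))
```

On a real node, `power_by_device(read_power_report())` would replace the sample data.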
Impact:
Inaccurate energy tracking for jobs sharing a card.
Difficulty distinguishing per-GCD power usage.
Tools like CodeCarbon may misreport energy if they rely on per-GCD metrics.
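The size of the error can be illustrated with a short sketch. Energy is the time integral of power, so a GCD that always reports 0 W yields 0 J regardless of the actual load. The sample values below are made up for illustration, and the trapezoidal integration is only one reasonable way a monitoring tool might accumulate energy; it is not taken from CodeCarbon.

```python
def energy_joules(samples):
    """Integrate (timestamp_s, watts) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Illustrative readings over 20 seconds, sampled every 10 seconds.
reporting_gcd = [(0, 300.0), (10, 300.0), (20, 300.0)]
silent_gcd = [(0, 0.0), (10, 0.0), (20, 0.0)]

print(energy_joules(reporting_gcd))  # energy attributed to the reporting GCD
print(energy_joules(silent_gcd))     # energy attributed to the silent GCD
```

A job pinned to the silent GCD would be billed zero energy under these assumptions, even if it caused most of the card's draw.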
Suggested Fix:
Provide a way to query total card power (sum of both GCDs) when monitoring a single GCD.
Alternatively, expose power readings for both GCDs, even if only one is allocated to a job.
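As a rough sketch of what the first suggestion could look like on the tooling side, the code below sums per-GCD readings into a card total and splits that total across the GCDs a job actually uses. The `gcd_to_card` mapping is hypothetical (this issue does not say how it would be obtained), and the even split between active GCDs is an assumption, not a measured attribution.

```python
from collections import defaultdict


def card_power(readings, gcd_to_card):
    """Sum per-GCD readings (watts) into one total per physical card."""
    totals = defaultdict(float)
    for gcd, watts in readings.items():
        totals[gcd_to_card[gcd]] += watts or 0.0  # treat None as 0
    return dict(totals)


def attribute_power(readings, gcd_to_card, active_gcds):
    """Split each card's total evenly across its active GCDs (assumption)."""
    totals = card_power(readings, gcd_to_card)
    active_per_card = defaultdict(list)
    for gcd in active_gcds:
        active_per_card[gcd_to_card[gcd]].append(gcd)
    shares = {}
    for card, gcds in active_per_card.items():
        for gcd in gcds:
            shares[gcd] = totals[card] / len(gcds)
    return shares


# Illustrative data: two MI250 cards, four GCDs, odd GCDs report 0 W.
readings = {0: 300.0, 1: 0.0, 2: 280.0, 3: 0.0}
gcd_to_card = {0: "A", 1: "A", 2: "B", 3: "B"}

print(attribute_power(readings, gcd_to_card, active_gcds=[1]))
print(attribute_power(readings, gcd_to_card, active_gcds=[2, 3]))
```

With this approach a job on the silent GCD still receives the card total, at the cost of over-attributing when an idle sibling GCD contributes to the draw.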
Context:
This issue affects users in SLURM/HPC environments where fine-grained energy monitoring is critical for carbon footprint tracking and resource management.
Additional Notes:
The MI300 series may have similar behavior; clarification would be helpful.
Workarounds (e.g., manually summing the readings of both GCDs) are error-prone and not scalable.