Support AMD MIxxx double-die #1097

@benoit-cty

Description:

The AMD Instinct MI250 accelerator contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring power (e.g., via rocm-smi or tools like CodeCarbon), only one GCD reports power usage, while the other reports zero. This prevents accurate energy accounting, especially in HPC/SLURM environments where a job may be allocated a single GCD.

Expected Behavior:

Both GCDs on the same MI250 card should report their individual power consumption, or the total card power should be clearly attributed to the active GCD(s).

Current Behavior:

Only one GCD provides non-zero power readings.
The second GCD always reports 0W, even when under load.
This leads to underestimated energy measurements and complicates per-job accounting.

Steps to Reproduce:

Allocate a single GCD on an MI250 card via SLURM (e.g., --gres=gpu:1).
Run a workload on the allocated GCD.
Use rocm-smi --showpower or similar tools to monitor energy.
Observe that the second GCD (on the same card) reports 0W, despite the card’s total power draw.
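
The observation in the last step can also be checked programmatically. The following is a minimal sketch, assuming that `rocm-smi --showpower --json` is available and emits one object per logical device (keys such as `card0`), each with a power field whose name contains "Power (W)". The exact field name differs between ROCm versions, so the parser matches on a substring. The helper names and the sample payload are hypothetical and only illustrate the reported behavior.

```python
import json
import subprocess


def parse_power(json_text):
    """Return {device_name: watts} from rocm-smi-style JSON output.

    Assumption: each top-level key is a logical device (e.g. "card0", which
    on an MI250 denotes one GCD, not a physical card) mapping to a dict with
    one field whose name contains "Power (W)". Devices without a parseable
    power value are reported as 0.0.
    """
    readings = {}
    for device, fields in json.loads(json_text).items():
        if not isinstance(fields, dict):
            continue  # skip entries that are not per-device dictionaries
        watts = 0.0
        for key, value in fields.items():
            if "Power (W)" in key:
                try:
                    watts = float(value)
                except (TypeError, ValueError):
                    watts = 0.0  # e.g. a non-numeric placeholder such as "N/A"
                break
        readings[device] = watts
    return readings


def silent_devices(readings):
    """List devices that report zero power, sorted by name."""
    return sorted(d for d, w in readings.items() if w == 0.0)


def read_power_from_rocm_smi():
    """Query the local rocm-smi binary; requires a ROCm installation."""
    out = subprocess.run(
        ["rocm-smi", "--showpower", "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_power(out)


# Hypothetical payload: two GCDs of one MI250 card, only the first reports power.
SAMPLE = (
    '{"card0": {"Average Graphics Package Power (W)": "312.0"},'
    ' "card1": {"Average Graphics Package Power (W)": "0.0"}}'
)

if __name__ == "__main__":
    print(silent_devices(parse_power(SAMPLE)))
```

On a real node, `silent_devices(read_power_from_rocm_smi())` would list the GCDs that report 0 W while a workload is running.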

Impact:

Inaccurate energy tracking for jobs sharing a card.
Difficulty distinguishing per-GCD power usage.
Tools like CodeCarbon may misreport energy if they rely on per-GCD metrics.

Suggested Fix:

Provide a way to query total card power (sum of both GCDs) when monitoring a single GCD.
Alternatively, expose power readings for both GCDs, even if only one is allocated to a job.
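
To illustrate the first option, here is a minimal sketch of card-level aggregation. It assumes that the two GCDs of a card have consecutive device indices (0 and 1, 2 and 3, and so on). That mapping is not guaranteed and should be verified against the node's topology. All function names are hypothetical and are not part of rocm-smi or CodeCarbon.

```python
def card_index(gcd_index, gcds_per_card=2):
    """Map a GCD index to its physical card (assumes consecutive pairing)."""
    return gcd_index // gcds_per_card


def card_power(gcd_watts, gcds_per_card=2):
    """Sum per-GCD readings {gcd_index: watts} into {card_index: watts}."""
    totals = {}
    for gcd, watts in gcd_watts.items():
        card = card_index(gcd, gcds_per_card)
        totals[card] = totals.get(card, 0.0) + watts
    return totals


def power_for_allocation(gcd_watts, allocated_gcds, gcds_per_card=2):
    """Total power of every card that hosts at least one allocated GCD.

    Each card is counted once, even if both of its GCDs are allocated.
    """
    totals = card_power(gcd_watts, gcds_per_card)
    cards = {card_index(g, gcds_per_card) for g in allocated_gcds}
    return sum(totals.get(c, 0.0) for c in cards)


if __name__ == "__main__":
    # Hypothetical readings: GCDs 1 and 3 report 0 W although their cards draw power.
    readings = {0: 312.0, 1: 0.0, 2: 90.0, 3: 0.0}
    print(card_power(readings))
    print(power_for_allocation(readings, [1]))
```

With this approach, a job allocated only GCD 1 is attributed the full power of card 0 rather than 0 W. The trade-off is that the job is over-attributed when another job is using the other GCD of the same card, so the result is an upper bound rather than an exact per-job figure.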

Context:

This issue affects users in SLURM/HPC environments where fine-grained energy monitoring is critical for carbon footprint tracking and resource management.

Additional Notes:

The MI300 series may have similar behavior; clarification would be helpful.
Workarounds (e.g., manually summing GCDs) are error-prone and not scalable.

Metadata

Labels: enhancement (New feature or request)