Azure GPU investigation #61

@vsoch

When parsing the data I noticed that the ECC setting (yes/no) was inconsistent, and at times it seemed random. I think this could be an important finding for our study: from what I am reading, NVIDIA GPUs have ECC (error-correcting code) memory that allows the system to detect when memory errors occur. That sounds great, but enabling it slows down VRAM. Specifically:

Turning ECC on:

  • It reduces the amount of available memory by 12.5%.
  • It makes context synchronization more expensive.
  • Uncoalesced memory transactions are more expensive when ECC is enabled than otherwise.

That is from https://www.cudahandbook.com/. It makes me wonder about the implications of having ECC on for some runs and off for others. I think we will need to look at the data more closely to see how consistent the setting is within each experiment environment and size.
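
To make that consistency check concrete, here is a minimal sketch of how it might look. It assumes the parsed data can be flattened into one record per GPU observation. The field names `env`, `size`, and `ecc`, and the sample values below, are hypothetical and are not taken from our actual data format.

```python
from collections import defaultdict


def summarize_ecc(records):
    """Collect the distinct ECC values seen for each (environment, size) pair.

    Each record is assumed (hypothetically) to be a dict with the keys
    "env", "size", and "ecc". ECC values are normalized to lowercase so
    that "Yes" and "yes" count as the same setting.
    """
    seen = defaultdict(set)
    for rec in records:
        key = (rec["env"], rec["size"])
        seen[key].add(str(rec["ecc"]).strip().lower())
    return {key: sorted(values) for key, values in seen.items()}


def mixed_groups(summary):
    """Return the (environment, size) pairs with more than one ECC value."""
    return sorted(key for key, values in summary.items() if len(values) > 1)


# Hypothetical sample records, for illustration only.
records = [
    {"env": "aks", "size": 4, "ecc": "Yes"},
    {"env": "aks", "size": 4, "ecc": "No"},
    {"env": "aks", "size": 8, "ecc": "yes"},
    {"env": "vm", "size": 4, "ecc": "No"},
]

summary = summarize_ecc(records)
for key in mixed_groups(summary):
    print(f"inconsistent ECC setting in {key}: {summary[key]}")
```

Any group reported here mixes ECC-on and ECC-off GPUs, so results for that environment and size would probably need to be split by ECC setting before comparing them.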
