@rickgrubin-noaa
Collaborator

Description

Updates to EPIC hosts (RDHPCS on premises, not yet CSPs) for spack-stack/release/2.0

Dependencies

None

Issues addressed

Working towards #1835

Applications affected

None

Systems affected

  • derecho
  • gaea-c6
  • hercules
  • orion
  • ursa

N.B. Once environments are installed on EPIC hosts, the following modulefiles require manual editing:

  • <env dir>/modules/Core/stack-<compiler>/<version>.lua
  • <env dir>/modules/<compiler>/stack-<mpi>/<version>.lua

In each module file, reverse the order of the following two stanzas:

  • -- spack compiler module hierarchy
  • -- prerequisite modules
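The manual edit described above could be scripted. The following is only a minimal sketch, not part of this PR: it assumes each stanza starts at its marker comment and runs until the next blank line, which may not hold for every generated modulefile. `SAMPLE`, `find_stanza`, and `swap_stanzas` are hypothetical names, and the sample paths and module names are made up.

```python
from typing import List, Tuple

# Marker comments named in the description above.
HIERARCHY = "-- spack compiler module hierarchy"
PREREQS = "-- prerequisite modules"

# Hypothetical modulefile content, for illustration only.
SAMPLE = '''help([[hypothetical meta-module]])

-- spack compiler module hierarchy
prepend_path("MODULEPATH", "/hypothetical/env/modules/compiler/1.0")

-- prerequisite modules
load("hypothetical-compiler/1.0")
'''


def find_stanza(lines: List[str], marker: str) -> Tuple[int, int]:
    """Return the [start, end) line range of the stanza introduced by `marker`.

    Assumption: a stanza ends at the next blank line or at end of file.
    """
    for start, line in enumerate(lines):
        if line.strip() == marker:
            break
    else:
        raise ValueError(f"marker not found: {marker!r}")
    end = start + 1
    while end < len(lines) and lines[end].strip():
        end += 1
    return start, end


def swap_stanzas(text: str, a: str = HIERARCHY, b: str = PREREQS) -> str:
    """Exchange the positions of the two stanzas, leaving other lines in place."""
    lines = text.splitlines()
    # Sort the two ranges so (s1, e1) is whichever stanza appears first.
    (s1, e1), (s2, e2) = sorted([find_stanza(lines, a), find_stanza(lines, b)])
    swapped = lines[:s1] + lines[s2:e2] + lines[e1:s2] + lines[s1:e1] + lines[e2:]
    return "\n".join(swapped) + ("\n" if text.endswith("\n") else "")


if __name__ == "__main__":
    print(swap_stanzas(SAMPLE))
```

Running the function twice restores the original text, so it is safe to check the result with a diff before overwriting a real modulefile.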

RDHPCS hosts often (always?) provide modules for commonly used packages (e.g. hdf5, netcdf-c) that are also built within the stack. If the system compiler / MPI modules are loaded after the core stack components, MODULEPATH ends up preferring system-provided package modules over stack-provided ones, which leads to confusion later on.
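To make the ordering effect concrete, here is a toy model, not Lmod itself. It assumes that the hierarchy stanza prepends the stack's module tree to MODULEPATH, that loading a system compiler / MPI module prepends a system tree, that the generated file lists the hierarchy stanza first, and that lookup takes the first directory on the path that provides a name. All directory names and helper functions are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical directories and contents, for illustration only.
STACK_DIR = "/hypothetical/stack/modules"
SYSTEM_DIR = "/hypothetical/system/modules"
PROVIDES: Dict[str, List[str]] = {
    STACK_DIR: ["hdf5", "netcdf-c"],
    SYSTEM_DIR: ["hdf5", "netcdf-c"],
}


def prepend(modulepath: List[str], directory: str) -> List[str]:
    """Put `directory` at the front of the search path, without duplicates."""
    return [directory] + [d for d in modulepath if d != directory]


def resolve(modulepath: List[str], name: str) -> Optional[str]:
    """Return the first directory on the path that provides `name`."""
    for directory in modulepath:
        if name in PROVIDES.get(directory, []):
            return directory
    return None


def build_modulepath(stanza_order: List[str]) -> List[str]:
    """Apply the two stanzas in the given order and return the resulting path."""
    path: List[str] = []
    for stanza in stanza_order:
        if stanza == "hierarchy":
            # Assumed effect: the stack adds its own module tree.
            path = prepend(path, STACK_DIR)
        elif stanza == "prerequisites":
            # Assumed effect: the system compiler / MPI module adds its tree.
            path = prepend(path, SYSTEM_DIR)
    return path


as_generated = build_modulepath(["hierarchy", "prerequisites"])
after_manual_edit = build_modulepath(["prerequisites", "hierarchy"])

print("as generated:     ", resolve(as_generated, "hdf5"))
print("after manual edit:", resolve(after_manual_edit, "hdf5"))
```

In this model, whichever stanza runs last wins, because its directory is prepended last. That is why moving the prerequisite stanza ahead of the hierarchy stanza lets the stack-provided modules take precedence.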

Testing

  • CI: Note whether the automatic tests (GitHub actions tests that run automatically for every commit) pass or not
    • GitHub actions CI tests pass
    • GitHub actions CI tests do not pass (provide explanation)
    • GitHub actions CI tests skipped (provide explanation if necessary)
  • New tests added: List and describe any new tests added to GitHub actions
    • ...
  • Additional testing: Add information on any additional tests conducted
    • ...

Checklist

  • This PR addresses one issue/problem/enhancement or has a very good reason for not doing so.
  • These changes have been tested on the affected systems and applications.
  • All dependency PRs/issues have been resolved and this PR can be merged.
  • All necessary updates to the documentation on readthedocs are included in this PR
    • For site config updates, check in particular doc/source/PreConfiguredSites.rst and doc/source/MaintainersSection.rst
  • All necessary updates to the spack-stack wiki will be made when this PR is merged

climbfuji and others added 30 commits December 11, 2025 17:53
Collaborator

@climbfuji climbfuji left a comment

It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).

But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.

Collaborator

Any particular reason this has the suffix DO NOT USE, whereas the one on Orion doesn't?

Collaborator Author

oneapi@2025.3.1 is not yet installed on hercules; that is expected to happen after the new year.

The pathing is all wrong (it's orion's pathing) but having the yaml file in place makes it easier to edit paths when oneapi@2025.3.1 is installed on hercules.

Collaborator

Why do you need to maintain two different gcc versions for Gaea C6 (and Derecho)?

Collaborator Author

We don't need to maintain two different versions on either host.

Building envs with GNU on those hosts was difficult (GNU is not a requirement on gaea-c6; I was trying it out for the sake of confirming configurations), so I kept two versions, overly hopeful that one of them would work.

Not a problem to pick one version and remove the extraneous one.

Collaborator

Can you switch between this and the non-hpcx version with --compiler=oneapi-2025.2.1-hpcx and --compiler=oneapi-2025.2.1?

Collaborator Author

Indeed you can. 😄

@AlexanderRichert-NOAA
Collaborator

> It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
>
> But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.

Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but, it's an option :)

@climbfuji
Collaborator

> > It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
> > But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.
>
> Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but, it's an option :)

We used to have LMOD_TMOD_FIND_FIRST for Derecho in the past, too, and I did suggest that as a possible solution when the issue was first mentioned. But maybe that doesn't work in this case.

@rickgrubin-noaa
Collaborator Author

> > > It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
> > > But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.
> >
> > Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but, it's an option :)
>
> We used to have LMOD_TMOD_FIND_FIRST for Derecho in the past, too, and I did suggest that as a possible solution when the issue was first mentioned. But maybe that doesn't work in this case.

I don't recall LMOD_TMOD_FIND_FIRST being suggested (but that's user error on my part); I will give it a try.
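For readers unfamiliar with the setting discussed above, here is a simplified toy model of the two lookup rules as I understand them from the Lmod documentation. It is not Lmod's implementation and it ignores marked defaults. Roughly, "find best" picks the highest version across all MODULEPATH directories, while "find first" (what LMOD_TMOD_FIND_FIRST selects) stops at the first directory that provides the name. The directories, versions, and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: directory -> module name -> available versions.
AVAILABLE: Dict[str, Dict[str, List[str]]] = {
    "/hypothetical/stack/modules": {"hdf5": ["1.14.3"]},
    "/hypothetical/system/modules": {"hdf5": ["1.14.5"]},
}
MODULEPATH = ["/hypothetical/stack/modules", "/hypothetical/system/modules"]


def version_key(version: str) -> Tuple[int, ...]:
    """Naive numeric ordering; assumes purely dotted-integer versions."""
    return tuple(int(part) for part in version.split("."))


def find_best(modulepath: List[str], name: str) -> Optional[Tuple[str, str]]:
    """Highest version of `name` across every directory on the path."""
    candidates = [
        (directory, version)
        for directory in modulepath
        for version in AVAILABLE.get(directory, {}).get(name, [])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: version_key(c[1]))


def find_first(modulepath: List[str], name: str) -> Optional[Tuple[str, str]]:
    """Highest version of `name` within the first directory that provides it."""
    for directory in modulepath:
        versions = AVAILABLE.get(directory, {}).get(name, [])
        if versions:
            return directory, max(versions, key=version_key)
    return None


print("find best: ", find_best(MODULEPATH, "hdf5"))
print("find first:", find_first(MODULEPATH, "hdf5"))
```

Under these assumptions, a newer system-provided hdf5 wins under "find best" even though the stack directory comes first on the path, whereas "find first" keeps the stack-provided one. Whether this resolves the situation on the EPIC hosts is exactly what remains to be tried.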
