@ocaisa ocaisa commented Jan 28, 2026

Right now, if I use init/lmod/bash somewhere where I already have Lmod available I get

# Run the module command in my shell
ocaisa@~$ module av

----------------------------------- /usr/share/lmod/lmod/modulefiles -----------------------------------
   Core/lmod    Core/settarg (D)

  Where:
   D:  Default Module

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

# Source EESSI
ocaisa@~$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
Lmod has detected the following error:  The following module(s) are unknown: "EESSI/2023.06"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "EESSI/2023.06"

Also make sure that all modulefiles written in TCL start with the string #%Module


# ...but the module is available
ocaisa@~$ module av

------------------------ /cvmfs/software.eessi.io/versions/2023.06/init/modules ------------------------
   EESSI/2023.06

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

which is quite confusing (the module is visible but could not be loaded). An issue for this was opened at https://gitlab.com/eessi/support/-/issues/221

This PR fixes that problem by forcing Lmod to do a hard(er) reset when initialising.
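A minimal sketch of what such a hard(er) reset could look like, assuming the stale state lives in the `_ModuleTable001_`-style environment variables that Lmod exports. The helper name `eessi_reset_lmod_state` is hypothetical and this is not the PR's actual code; it only illustrates the idea of clearing Lmod's memory and then exposing a single module tree.

```shell
# Hypothetical helper (not taken from the PR): make the current shell forget
# what Lmod remembers, then expose only one module tree.
eessi_reset_lmod_state() {
    # Lmod exports its module table in chunks named _ModuleTable001_,
    # _ModuleTable002_, ... plus _ModuleTable_Sz_ (the chunk count).
    # Collect the names of all such exported variables and unset them.
    for _eessi_var in $(env | sed -n 's/^\(_ModuleTable[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$_eessi_var"
    done
    unset _eessi_var
    # Start again from a MODULEPATH that holds only the tree passed as $1.
    export MODULEPATH="$1"
}
```

An init script could call `eessi_reset_lmod_state "${EESSI_CVMFS_REPO}/init/modules"` before running `module load`, so that the load is resolved against the new MODULEPATH rather than against state left over from the system Lmod.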

The PR also introduces new capabilities: a site can append to MODULEPATH and can add additional default modules that are loaded before or after the EESSI module.

An example of such a configuration (at a site where the original problem was seen, and which requires specific modules to be loaded in order to submit jobs to the queuing system):

Last login: Thu Jan 29 16:19:58 2026 from 10.141.10.61
[vsc50801@gligar10 ~]$ export EESSI_VERSION=2025.06
[vsc50801@gligar10 ~]$ module list

Currently Loaded Modules:
  1) env/vsc/doduo (S)   2) env/slurm/doduo (S)   3) env/software/doduo (S)   4) cluster/doduo (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge
[vsc50801@gligar10 ~]$ cat configure_eessi.env
export EESSI_DEFAULT_MODULES_APPEND="cluster/default"
export EESSI_EXTRA_MODULEPATH="$PWD/vsc:/etc/modulefiles/vsc"
export EESSI_MODULE_FAMILY_NAME="env_software"
export EESSI_MODULE_STICKY=1
[vsc50801@gligar10 ~]$ source configure_eessi.env
[vsc50801@gligar10 ~]$ cd software-layer-scripts/
[vsc50801@gligar10 software-layer-scripts]$ source init/lmod/bash
Module for EESSI/2025.06 loaded successfully (requires '--force' option to unload or purge)
[vsc50801@gligar10 software-layer-scripts]$ module list

Currently Loaded Modules:
  1) EESSI/2025.06 (S)   2) env/vsc/doduo (S)   3) env/slurm/doduo (S)   4) cluster/doduo (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge
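
The configuration above could be consumed by an init script roughly as follows. This is a sketch under stated assumptions: the helper names `eessi_build_modulepath` and `eessi_modules_to_load` are hypothetical, and `EESSI_DEFAULT_MODULES_PREPEND` is an assumed counterpart to `EESSI_DEFAULT_MODULES_APPEND` (only the latter appears in the example). The values are passed through unchanged, so nothing is assumed about how several module names are separated.

```shell
# Hypothetical helper: the EESSI module tree plus optional site additions.
eessi_build_modulepath() {
    # $1 is the EESSI module tree; EESSI_EXTRA_MODULEPATH is appended
    # (colon-separated) only when it is set and non-empty.
    printf '%s\n' "$1${EESSI_EXTRA_MODULEPATH:+:$EESSI_EXTRA_MODULEPATH}"
}

# Hypothetical helper: ordered list of modules to hand to "module load".
eessi_modules_to_load() {
    # $1 is the EESSI module, for example EESSI/2025.06.
    # EESSI_DEFAULT_MODULES_PREPEND is an assumed variable name, not shown in the PR.
    printf '%s\n' "${EESSI_DEFAULT_MODULES_PREPEND:+$EESSI_DEFAULT_MODULES_PREPEND }$1${EESSI_DEFAULT_MODULES_APPEND:+ $EESSI_DEFAULT_MODULES_APPEND}"
}
```

With the `configure_eessi.env` shown above sourced, `eessi_modules_to_load EESSI/2025.06` would print the EESSI module followed by `cluster/default`, and the MODULEPATH helper would add the two site directories after the EESSI tree.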

init/lmod/bash Outdated
# and clear out any memory Lmod might have
unset _ModuleTable001_
# Path to top-level module tree
export MODULEPATH="${EESSI_CVMFS_REPO}/init/modules"
Member Author

Note this now exposes all EESSI versions, but loads a specific one

Member Author

@ocaisa ocaisa Jan 28, 2026

We should also allow people to add to this default MODULEPATH

Contributor

@trz42 trz42 left a comment

I left some example comments for bash only; they also apply to the other shells. The naming of some files is off, and the files have no header explaining basic information. All this makes it harder to understand and review what the scripts are supposed to do and whether they actually do it. The use and/or naming of $EESSI_VERSION and $EESSI_DEFAULT_VERSION is confusing.

One more comment regarding using non-pinned code (here it's only in an echo statement though).

General comment: The title of the PR seems misleading. The PR seems to add capabilities. How does this make shell initialisations more robust?

It would be great if the PR included a description: What problem does it solve? What improvement or additional feature does it provide? What is the main idea behind it?

Member Author

ocaisa commented Jan 30, 2026

@trz42 I've hopefully addressed almost everything in the initial review (including adding a description to the PR). I've left the renaming of files to templates as follow-up issue #155, since it would make this PR harder to review (and I don't touch that naming within the PR; it was an existing problem).

@ocaisa ocaisa changed the title from "Make all shell initialisations more robust" to "Make all Lmod-based shell initialisations more robust, add additional features to facilitate site integration" Jan 30, 2026
Member Author

ocaisa commented Jan 30, 2026

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

eessi-bot-deucalion bot commented Jan 30, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.01/pr_153/910482

date | job status | comment
--- | --- | ---
Jan 30 16:27:26 UTC 2026 | submitted | job id 910482 awaits release by job manager
Jan 30 16:28:15 UTC 2026 | released | job awaits launch by Slurm scheduler
Jan 30 16:29:18 UTC 2026 | running | job 910482 is running
Jan 30 16:38:44 UTC 2026 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-910482.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-aarch64-a64fx-17697906840.tar.zst
size: 0 MiB (3261 bytes)
entries: 5
modules under 2023.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/a64fx
2023.06/init/lmod/bash
2023.06/init/lmod/csh
2023.06/init/lmod/fish
2023.06/init/lmod/ksh
2023.06/init/lmod/zsh
Jan 30 16:38:44 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ OK ] ( 5/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:a64fx+default
P: latency: 1.72 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:a64fx+default
P: latency: 1.72 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:a64fx+default
P: bandwidth: 8827.92 MB/s (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:a64fx+default
P: bandwidth: 8743.78 MB/s (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:a64fx+default
P: perf: 581.588 timesteps/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:a64fx+default
P: perf: 580.5 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-910482.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot-deucalion bot commented Jan 30, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.01/pr_153/910483

date | job status | comment
--- | --- | ---
Jan 30 16:27:32 UTC 2026 | submitted | job id 910483 awaits release by job manager
Jan 30 16:28:12 UTC 2026 | released | job awaits launch by Slurm scheduler
Jan 30 16:29:21 UTC 2026 | running | job 910483 is running
Jan 30 16:34:36 UTC 2026 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-910483.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17697906250.tar.zst
size: 0 MiB (3270 bytes)
entries: 5
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/a64fx
2025.06/init/lmod/bash
2025.06/init/lmod/csh
2025.06/init/lmod/fish
2025.06/init/lmod/ksh
2025.06/init/lmod/zsh
Jan 30 16:34:36 UTC 2026 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.86 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 8211.52 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-910483.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
