check process binding #312

Merged

satishskamath merged 10 commits into EESSI:main from smoors:binding on Feb 12, 2026

Conversation

smoors (Collaborator) commented on Jan 3, 2026

This PR runs a short test in a prerun command to get the process binding, which is then checked with the check_process_binding.py script. The results are written to the job error file.

Fixes #307
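
For illustration, a minimal sketch of what the per-rank probe in get_process_binding.sh could look like (an assumption based on the hwloc-calc output format shown further down, not the actual script):

#!/bin/bash
# Hypothetical per-rank probe (sketch): print the hostname plus the current
# binding of this process, rendered hierarchically with physical indices.
# "hwloc-bind --get" prints the binding as a bitmask, which hwloc-calc
# accepts as a location argument.
echo "$(hostname) $(hwloc-calc -p -H package.numanode.core.pu "$(hwloc-bind --get)")"

Launched under mpirun, each rank would then emit one line like "tcn119 Package:0.NUMANode:0.Core:5.PU:5", which is the format the checker consumes.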

Important

The test currently doesn't fail on binding errors, as we don't yet have a bullet-proof solution for setting the binding in all cases (see also the discussion in #305). So, for now, both errors and warnings are printed as warnings on screen; sanity checks can be added in a follow-up PR.

Example output:

PROCESS BINDING ERROR: wrong number of processes: expected 3, found 5
PROCESS BINDING ERROR: wrong number of nodes: expected 2, found 1
PROCESS BINDING ERROR: wrong number of cpus per process: expected 4, found Counter({2: 5})
PROCESS BINDING WARNING: processes spanning multiple packages: Counter({2: 1})
PROCESS BINDING WARNING: processes spanning multiple numanodes: Counter({2: 3})
PROCESS BINDING WARNING: processes with cores shared by processing units, indicating hyperthreading: Counter({2: 2})
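
The Counter values suggest the checker tallies, per process, how many PUs, packages, and NUMA nodes the binding touches. A rough coreutils equivalent for manually inspecting the probe lines (a sketch only; binding.txt is a hypothetical file holding just those lines):

# Number of processes found (one probe line per process):
grep -c 'Package:' binding.txt
# Number of distinct nodes:
awk '{print $1}' binding.txt | sort -u | wc -l
# Distribution of PUs bound per process (each location token after the
# hostname corresponds to one PU):
awk '{print NF-1}' binding.txt | sort | uniq -c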

Note

I managed to get the correct launcher run command by updating the job resources in the assign_tasks_per_compute_unit function. This also allowed simplifying the OpenFOAM test and making it more robust.

Comment thread on eessi/testsuite/eessi_mixin.py
satishskamath (Collaborator) commented on Jan 9, 2026

I think I found a problem.

[satishk@tcn3 ~]$ hwloc-calc -p -H package.numanode core:0-5
Package:0.NUMANode:0
[satishk@tcn3 ~]$ module unload hwloc/2.9.1-GCCcore-12.3.0
[satishk@tcn3 ~]$ module load hwloc/2.8.0-GCCcore-12.2.0

The following have been reloaded with a version change:
  1) GCCcore/12.3.0 => GCCcore/12.2.0
  2) libpciaccess/0.17-GCCcore-12.3.0 => libpciaccess/0.17-GCCcore-12.2.0
  3) libxml2/2.11.4-GCCcore-12.3.0 => libxml2/2.10.3-GCCcore-12.2.0
  4) numactl/2.0.16-GCCcore-12.3.0 => numactl/2.0.16-GCCcore-12.2.0

[satishk@tcn3 ~]$ hwloc-calc -p -H package.numanode core:0-5
unsupported (non-normal) --hierarchical type numanode

Somewhere between hwloc versions 2.8.0 and 2.9.1 there was a change in how the numanode object type is reported, so this may not work with all OpenMPI versions (which depend on different hwloc versions). The rest of the options are still reported:

[satishk@tcn3 ~]$ hwloc-calc -p -H package.core.pu core:0-5
Package:0.Core:0.PU:0 Package:0.Core:1.PU:1 Package:0.Core:2.PU:2 Package:0.Core:3.PU:3 Package:0.Core:4.PU:4 Package:0.Core:5.PU:5

smoors (Collaborator, Author) commented on Jan 9, 2026

Good catch.
hwloc 2.8.0 is already quite old (2022b toolchain), so I'll add a fallback that skips the NUMA node level if it's not supported.
The NUMA check is nice to have but not super critical, IMHO; the Package check is more important.
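
Such a fallback could look roughly like this (a sketch under two assumptions not taken from the actual patch: that hwloc-calc exits nonzero when it rejects the numanode level, and that the probe script computes the binding bitmask itself):

# Hypothetical fallback (sketch): try the NUMANode level first, drop it on
# hwloc versions that reject "numanode" as a hierarchical type.
bitmap=$(hwloc-bind --get)
binding=$(hwloc-calc -p -H package.numanode.core.pu "$bitmap" 2>/dev/null) ||
    binding=$(hwloc-calc -p -H package.core.pu "$bitmap")
echo "$(hostname) $binding"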

smoors (Collaborator, Author) commented on Jan 10, 2026

@satishskamath Fallback added. I also added a check for the number of nodes.

casparvl (Collaborator) left a comment


Since this requires hwloc-calc, we should check that it's available and only run the check if it is; otherwise, skip it and print a warning.

satishskamath (Collaborator) commented on Feb 12, 2026

I checked the generated OpenFOAM job script:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OPENFOAM_LID_DRIVEN_CAVITY_1M_278524e7"
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=1:0:0
#SBATCH -p rome
#SBATCH --export=None
#SBATCH --mem=218749M
module load EESSI/2023.06
module load OpenFOAM/v2312-foss-2023a
export OMP_NUM_THREADS=1
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=1:compact
export OMPI_MCA_rmaps_base_mapping_policy=slot:PE=1
cd ./cavity3D/1M/fixedTol
source $FOAM_BASH
foamDictionary -entry numberOfSubdomains -set 128 system/decomposeParDict
blockMesh 2>&1 | tee log.blockMesh
mpirun -np 128 redistributePar -decompose -parallel 2>&1 | tee log.decompose
mpirun -np 128 renumberMesh -parallel -overwrite 2>&1 | tee log.renumberMesh
if command -v hwloc-calc >/dev/null; then
mpirun -np 128 /gpfs/home5/satishk/projects/test-suite/eessi/testsuite/get_process_binding.sh | tee /dev/stderr | /gpfs/home5/satishk/projects/test-suite/eessi/testsuite/check_process_binding.py --cpus-per-proc 1 --procs 128 --nodes 1
else
echo 'PROCESS BINDING WARNING: hwloc not available, skipping process binding check' >/dev/stderr
fi
mpirun -np 128 icoFoam -parallel 2>&1 | tee log.icofoam
echo "EESSI_CVMFS_REPO: $EESSI_CVMFS_REPO"
echo "EESSI_SOFTWARE_SUBDIR: $EESSI_SOFTWARE_SUBDIR"
echo "FULL_MODULEPATH: $(module --location show OpenFOAM/v2312-foss-2023a 2>&1)"

Output in rfm.err:

tcn119 Package:0.NUMANode:0.Core:5.PU:5
tcn119 Package:0.NUMANode:0.Core:7.PU:7
tcn119 Package:0.NUMANode:0.Core:6.PU:6
tcn119 Package:0.NUMANode:0.Core:4.PU:4
tcn119 Package:0.NUMANode:0.Core:2.PU:2
tcn119 Package:0.NUMANode:0.Core:3.PU:3
tcn119 Package:0.NUMANode:0.Core:1.PU:1
tcn119 Package:0.NUMANode:1.Core:16.PU:16
tcn119 Package:0.NUMANode:0.Core:0.PU:0
tcn119 Package:0.NUMANode:1.Core:18.PU:18
tcn119 Package:0.NUMANode:0.Core:10.PU:10
tcn119 Package:0.NUMANode:0.Core:9.PU:9
tcn119 Package:0.NUMANode:0.Core:8.PU:8
tcn119 Package:0.NUMANode:0.Core:14.PU:14
tcn119 Package:0.NUMANode:0.Core:12.PU:12
tcn119 Package:0.NUMANode:1.Core:22.PU:22
tcn119 Package:0.NUMANode:1.Core:20.PU:20
tcn119 Package:0.NUMANode:0.Core:11.PU:11
tcn119 Package:0.NUMANode:0.Core:13.PU:13
tcn119 Package:0.NUMANode:1.Core:28.PU:28
tcn119 Package:0.NUMANode:1.Core:30.PU:30
tcn119 Package:0.NUMANode:1.Core:26.PU:26
tcn119 Package:0.NUMANode:1.Core:24.PU:24
tcn119 Package:0.NUMANode:1.Core:19.PU:19
tcn119 Package:0.NUMANode:1.Core:17.PU:17
tcn119 Package:0.NUMANode:1.Core:23.PU:23
tcn119 Package:0.NUMANode:2.Core:34.PU:34
tcn119 Package:0.NUMANode:2.Core:32.PU:32
tcn119 Package:0.NUMANode:0.Core:15.PU:15
tcn119 Package:0.NUMANode:1.Core:21.PU:21
tcn119 Package:0.NUMANode:1.Core:25.PU:25
tcn119 Package:0.NUMANode:2.Core:38.PU:38
tcn119 Package:1.NUMANode:4.Core:1.PU:65
tcn119 Package:1.NUMANode:4.Core:2.PU:66
tcn119 Package:1.NUMANode:4.Core:0.PU:64
tcn119 Package:1.NUMANode:4.Core:4.PU:68
tcn119 Package:1.NUMANode:4.Core:6.PU:70
tcn119 Package:1.NUMANode:4.Core:5.PU:69
tcn119 Package:1.NUMANode:4.Core:10.PU:74
tcn119 Package:1.NUMANode:4.Core:8.PU:72
tcn119 Package:0.NUMANode:2.Core:36.PU:36
tcn119 Package:1.NUMANode:4.Core:12.PU:76
tcn119 Package:1.NUMANode:4.Core:14.PU:78
tcn119 Package:1.NUMANode:5.Core:16.PU:80
tcn119 Package:1.NUMANode:4.Core:3.PU:67
tcn119 Package:1.NUMANode:4.Core:9.PU:73
tcn119 Package:1.NUMANode:5.Core:22.PU:86
tcn119 Package:1.NUMANode:5.Core:18.PU:82
tcn119 Package:1.NUMANode:5.Core:19.PU:83
tcn119 Package:1.NUMANode:4.Core:15.PU:79
tcn119 Package:1.NUMANode:5.Core:20.PU:84
tcn119 Package:1.NUMANode:5.Core:30.PU:94
tcn119 Package:1.NUMANode:5.Core:28.PU:92
tcn119 Package:1.NUMANode:5.Core:26.PU:90
tcn119 Package:1.NUMANode:4.Core:11.PU:75
tcn119 Package:1.NUMANode:5.Core:27.PU:91
tcn119 Package:0.NUMANode:3.Core:57.PU:57
tcn119 Package:0.NUMANode:3.Core:59.PU:59
tcn119 Package:0.NUMANode:3.Core:58.PU:58
tcn119 Package:0.NUMANode:3.Core:50.PU:50
tcn119 Package:0.NUMANode:3.Core:51.PU:51
tcn119 Package:0.NUMANode:2.Core:44.PU:44
tcn119 Package:0.NUMANode:2.Core:45.PU:45
tcn119 Package:0.NUMANode:2.Core:46.PU:46
tcn119 Package:0.NUMANode:2.Core:47.PU:47
tcn119 Package:0.NUMANode:3.Core:48.PU:48
tcn119 Package:0.NUMANode:3.Core:49.PU:49
tcn119 Package:0.NUMANode:3.Core:52.PU:52
tcn119 Package:0.NUMANode:3.Core:53.PU:53
tcn119 Package:0.NUMANode:3.Core:54.PU:54
tcn119 Package:0.NUMANode:3.Core:55.PU:55
tcn119 Package:0.NUMANode:3.Core:56.PU:56
tcn119 Package:1.NUMANode:4.Core:7.PU:71
tcn119 Package:0.NUMANode:3.Core:60.PU:60
tcn119 Package:0.NUMANode:3.Core:61.PU:61
tcn119 Package:0.NUMANode:3.Core:62.PU:62
tcn119 Package:0.NUMANode:3.Core:63.PU:63
tcn119 Package:0.NUMANode:1.Core:27.PU:27
tcn119 Package:0.NUMANode:1.Core:29.PU:29
tcn119 Package:0.NUMANode:1.Core:31.PU:31
tcn119 Package:0.NUMANode:2.Core:33.PU:33
tcn119 Package:0.NUMANode:2.Core:35.PU:35
tcn119 Package:0.NUMANode:2.Core:37.PU:37
tcn119 Package:0.NUMANode:2.Core:39.PU:39
tcn119 Package:0.NUMANode:2.Core:40.PU:40
tcn119 Package:0.NUMANode:2.Core:41.PU:41
tcn119 Package:0.NUMANode:2.Core:42.PU:42
tcn119 Package:0.NUMANode:2.Core:43.PU:43
tcn119 Package:1.NUMANode:7.Core:57.PU:121
tcn119 Package:1.NUMANode:7.Core:58.PU:122
tcn119 Package:1.NUMANode:7.Core:59.PU:123
tcn119 Package:1.NUMANode:7.Core:60.PU:124
tcn119 Package:1.NUMANode:7.Core:61.PU:125
tcn119 Package:1.NUMANode:7.Core:63.PU:127
tcn119 Package:1.NUMANode:7.Core:62.PU:126
tcn119 Package:1.NUMANode:7.Core:51.PU:115
tcn119 Package:1.NUMANode:7.Core:52.PU:116
tcn119 Package:1.NUMANode:7.Core:53.PU:117
tcn119 Package:1.NUMANode:6.Core:46.PU:110
tcn119 Package:1.NUMANode:7.Core:49.PU:113
tcn119 Package:1.NUMANode:6.Core:45.PU:109
tcn119 Package:1.NUMANode:7.Core:50.PU:114
tcn119 Package:1.NUMANode:6.Core:35.PU:99
tcn119 Package:1.NUMANode:7.Core:56.PU:120
tcn119 Package:1.NUMANode:6.Core:39.PU:103
tcn119 Package:1.NUMANode:7.Core:55.PU:119
tcn119 Package:1.NUMANode:7.Core:54.PU:118
tcn119 Package:1.NUMANode:6.Core:40.PU:104
tcn119 Package:1.NUMANode:6.Core:47.PU:111
tcn119 Package:1.NUMANode:6.Core:34.PU:98
tcn119 Package:1.NUMANode:7.Core:48.PU:112
tcn119 Package:1.NUMANode:6.Core:41.PU:105
tcn119 Package:1.NUMANode:6.Core:42.PU:106
tcn119 Package:1.NUMANode:6.Core:44.PU:108
tcn119 Package:1.NUMANode:6.Core:43.PU:107
tcn119 Package:1.NUMANode:6.Core:36.PU:100
tcn119 Package:1.NUMANode:6.Core:37.PU:101
tcn119 Package:1.NUMANode:6.Core:33.PU:97
tcn119 Package:1.NUMANode:6.Core:32.PU:96
tcn119 Package:1.NUMANode:6.Core:38.PU:102
tcn119 Package:1.NUMANode:5.Core:29.PU:93
tcn119 Package:1.NUMANode:5.Core:31.PU:95
tcn119 Package:1.NUMANode:5.Core:21.PU:85
tcn119 Package:1.NUMANode:5.Core:23.PU:87
tcn119 Package:1.NUMANode:4.Core:13.PU:77
tcn119 Package:1.NUMANode:5.Core:25.PU:89
tcn119 Package:1.NUMANode:5.Core:24.PU:88
tcn119 Package:1.NUMANode:5.Core:17.PU:81
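
A quick way to sanity-check this single-node, 128-rank listing by hand (a sketch, assuming rfm_job.err holds only the probe lines above): every physical PU should be bound exactly once.

# Prints nothing if no PU appears twice:
grep -o 'PU:[0-9]*' rfm_job.err | sort | uniq -d
# Should print 128, matching --procs:
grep -c 'PU:' rfm_job.err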

smoors (Collaborator, Author) commented on Feb 12, 2026

@satishskamath is there nothing in the rfm.err file?

satishskamath (Collaborator) commented on Feb 12, 2026

@smoors Sorry, something went wrong during copy-pasting. I think this can be merged.

satishskamath (Collaborator) commented on Feb 12, 2026

Another concern, which can probably be addressed in another PR, is the time required to do this check. Right now it runs by default for all tests, I assume; for large-scale tests (> 4 nodes), could the overhead become large?

satishskamath (Collaborator) commented on Feb 12, 2026

As a last check, all OpenFOAM test scripts were generated properly:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OPENFOAM_LID_DRIVEN_CAVITY_64M_46eb8ecb"
#SBATCH --ntasks=1024
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=2:0:0
#SBATCH -p rome
#SBATCH --export=None
#SBATCH --mem=218749M
module load EESSI/2023.06
module load OpenFOAM/v2312-foss-2023a
export OMP_NUM_THREADS=1
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=1:compact
export OMPI_MCA_rmaps_base_mapping_policy=slot:PE=1
cd ./cavity3D/64M/fixedTol
source $FOAM_BASH
foamDictionary -entry numberOfSubdomains -set 1024 system/decomposeParDict
blockMesh 2>&1 | tee log.blockMesh
mpirun -np 1024 redistributePar -decompose -parallel 2>&1 | tee log.decompose
mpirun -np 1024 renumberMesh -parallel -overwrite 2>&1 | tee log.renumberMesh
if command -v hwloc-calc >/dev/null; then
mpirun -np 1024 /gpfs/home5/satishk/projects/test-suite/eessi/testsuite/get_process_binding.sh | tee /dev/stderr | /gpfs/home5/satishk/projects/test-suite/eessi/testsuite/check_process_binding.py --cpus-per-proc 1 --procs 1024 --nodes 8
else
echo 'PROCESS BINDING WARNING: hwloc not available, skipping process binding check' >/dev/stderr
fi
mpirun -np 1024 icoFoam -parallel 2>&1 | tee log.icofoam
echo "EESSI_CVMFS_REPO: $EESSI_CVMFS_REPO"
echo "EESSI_SOFTWARE_SUBDIR: $EESSI_SOFTWARE_SUBDIR"
echo "FULL_MODULEPATH: $(module --location show OpenFOAM/v2312-foss-2023a 2>&1)"

satishskamath (Collaborator) left a comment


LGTM!

satishskamath merged commit a337e62 into EESSI:main on Feb 12, 2026
16 checks passed


Development

Successfully merging this pull request may close these issues:

Amend mixin class to run with mpirun ... --report-bindings and do a sanity check on the binding

3 participants