Some hacks still need to be cleaned up. The current wheel `torch==2.10.0+rocm710` ships libraries linked against `cray-mpich/9.0.1`, which causes segfaults, so we replace those library paths with the ones for `cray-mpich/9.1.0`.
cleanup1 #17 (comment)
cleanup2 #17 (comment)
The hacks rewrite the shared-library links in `.venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/lib/` from `libmpi_gnu_112.so.12` to `libmpi_gnu.so.12`, which is the library name used by 9.1. The jobscript can then `LD_PRELOAD` `/opt/cray/pe/mpich/9.1.0/ofi/gnu/11.2/lib/libmpi_gnu.so.12` so that the correct libmpi is loaded.
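
For reference, the relinking step could be scripted roughly as follows. This is only a sketch and not necessarily how the linked cleanup comments do it: it assumes the `patchelf` tool is available and uses its `--replace-needed` option, and the helper names `patchelf_commands` and `relink` are hypothetical.

```python
import subprocess
from pathlib import Path

# Library names taken from the description above.
OLD_NAME = "libmpi_gnu_112.so.12"  # name the wheel's libraries link to (cray-mpich/9.0.1)
NEW_NAME = "libmpi_gnu.so.12"      # name used by cray-mpich/9.1.0


def patchelf_commands(lib_dir, old=OLD_NAME, new=NEW_NAME):
    """Build one patchelf command per shared library found in lib_dir.

    Assumption: patchelf is installed; `--replace-needed OLD NEW FILE`
    rewrites a DT_NEEDED entry from OLD to NEW in FILE.
    """
    libs = sorted(Path(lib_dir).glob("*.so*"))
    return [["patchelf", "--replace-needed", old, new, str(lib)] for lib in libs]


def relink(lib_dir):
    """Run the generated commands (hypothetical helper; modifies files in place)."""
    for cmd in patchelf_commands(lib_dir):
        subprocess.run(cmd, check=True)
```

Calling `relink(".venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/lib")` would apply the change to the venv; printing the result of `patchelf_commands(...)` first gives a dry run. The jobscript side is then a single `export LD_PRELOAD=/opt/cray/pe/mpich/9.1.0/ofi/gnu/11.2/lib/libmpi_gnu.so.12` before launching.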