
Non-deterministic errors with recently refactored tag-based zero-engine #58

@KADichev

Description

Even after performing the allocation via verbs->resizeTag(1000) to avoid #59, the test still fails. In the run below, rank 0 exits on signal 4 (illegal instruction):

/storage/home/kdichev/LPF-gitlab2/build/lpfrun_build -engine zero -n 2 /storage/home/kdichev/LPF-gitlab2/build/src/MPI/zero_test --gtest_filter=ZeroTests.resizeMemreg
Running main() from /scratch/kdichev/.spack/stage/spack-stage-googletest-1.14.0-th5nac5n2cvmf3nluwlgarz242h2bug6/spack-src/googletest/src/gtest_main.cc
Note: Google Test filter = ZeroTests.resizeMemreg
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ZeroTests
Running main() from /scratch/kdichev/.spack/stage/spack-stage-googletest-1.14.0-th5nac5n2cvmf3nluwlgarz242h2bug6/spack-src/googletest/src/gtest_main.cc
Note: Google Test filter = ZeroTests.resizeMemreg
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ZeroTests
[srv04:1240383:0:1240385] Caught signal 4 (Illegal instruction: illegal opcode)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x536af6 vs 0x3d6ae8)
==== backtrace (tid:1240385) ====
 0 0x00000000000e1e4c munmap()  ???:0
 1 0x000000000008a39c timer_settime()  ???:0
 2 0x000000000008a868 timer_settime()  ???:0
 3 0x000000000008cfa0 __default_morecore()  ???:0
 4 0x000000000008d778 malloc()  ???:0
 5 0x0000000000029370 get_print_name_buffer()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/orte/util/name_fns.c:106
 6 0x0000000000029370 get_print_name_buffer()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/orte/util/name_fns.c:88
 7 0x00000000000293d4 orte_util_print_jobids()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/orte/util/name_fns.c:171
 8 0x00000000000297c4 orte_util_print_name_args()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/orte/util/name_fns.c:143
 9 0x0000000000098034 _process_name_print_for_opal()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/orte/runtime/orte_init.c:68
10 0x0000000000005870 process_event()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/opal/mca/pmix/pmix3x/pmix3x.c:256
11 0x00000000000803b8 event_process_active_single_queue()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/opal/mca/event/libevent2022/libevent/event.c:1370
12 0x00000000000803b8 event_process_active()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/opal/mca/event/libevent2022/libevent/event.c:1440
13 0x00000000000803b8 opal_libevent2022_event_base_loop()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/opal/mca/event/libevent2022/libevent/event.c:1644
14 0x000000000003c6cc progress_engine()  /build-result/src/hpcx-v2.13.1-gcc-MLNX_OFED_LINUX-5-ubuntu22.04-cuda11-gdrcopy2-nccl2.12-aarch64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/opal/runtime/opal_progress_threads.c:105
15 0x000000000007d5b8 pthread_condattr_setpshared()  ???:0
16 0x00000000000e5edc clone()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1240383 on node srv04 exited on signal 4 (Illegal instruction).
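
Since the failure is non-deterministic, it may help to repeat the run and count how often each exit code occurs. The following is a minimal sketch and not part of LPF: `tally_exit_codes` is a hypothetical helper name, and the run count and timeout are arbitrary assumptions.

```python
import subprocess
import sys
from collections import Counter


def tally_exit_codes(cmd, runs=20, timeout=120):
    """Run `cmd` (a list of arguments) `runs` times and count each exit code.

    Runs that exceed `timeout` seconds are counted under the key "timeout".
    """
    counts = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,  # discard test output, keep only the status
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            counts[result.returncode] += 1
        except subprocess.TimeoutExpired:
            counts["timeout"] += 1
    return counts
```

To use it, pass the launcher command from the log above as a list of arguments, for example the `lpfrun_build` path followed by `-engine`, `zero`, `-n`, `2`, the `zero_test` path and `--gtest_filter=ZeroTests.resizeMemreg`. A result with more than one key would confirm that the outcome varies between runs.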
