Test platform: x86_64, CentOS Stream 9, GCC 14.2, OpenMPI 4.1.7. Building with the functional test suite enabled and then running it yields the following failed tests:
96% tests passed, 32 tests failed out of 752
Total Test time (real) = 4376.13 sec
The following tests FAILED:
311 - mpirma_API.func_bsplib_hpget_many (Failed)
312 - mpirma_API.func_bsplib_hpput_many (Failed)
313 - mpirma_API.func_bsplib_hpsend_many (Failed)
360 - mpirma_API.func_lpf_probe_parallel_full (Failed)
361 - mpirma_API.func_lpf_probe_parallel_nested (Failed)
362 - mpirma_API.func_lpf_probe_root (Failed)
378 - mpirma_API.func_lpf_register_local_parallel_multiple (Failed)
404 - hybrid_API.func_bsplib_hpget_many (Failed)
405 - hybrid_API.func_bsplib_hpput_many (Failed)
406 - hybrid_API.func_bsplib_hpsend_many (Failed)
453 - hybrid_API.func_lpf_probe_parallel_full (Failed)
454 - hybrid_API.func_lpf_probe_parallel_nested (Failed)
455 - hybrid_API.func_lpf_probe_root (Failed)
471 - hybrid_API.func_lpf_register_local_parallel_multiple (Failed)
725 - mpirma_COLL.func_lpf_allcombine (Failed)
726 - mpirma_COLL.func_lpf_allgather (Failed)
727 - mpirma_COLL.func_lpf_allgather_overlapped (Failed)
729 - mpirma_COLL.func_lpf_alltoall (Failed)
730 - mpirma_COLL.func_lpf_broadcast (Failed)
733 - mpirma_COLL.func_lpf_collectives_init (Failed)
735 - mpirma_COLL.func_lpf_combine (Failed)
736 - mpirma_COLL.func_lpf_gather (Failed)
738 - mpirma_COLL.func_lpf_scatter (Failed)
739 - hybrid_COLL.func_lpf_allcombine (Failed)
740 - hybrid_COLL.func_lpf_allgather (Failed)
741 - hybrid_COLL.func_lpf_allgather_overlapped (Failed)
743 - hybrid_COLL.func_lpf_alltoall (Failed)
744 - hybrid_COLL.func_lpf_broadcast (Failed)
747 - hybrid_COLL.func_lpf_collectives_init (Failed)
749 - hybrid_COLL.func_lpf_combine (Failed)
750 - hybrid_COLL.func_lpf_gather (Failed)
752 - hybrid_COLL.func_lpf_scatter (Failed)
Errors while running CTest
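The failing tests cluster around the high-performance (hp) BSPlib primitives, lpf_probe, local registration, and the collectives. For context, a minimal use of bsp_hpget is sketched below; this is illustrative only, assumes the classic bsp.h C interface, and is not the body of func_bsplib_hpget_many:

```c
/* Illustrative sketch of a bsp_hpget round trip (classic BSPlib API).
 * Not taken from the LPF test sources. */
#include <bsp.h>
#include <stdlib.h>

int main(void) {
    bsp_begin(bsp_nprocs());
    const int n = 1024;
    int *src = malloc(n * sizeof(int));
    int *dst = malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) src[i] = bsp_pid();

    bsp_push_reg(src, n * sizeof(int)); /* expose src for remote reads  */
    bsp_sync();

    /* bsp_hpget is unbuffered: under the mpirma engine it plausibly
     * maps onto one-sided MPI communication on registered memory. */
    bsp_hpget((bsp_pid() + 1) % bsp_nprocs(), src, 0, dst, n * sizeof(int));
    bsp_sync();                         /* communication completes here */

    bsp_pop_reg(src);
    bsp_sync();                         /* deregistration takes effect  */
    free(src);                          /* only now safe to free        */
    free(dst);
    bsp_end();
    return 0;
}
```

The point of the sketch: with the unbuffered hp primitives, the engine must not release or deregister memory before the underlying communication has completed, which is exactly the invariant the error quoted below reports as violated.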
Re-running the first failed test in verbose mode yields:
ctest -V -R mpirma_API.func_bsplib_hpget_many
UpdateCTestConfiguration from :/home/yzelman/Documents/lpf/build/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/yzelman/Documents/lpf/build/DartConfiguration.tcl
Test project /home/yzelman/Documents/lpf/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 311
Start 311: mpirma_API.func_bsplib_hpget_many
311: Test command: /usr/bin/python3.9 "/home/yzelman/Documents/lpf/build/test_launcher.py" "--engine" "mpirma" "--parallel_launcher" "/home/yzelman/Documents/lpf/build/lpfrun_build" "--min_process_count" "1" "--max_process_count" "5" "--lpf_probe_timer" "0.0" "--expected_return_code" "0" "/home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug" "--gtest_filter=API.func_bsplib_hpget_many" "--gtest_also_run_disabled_tests" "--gtest_output=xml:/home/yzelman/Documents/lpf/build/junit/mpirma_func_bsplib_hpget_many_mpirma_Release_debug"
311: Working Directory: /home/yzelman/Documents/lpf/build/tests/functional
311: Test timeout computed to be: 10000000
311: Running main() from /builddir/build/BUILD/googletest-release-1.11.0/googletest/src/gtest_main.cc
311: Note: Google Test filter = API.func_bsplib_hpget_many
311: [==========] Running 1 test from 1 test suite.
311: [----------] Global test environment set-up.
311: [----------] 1 test from API
311: [ RUN ] API.func_bsplib_hpget_many
311: [localhost:3706043] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0x7f4abcee0000, size 421888). MPI job will now abort.
This is a very clear error message, and it may indeed indicate an issue in the mpirma engine: memory is freed while MPI communication that uses it is still in flight.
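At the MPI level, the class of bug the message describes is the following ordering hazard. This is a hedged sketch, not code from the LPF sources; the buffer size is borrowed from the message above:

```c
/* Sketch of the ordering an RMA origin must respect; freeing 'local'
 * before the flush is the class of error reported above. Not LPF code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Aint n = 421888;          /* size from the error message */
    char *exposed = malloc(n);
    MPI_Win win;
    MPI_Win_create(exposed, n, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    char *local = malloc(n);
    MPI_Get(local, (int)n, MPI_CHAR, rank, 0, (int)n, MPI_CHAR, win);
    /* Freeing 'local' here, before the flush, would be an "attempt to
     * free memory that is still in use by an ongoing MPI
     * communication". */
    MPI_Win_flush_all(win);             /* completes the MPI_Get */
    free(local);                        /* now safe */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    free(exposed);
    MPI_Finalize();
    return 0;
}
```

If the mpirma engine frees or recycles a communication buffer before the corresponding flush or epoch close, this is the kind of message the MPI runtime would be expected to emit.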
For completeness, the verbose output continues:
311: --------------------------------------------------------------------------
311: Primary job terminated normally, but 1 process returned
311: a non-zero exit code. Per user-direction, the job has been aborted.
311: --------------------------------------------------------------------------
311: --------------------------------------------------------------------------
311: mpirun detected that one or more processes exited with non-zero status, thus causing
311: the job to be terminated. The first process to do so was:
311:
311: Process name: [[23552,1],0]
311: Exit code: 1
311: --------------------------------------------------------------------------
311: Run command:
311: ['/home/yzelman/Documents/lpf/build/lpfrun_build', '-engine', 'mpirma', '-n', '1', '/home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug', '--gtest_filter=API.func_bsplib_hpget_many', '--gtest_also_run_disabled_tests', '--gtest_output=xml:/home/yzelman/Documents/lpf/build/junit/mpirma_func_bsplib_hpget_many_mpirma_Release_debug']
311: Test returned code = 1
311: Test /home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug--gtest_filter=API.func_bsplib_hpget_many
311: returned 1
311: expected return code was: 0
1/1 Test #311: mpirma_API.func_bsplib_hpget_many ...***Failed 2.78 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 2.88 sec
The following tests FAILED:
311 - mpirma_API.func_bsplib_hpget_many (Failed)
Errors while running CTest
Output from these tests are in: /home/yzelman/Documents/lpf/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Note that the test failures include #45, which is very likely a separate issue from this one. Resolving this issue may also resolve issue #42 (the OpenMPI implementation used there did not print the clear error message quoted above). Note that this issue also covers failures of the hybrid engine: due to the lack of an ibverbs NIC on the test platform, the hybrid engine falls back to mpirma, so resolving this issue for mpirma will likely resolve it for hybrid as well.