Test with OpenMPI 4.1.7 may indicate a bug in mpirma engine #48

@anyzelman

Description

Test platform: x86_64, CentOS Stream 9, GCC 14.2, OpenMPI 4.1.7. Building with the functional test suite enabled and then running it yields the following failed tests:

96% tests passed, 32 tests failed out of 752

Total Test time (real) = 4376.13 sec

The following tests FAILED:
	311 - mpirma_API.func_bsplib_hpget_many (Failed)
	312 - mpirma_API.func_bsplib_hpput_many (Failed)
	313 - mpirma_API.func_bsplib_hpsend_many (Failed)
	360 - mpirma_API.func_lpf_probe_parallel_full (Failed)
	361 - mpirma_API.func_lpf_probe_parallel_nested (Failed)
	362 - mpirma_API.func_lpf_probe_root (Failed)
	378 - mpirma_API.func_lpf_register_local_parallel_multiple (Failed)
	404 - hybrid_API.func_bsplib_hpget_many (Failed)
	405 - hybrid_API.func_bsplib_hpput_many (Failed)
	406 - hybrid_API.func_bsplib_hpsend_many (Failed)
	453 - hybrid_API.func_lpf_probe_parallel_full (Failed)
	454 - hybrid_API.func_lpf_probe_parallel_nested (Failed)
	455 - hybrid_API.func_lpf_probe_root (Failed)
	471 - hybrid_API.func_lpf_register_local_parallel_multiple (Failed)
	725 - mpirma_COLL.func_lpf_allcombine (Failed)
	726 - mpirma_COLL.func_lpf_allgather (Failed)
	727 - mpirma_COLL.func_lpf_allgather_overlapped (Failed)
	729 - mpirma_COLL.func_lpf_alltoall (Failed)
	730 - mpirma_COLL.func_lpf_broadcast (Failed)
	733 - mpirma_COLL.func_lpf_collectives_init (Failed)
	735 - mpirma_COLL.func_lpf_combine (Failed)
	736 - mpirma_COLL.func_lpf_gather (Failed)
	738 - mpirma_COLL.func_lpf_scatter (Failed)
	739 - hybrid_COLL.func_lpf_allcombine (Failed)
	740 - hybrid_COLL.func_lpf_allgather (Failed)
	741 - hybrid_COLL.func_lpf_allgather_overlapped (Failed)
	743 - hybrid_COLL.func_lpf_alltoall (Failed)
	744 - hybrid_COLL.func_lpf_broadcast (Failed)
	747 - hybrid_COLL.func_lpf_collectives_init (Failed)
	749 - hybrid_COLL.func_lpf_combine (Failed)
	750 - hybrid_COLL.func_lpf_gather (Failed)
	752 - hybrid_COLL.func_lpf_scatter (Failed)
Errors while running CTest

Re-running the first failed test in verbose mode yields:

 ctest -V -R mpirma_API.func_bsplib_hpget_many
UpdateCTestConfiguration  from :/home/yzelman/Documents/lpf/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/yzelman/Documents/lpf/build/DartConfiguration.tcl
Test project /home/yzelman/Documents/lpf/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 311
    Start 311: mpirma_API.func_bsplib_hpget_many

311: Test command: /usr/bin/python3.9 "/home/yzelman/Documents/lpf/build/test_launcher.py" "--engine" "mpirma" "--parallel_launcher" "/home/yzelman/Documents/lpf/build/lpfrun_build" "--min_process_count" "1" "--max_process_count" "5" "--lpf_probe_timer" "0.0" "--expected_return_code" "0" "/home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug" "--gtest_filter=API.func_bsplib_hpget_many" "--gtest_also_run_disabled_tests" "--gtest_output=xml:/home/yzelman/Documents/lpf/build/junit/mpirma_func_bsplib_hpget_many_mpirma_Release_debug"
311: Working Directory: /home/yzelman/Documents/lpf/build/tests/functional
311: Test timeout computed to be: 10000000
311: Running main() from /builddir/build/BUILD/googletest-release-1.11.0/googletest/src/gtest_main.cc
311: Note: Google Test filter = API.func_bsplib_hpget_many
311: [==========] Running 1 test from 1 test suite.
311: [----------] Global test environment set-up.
311: [----------] 1 test from API
311: [ RUN      ] API.func_bsplib_hpget_many
311: [localhost:3706043] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0x7f4abcee0000, size 421888).  MPI job will now abort.

This is a very clear error message that may indeed indicate an issue in the mpirma engine; a sketch of the class of misuse it describes is given at the end of this report.

For completeness, the verbose output continues:

311: --------------------------------------------------------------------------
311: Primary job  terminated normally, but 1 process returned
311: a non-zero exit code. Per user-direction, the job has been aborted.
311: --------------------------------------------------------------------------
311: --------------------------------------------------------------------------
311: mpirun detected that one or more processes exited with non-zero status, thus causing
311: the job to be terminated. The first process to do so was:
311: 
311:   Process name: [[23552,1],0]
311:   Exit code:    1
311: --------------------------------------------------------------------------
311: Run command: 
311: ['/home/yzelman/Documents/lpf/build/lpfrun_build', '-engine', 'mpirma', '-n', '1', '/home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug', '--gtest_filter=API.func_bsplib_hpget_many', '--gtest_also_run_disabled_tests', '--gtest_output=xml:/home/yzelman/Documents/lpf/build/junit/mpirma_func_bsplib_hpget_many_mpirma_Release_debug']
311: Test returned code = 1
311: Test /home/yzelman/Documents/lpf/build/tests/functional/func_bsplib_hpget_many_mpirma_Release_debug--gtest_filter=API.func_bsplib_hpget_many
311: returned	1
311: expected return code was: 0
1/1 Test #311: mpirma_API.func_bsplib_hpget_many ...***Failed    2.78 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   2.88 sec

The following tests FAILED:
	311 - mpirma_API.func_bsplib_hpget_many (Failed)
Errors while running CTest
Output from these tests are in: /home/yzelman/Documents/lpf/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

Note that the failing tests include those of #45, which is very likely a separate issue from this one. Resolving this issue may also resolve issue #42 (the OpenMPI implementation used there did not print the clear error message quoted above). Note also that the failures extend to the hybrid engine; this is because the hybrid engine, lacking an ibverbs NIC on this platform, falls back to mpirma, so resolving this issue for mpirma will likely resolve it for the hybrid engine as well.
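
For what it is worth, the abort appears to come from OpenMPI's memory hooks, which fire when free() is called on a buffer that is still involved in an in-flight communication. The following minimal sketch (my own illustration of the misuse class, not the engine's actual code) triggers the same kind of abort by freeing a buffer before waiting on a nonblocking operation on it; the mpirma engine presumably hits the RMA flavour of this, e.g. freeing or deregistering memory while an RMA epoch on it is still open:

/* Minimal sketch of the misuse class OpenMPI reports above:
 * freeing a buffer while a nonblocking MPI operation on it
 * may still be in flight. This is NOT the mpirma engine's
 * code -- only an illustration. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main( int argc, char ** argv ) {
    MPI_Init( &argc, &argv );

    int rank = 0, nprocs = 1;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );

    const int n = 1 << 20;
    char * buffer = malloc( n );
    memset( buffer, rank, n );

    MPI_Request request = MPI_REQUEST_NULL;
    if( nprocs >= 2 && rank == 0 )
        MPI_Isend( buffer, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &request );
    else if( nprocs >= 2 && rank == 1 )
        MPI_Irecv( buffer, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &request );

    /* BUG: the buffer is freed while the communication may still
     * be in flight; OpenMPI's memory hooks can detect this and
     * abort with a message like the one in the log above. */
    free( buffer );

    MPI_Wait( &request, MPI_STATUS_IGNORE );
    MPI_Finalize();
    return 0;
}

The fix for this class of error is to reverse the order: complete the communication first (MPI_Wait, or, for RMA, close the access/exposure epoch and detach the memory), and only then free the buffer.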
