
DAOS-17427 control: Restart excluded rank after suicide #16279

Open
tanabarr wants to merge 55 commits into master from tanabarr/control-engine-suicide-restart

Conversation

@tanabarr
Contributor

tanabarr commented Apr 17, 2025

When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.

The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the rank will still need to be reintegrated into pools as a separate
follow‑up step.
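As a rough sketch of that flow (every name below — rasEvent, controlPlane, restartEngine, the event ID string — is a hypothetical stand-in, not the code in this PR):

package main

// Illustrative sketch of the self-termination flow described above.
// All identifiers are hypothetical stand-ins, not the PR's actual code.

import "log"

type rasEvent struct {
    id   string // RAS event identifier, e.g. "engine_self_terminated"
    rank uint32
}

type controlPlane struct{}

// restartEngine brings a self-terminated engine back up so its rank can
// rejoin the system group.
func (cp *controlPlane) restartEngine(rank uint32) {
    log.Printf("restarting engine for rank %d", rank)
    // ...start the engine process again...
}

// onRASEvent runs when a local engine forwards a RAS event. By the time
// the event arrives, the engine has already sent it and then killed its
// own process with SIGKILL.
func (cp *controlPlane) onRASEvent(evt rasEvent) {
    if evt.id == "engine_self_terminated" {
        cp.restartEngine(evt.rank)
    }
}

func main() {
    cp := &controlPlane{}
    cp.onRASEvent(rasEvent{id: "engine_self_terminated", rank: 3})
}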

Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
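As a sketch, the new settings might appear in the server config file like this (placement as top-level keys is an assumption; defaults shown):

# Hypothetical excerpt from the server config file (defaults shown).
disable_engine_auto_restart: false  # true turns off automatic restarts entirely
engine_auto_restart_min_delay: 300  # minimum seconds between restarts of a rank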

Functional tests for the automatic engine restart feature are included, with cases verifying disabling, rate-limiting, and configuration support.

Features: pool control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr self-assigned this Apr 17, 2025
@github-actions

github-actions bot commented Apr 17, 2025

Ticket title is 'Handle engine suicides by automatically restarting the engines'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17427

@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Apr 17, 2025
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log

@tanabarr tanabarr changed the title from "DAOS-17427 control: Restart evicted rank after suicide" to "DAOS-17427 control: Restart excluded rank after suicide" Apr 19, 2025
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch 2 times, most recently from f1a124f to e57b0d3 (March 17, 2026 22:57)
@tanabarr tanabarr requested review from kjacque, knard38, liw and mjmac March 17, 2026 22:58
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from e57b0d3 to af7f056 (March 19, 2026 01:17)
@liw
Contributor

liw left a comment

Not requesting changes.

Comment thread src/engine/init.c
D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
        DP_RC(rc));

rc = kill(getpid(), SIGKILL);
Contributor

[Discussion] Looking at the code changes, I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, which will decide what to do. If the engine kills itself, why do we need the RAS event? The server can simply decide whether to restart a killed engine, even without the RAS event, can't it?

[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?

Contributor Author

[Discussion] I implemented kill+RAS mainly because it guarantees that the engine process will be killed; the RAS event signifies that it is a suicide as opposed to some other termination. Do we really want to always restart a terminated rank regardless of reason?

[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If instead we don't terminate the engine on a group map change, and just send the RAS event and put the engine into a blocking state, then a failed notify call would leave the engine effectively hung.

Contributor

@liw Mar 19, 2026

> do we really want to always restart a terminated rank regardless of reason?

Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.

Contributor

I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.

@knard38
Contributor

knard38 commented Mar 19, 2026

@tanabarr, I had the same issue with CI regarding the "Unit test beds with memcheck" stage. In the end, I merged with master and restarted CI; this stage now runs successfully.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control pool

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review March 27, 2026 13:07
@tanabarr tanabarr requested review from a team as code owners March 27, 2026 13:07
@tanabarr tanabarr requested a review from liw March 27, 2026 13:07
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/

@tanabarr
Contributor Author

PR CI run with Features: pool control to give extra coverage, in case automatically restarting engines causes any tests to fail. The BoundaryTest and ListVerboseTest failures are unrelated.

@kjacque @knard38 @mjmac @liw can I get reviews please?

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr
Contributor Author

Conflicts resolved; doc-only change. CI run no. 5 is still relevant and should be used for PR review.

tanabarr added 2 commits May 12, 2026 11:32
…gine-suicide-restart

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daltonbohning
Contributor

daltonbohning left a comment

It seems the new functional tests aren't even running in CI? There are many lint errors in ftest that need to be resolved or possibly ignored.

Comment on lines +21 to +31
    def tearDown(self):
        """Clean up after each test method."""
        # Reset restart state for next test method
        # This ensures clean state between sequential tests
        try:
            self.reset_engine_restart_state()
        except Exception as error:
            self.log.error("Failed to reset engine restart state: %s", error)
            self.fail("tearDown failed to reset engine restart state: {}".format(error))
        finally:
            super().tearDown()
Contributor

We have a way to handle this in the framework by calling

self.register_cleanup(reset_engine_restart_state)

after whatever operation puts the system into the invalid state. So maybe instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time.

Comment thread src/tests/ftest/control/engine_auto_restart.py
"""
all_ranks = self.get_all_ranks()
if len(all_ranks) < 2:
self.skipTest("Test requires at least 2 ranks")
Contributor

It is better to fail because skipping will be silent and easily ignored.

Suggested change:
-            self.skipTest("Test requires at least 2 ranks")
+            self.fail("Test requires at least 2 ranks")

Contributor Author
done

Comment thread src/tests/ftest/control/engine_auto_restart.py
Comment on lines +68 to +70
        final_incarnation = self.get_rank_incarnation(test_rank)
        if final_incarnation is None:
            self.fail(f"failed to get final incarnation for rank {test_rank}")
Contributor

It would be better if get_rank_incarnation raised an exception instead of silently returning None

Contributor Author
done

Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment on lines +214 to +216
self.server_managers[0].system_stop()
time.sleep(2)
self.server_managers[0].system_start()
Contributor

Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit it.

Contributor Author

Will remove it if you don't think it's necessary for a system restart.

Contributor Author
done

Contributor

If we don't need it at all then that is great, but I meant more that 2s seems arbitrary. And historically, arbitrary sleeps have been an issue that we have to revisit later.

Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr and others added 3 commits May 13, 2026 10:17
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Comment thread src/tests/ftest/util/control_test_base.py
tanabarr added 2 commits May 13, 2026 12:51
…ndFailure exception for helpers and register cleanup in setUp for each test class

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…gine-suicide-restart

Test-tag: hw,medium,dmg,control,engine_auto_restart pr
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.py
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.py Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.py Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.yaml Outdated
Comment on lines +94 to +97
self.log_step(f"Waiting for rank {rank} to self-terminate")
time.sleep(2)

# Check if rank is adminexcluded
Contributor

Similar to some other cases: how do we know 2s is enough? If there really is not a deterministic way to know at this code level, this is fine, but eventually this kind of thing needs to be revisited.

tanabarr and others added 3 commits May 13, 2026 21:26
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Collaborator

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16279/33/display/redirect

tanabarr and others added 6 commits May 13, 2026 21:27
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart pr
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from daltonbohning and kjacque May 13, 2026 21:00
Comment thread src/engine/drpc_ras.c
Contributor

Should wait_for_resp be true now, in order to ensure that the event processing doesn't race with the SIGKILL? I don't really see a downside...

Comment on lines +134 to +156
rank := req.rank
instance := req.instance

mgr.log.Debugf("processing restart request for rank %d", rank)

canRestart, delay := mgr.canRestartNow(rank)
if !canRestart {
    mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
        rank, delay.Round(time.Second))

    // Schedule deferred restart
    timer := time.AfterFunc(delay, func() {
        mgr.log.Noticef("deferred restart triggered for rank %d after rate-limit delay", rank)
        mgr.performRestart(ctx, rank, instance)
    })

    // Overwrite any existing pending restart
    mgr.setPendingRestart(rank, timer)
    return
}

// Can restart immediately
mgr.performRestart(ctx, rank, instance)
Contributor

There is still a race here, I think. If multiple restart requests come in at the same time, there's a window where the manager goroutine and the AfterFunc goroutine could both wind up calling performRestart().

If you rewrite this method to make the decision under the lock, I think you eliminate the race and also easily deal with floods of restart requests:

      rank, instance := req.rank, req.instance

      mgr.mu.Lock()
      if last, ok := mgr.lastRestart[rank]; ok {
          if elapsed := time.Since(last); elapsed < mgr.getMinDelay() {
              // Fast debounce for subsequent requests inside of the delay window
              if _, pending := mgr.pendingRestart[rank]; pending {
                  mgr.mu.Unlock()
                  mgr.log.Debugf("rank %d already has a deferred restart pending; dropping",
                      rank)
                  return
              }

              // First restart request inside of the delay window claims it
              remaining := mgr.getMinDelay() - elapsed
              mgr.pendingRestart[rank] = time.AfterFunc(remaining, func() {
                  mgr.requestRestart(rank, instance)
              })
              mgr.mu.Unlock()
              mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
                  rank, remaining.Round(time.Second))
              return
          }
      }
      
      // If this is the first restart or it's outside of the delay window, start the process (over)
      mgr.lastRestart[rank] = time.Now()
      delete(mgr.pendingRestart, rank)
      mgr.mu.Unlock()

      if err := waitForEngineStopped(ctx, []Engine{instance}); err != nil {
          mgr.log.Errorf("rank %d did not stop before restart: %s", rank, err)
          return
      }
      mgr.log.Noticef("restart manager is restarting rank %d", rank)
      instance.requestStart(ctx)

@tanabarr tanabarr requested review from a team as code owners May 14, 2026 15:31
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from bba7c65 to 8c14077 (May 14, 2026 15:32)

Labels

  • control-plane: work on the management infrastructure of the DAOS Control Plane
  • forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.

Development


8 participants