
DAOS-17427 control: Restart excluded rank after suicide #16279

Open
tanabarr wants to merge 55 commits into master from tanabarr/control-engine-suicide-restart

Conversation

@tanabarr
Contributor

tanabarr commented Apr 17, 2025

When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.

The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the rank will still need to be reintegrated into pools as a separate
follow‑up step.
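As a rough sketch of that flow (every name below — rasEvent, controlPlane, restartEngine, the event ID string — is a hypothetical stand-in, not the code in this PR):

package main

// Illustrative sketch of the self-termination flow described above.
// All identifiers are hypothetical stand-ins, not the PR's actual code.

import "log"

type rasEvent struct {
    id   string // RAS event identifier, e.g. "engine_self_terminated"
    rank uint32
}

type controlPlane struct{}

// restartEngine brings a self-terminated engine back up so its rank can
// rejoin the system group.
func (cp *controlPlane) restartEngine(rank uint32) {
    log.Printf("restarting engine for rank %d", rank)
    // ...start the engine process again...
}

// onRASEvent runs when a local engine forwards a RAS event. By the time
// the event arrives, the engine has already sent it and then killed its
// own process with SIGKILL.
func (cp *controlPlane) onRASEvent(evt rasEvent) {
    if evt.id == "engine_self_terminated" {
        cp.restartEngine(evt.rank)
    }
}

func main() {
    cp := &controlPlane{}
    cp.onRASEvent(rasEvent{id: "engine_self_terminated", rank: 3})
}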

Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
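As a sketch, the new settings might appear in the server config file like this (placement as top-level keys is an assumption; defaults shown):

# Hypothetical excerpt from the server config file (defaults shown).
disable_engine_auto_restart: false  # true turns off automatic restarts entirely
engine_auto_restart_min_delay: 300  # minimum seconds between restarts of a rank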

Functional tests for the automatic engine restart feature are included, with cases verifying disabling, rate-limiting, and configuration support.

Features: pool control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr self-assigned this Apr 17, 2025
@github-actions

github-actions bot commented Apr 17, 2025

Ticket title is 'Handle engine suicides by automatically restarting the engines'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17427

@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Apr 17, 2025
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log

@tanabarr tanabarr changed the title from "DAOS-17427 control: Restart evicted rank after suicide" to "DAOS-17427 control: Restart excluded rank after suicide" Apr 19, 2025
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch 2 times, most recently from f1a124f to e57b0d3 (March 17, 2026 22:57)
@tanabarr tanabarr requested review from kjacque, knard38, liw and mjmac March 17, 2026 22:58
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from e57b0d3 to af7f056 (March 19, 2026 01:17)
@liw
Contributor

liw left a comment

Not requesting changes.

Comment thread src/engine/init.c
D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
        DP_RC(rc));

rc = kill(getpid(), SIGKILL);
Contributor

[Discussion] Looking at the code changes, I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, which will decide what to do. If the engine kills itself, why do we need the RAS event? The server can simply decide whether to restart a killed engine, even without the RAS event, can't it?

[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?

Contributor Author

[Discussion] I implemented kill+RAS mainly because it guarantees that the engine process will be killed; the RAS event signifies that it is a suicide as opposed to some other termination. Do we really want to always restart a terminated rank regardless of reason?

[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If instead we don't terminate the engine on a group map change, and just send the RAS event and put the engine into a blocking state, then a failed notify call would leave the engine effectively hung.

Contributor

@liw Mar 19, 2026

> do we really want to always restart a terminated rank regardless of reason?

Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.

Contributor

I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.

@knard38
Contributor

knard38 commented Mar 19, 2026

@tanabarr, I had the same issue with CI regarding the "Unit test beds with memcheck" stage. In the end, I merged with master and restarted CI; this stage now runs successfully.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control pool

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review March 27, 2026 13:07
@tanabarr tanabarr requested review from a team as code owners March 27, 2026 13:07
@tanabarr tanabarr requested a review from liw March 27, 2026 13:07
@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/

@tanabarr
Contributor Author

PR CI run with Features: pool control to give extra coverage, in case automatically restarting engines causes any tests to fail. The BoundaryTest and ListVerboseTest failures are unrelated.

@kjacque @knard38 @mjmac @liw can I get reviews please?

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr
Contributor Author

Conflicts resolved; doc-only change. CI run no. 5 is still relevant and should be used for PR review.

tanabarr added 2 commits May 12, 2026 11:32
…gine-suicide-restart

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daltonbohning
Contributor

daltonbohning left a comment

It seems the new functional tests aren't even running in CI? There are many lint errors in ftest that need to be resolved or possibly ignored.

Comment on lines +21 to +31
    def tearDown(self):
        """Clean up after each test method."""
        # Reset restart state for next test method
        # This ensures clean state between sequential tests
        try:
            self.reset_engine_restart_state()
        except Exception as error:
            self.log.error("Failed to reset engine restart state: %s", error)
            self.fail("tearDown failed to reset engine restart state: {}".format(error))
        finally:
            super().tearDown()
Contributor

We have a way to handle this in the framework by calling

self.register_cleanup(reset_engine_restart_state)

after whatever operation puts the system into the invalid state. So maybe instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time.

Comment thread src/tests/ftest/control/engine_auto_restart.py
"""
all_ranks = self.get_all_ranks()
if len(all_ranks) < 2:
self.skipTest("Test requires at least 2 ranks")
Contributor

It is better to fail because skipping will be silent and easily ignored.

Suggested change:
-            self.skipTest("Test requires at least 2 ranks")
+            self.fail("Test requires at least 2 ranks")

Contributor Author
done

Comment thread src/tests/ftest/control/engine_auto_restart.py
Comment on lines +68 to +70
        final_incarnation = self.get_rank_incarnation(test_rank)
        if final_incarnation is None:
            self.fail(f"failed to get final incarnation for rank {test_rank}")
Contributor

It would be better if get_rank_incarnation raised an exception instead of silently returning None

Contributor Author
done

Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment thread src/tests/ftest/util/control_test_base.py Outdated
Comment on lines +214 to +216
self.server_managers[0].system_stop()
time.sleep(2)
self.server_managers[0].system_start()
Contributor

Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit it.

Contributor Author

Will remove it if you don't think it's necessary for a system restart.

Contributor Author
done

Contributor

If we don't need it at all then that is great, but I meant more that 2s seems arbitrary. And historically, arbitrary sleeps have been an issue that we have to revisit later.

Test-tag: pr control full_regression
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

tanabarr and others added 3 commits May 13, 2026 10:17
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Comment thread src/tests/ftest/util/control_test_base.py
tanabarr added 2 commits May 13, 2026 12:51
…ndFailure exception for helpers and register cleanup in setUp for each test class

Test-tag: hw,medium,dmg,control,engine_auto_restart
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…gine-suicide-restart

Test-tag: hw,medium,dmg,control,engine_auto_restart pr
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.py
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_advanced.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.py Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.py Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.yaml Outdated
Comment thread src/tests/ftest/control/engine_auto_restart_disabled.yaml Outdated
Comment on lines +94 to +97
self.log_step(f"Waiting for rank {rank} to self-terminate")
time.sleep(2)

# Check if rank is adminexcluded
Contributor

Similar to some other cases: how do we know 2s is enough? If there really is not a deterministic way to know at this code level, this is fine, but eventually this kind of thing needs to be revisited.

tanabarr and others added 3 commits May 13, 2026 21:26
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Collaborator

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16279/33/display/redirect

tanabarr and others added 6 commits May 13, 2026 21:27
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart pr
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from daltonbohning and kjacque May 13, 2026 21:00
Comment thread src/engine/drpc_ras.c
Contributor

Should wait_for_resp be true now, in order to ensure that the event processing doesn't race with the SIGKILL? I don't really see a downside...

Comment on lines +134 to +156
rank := req.rank
instance := req.instance

mgr.log.Debugf("processing restart request for rank %d", rank)

canRestart, delay := mgr.canRestartNow(rank)
if !canRestart {
    mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
        rank, delay.Round(time.Second))

    // Schedule deferred restart
    timer := time.AfterFunc(delay, func() {
        mgr.log.Noticef("deferred restart triggered for rank %d after rate-limit delay", rank)
        mgr.performRestart(ctx, rank, instance)
    })

    // Overwrite any existing pending restart
    mgr.setPendingRestart(rank, timer)
    return
}

// Can restart immediately
mgr.performRestart(ctx, rank, instance)
Contributor

There is still a race here, I think. If multiple restart requests come in at the same time, there's a window where the manager goroutine and the AfterFunc goroutine could both wind up calling performRestart().

If you rewrite this method to make the decision under the lock, I think you eliminate the race and also easily deal with floods of restart requests:

      rank, instance := req.rank, req.instance

      mgr.mu.Lock()
      if last, ok := mgr.lastRestart[rank]; ok {
          if elapsed := time.Since(last); elapsed < mgr.getMinDelay() {
              // Fast debounce for subsequent requests inside of the delay window
              if _, pending := mgr.pendingRestart[rank]; pending {
                  mgr.mu.Unlock()
                  mgr.log.Debugf("rank %d already has a deferred restart pending; dropping",
                      rank)
                  return
              }

              // First restart request inside of the delay window claims it
              remaining := mgr.getMinDelay() - elapsed
              mgr.pendingRestart[rank] = time.AfterFunc(remaining, func() {
                  mgr.requestRestart(rank, instance)
              })
              mgr.mu.Unlock()
              mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
                  rank, remaining.Round(time.Second))
              return
          }
      }
      
      // If this is the first restart or it's outside of the delay window, start the process (over)
      mgr.lastRestart[rank] = time.Now()
      delete(mgr.pendingRestart, rank)
      mgr.mu.Unlock()

      if err := waitForEngineStopped(ctx, []Engine{instance}); err != nil {
          mgr.log.Errorf("rank %d did not stop before restart: %s", rank, err)
          return
      }
      mgr.log.Noticef("restart manager is restarting rank %d", rank)
      instance.requestStart(ctx)

@tanabarr tanabarr requested review from a team as code owners May 14, 2026 15:31
@tanabarr tanabarr force-pushed the tanabarr/control-engine-suicide-restart branch from bba7c65 to 8c14077 (May 14, 2026 15:32)

Labels

  • control-plane: work on the management infrastructure of the DAOS Control Plane
  • forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.

Development


8 participants