DAOS-17427 control: Restart excluded rank after suicide #16279
tanabarr wants to merge 55 commits into
Conversation
Ticket title is 'Handle engine suicides by automatically restarting the engines'
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-16279/1/execution/node/1466/log
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
f1a124f to e57b0d3
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
e57b0d3 to af7f056
    D_ERROR("failed to handle %u/%u event: " DF_RC "\n", src, type,
            DP_RC(rc));

    rc = kill(getpid(), SIGKILL);
[Discussion] Looking at the code changes I think I'm back to the previously discussed point: whether the engine kills itself, or only informs the server, who will decide what to do. If the engine kills itself, why do we need the RAS event... The server can simply decide whether to restart a killed engine, even without the RAS event, can't it?
[Question] If the ds_notify_rank_suicide call fails, then the engine will still terminate, but will the server restart the engine?
[Discussion] I implemented kill+RAS mainly because it guarantees that the engine process will be killed; the RAS event signifies that it is a suicide as opposed to some other termination. Do we really want to always restart a terminated rank regardless of reason?
[Question] Yes, if the notify call fails then the engine will be terminated but not restarted. If instead we don't terminate the engine on group map change and just send RAS and put the engine into a blocking state, then a failed notify call leaves the engine effectively hung.
do we really want to always restart a terminated rank regardless of reason?
Hmm, it's indeed hard to say for sure. I'd understand if we'd like to begin conservatively, only restarting for certain cases.
I agree we should start conservatively. For engines that crash, or exit for some unknown reason, we don't know if the engine would be OK if we restarted it. Better to let the admin investigate and manually restart the ranks in that case.
@tanabarr, I had the same issue with the CI regarding the 'Unit test beds with memcheck'.
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control pool Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/5/execution/node/1303/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16279/5/testReport/
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
conflicts resolved, doc-only change, CI run no. 5 is still relevant and should be used for PR review
…gine-suicide-restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
daltonbohning left a comment
It seems the new functional tests aren't even running in CI?
There are many lint errors in ftest that need to be resolved or possibly ignored
    def tearDown(self):
        """Clean up after each test method."""
        # Reset restart state for next test method
        # This ensures clean state between sequential tests
        try:
            self.reset_engine_restart_state()
        except Exception as error:
            self.log.error("Failed to reset engine restart state: %s", error)
            self.fail("tearDown failed to reset engine restart state: {}".format(error))
        finally:
            super().tearDown()
We have a way to handle this in the framework by calling
self.register_cleanup(reset_engine_restart_state)
after whatever operation puts the system into the invalid state. So maybe instead of defining this tearDown, we can do that after calling exclude_rank_and_wait_restart the first time.
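A minimal sketch of that pattern, assuming exclude_rank_and_wait_restart and reset_engine_restart_state are the helpers added in this PR and register_cleanup behaves as in the existing ftest framework:

    def test_restart_after_exclude(self):
        """Exclude a rank and rely on registered cleanup instead of tearDown()."""
        rank = self.get_all_ranks()[0]
        # Excluding the rank is the operation that puts the system into the
        # state that later needs undoing
        self.exclude_rank_and_wait_restart(rank)
        # Register the reset so the framework runs it at cleanup time, even
        # if later assertions fail
        self.register_cleanup(self.reset_engine_restart_state)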
| """ | ||
| all_ranks = self.get_all_ranks() | ||
| if len(all_ranks) < 2: | ||
| self.skipTest("Test requires at least 2 ranks") |
It is better to fail because skipping will be silent and easily ignored
Suggested change:
-            self.skipTest("Test requires at least 2 ranks")
+            self.fail("Test requires at least 2 ranks")
        final_incarnation = self.get_rank_incarnation(test_rank)
        if final_incarnation is None:
            self.fail(f"failed to get final incarnation for rank {test_rank}")
It would be better if get_rank_incarnation raised an exception instead of silently returning None
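For illustration, a hedged sketch of that approach; query_system_members is a hypothetical wrapper around dmg system query, and CommandFailure is the ftest exception referenced later in this PR:

    def get_rank_incarnation(self, rank):
        """Return the current incarnation of a rank, raising on failure."""
        # query_system_members is assumed to return the parsed member list
        # from a dmg system query
        for member in self.query_system_members():
            if member["rank"] == rank:
                return member["incarnation"]
        # Raise instead of returning None so callers fail loudly at the call site
        raise CommandFailure(f"rank {rank} not found in system query output")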
        self.server_managers[0].system_stop()
        time.sleep(2)
        self.server_managers[0].system_start()
Where does the arbitrary 2s sleep come from? This will eventually be a problem and we will have to revisit
will remove if you don't think it necessary for a system restart
If we don't need it at all then that is great, but I was more meaning that 2s seems arbitrary. And historically arbitrary sleeps have been an issue that we have to revisit later
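One way to avoid the arbitrary sleep is a bounded poll on the expected end state. A rough sketch, where all_ranks_joined is a hypothetical helper, the 60s deadline is illustrative, and time is already imported by the test:

        self.server_managers[0].system_stop()
        self.server_managers[0].system_start()

        # Poll for the expected state instead of sleeping for an arbitrary 2s
        deadline = time.time() + 60
        while not self.all_ranks_joined():
            if time.time() >= deadline:
                self.fail("ranks did not rejoin within 60s of system restart")
            time.sleep(5)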
Test-tag: pr control full_regression Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16279/28/execution/node/942/log |
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com> Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…ndFailure exception for helpers and register cleanup in setUp for each test class Test-tag: hw,medium,dmg,control,engine_auto_restart Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…gine-suicide-restart Test-tag: hw,medium,dmg,control,engine_auto_restart pr Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
        self.log_step(f"Waiting for rank {rank} to self-terminate")
        time.sleep(2)

        # Check if rank is adminexcluded
Similar to some other cases: how do we know 2s is enough? If there really is not a deterministic way to know at this code level, this is fine, but eventually this kind of thing needs to be revisited.
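If the goal is just to observe the rank leaving the group, one option is to poll the rank's reported state rather than sleeping. A sketch assuming the server manager's check_rank_state helper is available; the state names and check count are illustrative:

        self.log_step(f"Waiting for rank {rank} to self-terminate")
        # Poll the rank's state via dmg system query instead of a fixed 2s sleep
        if not self.server_managers[0].check_rank_state(
                ranks=[rank], valid_states=["adminexcluded", "excluded"], max_checks=30):
            self.fail(f"rank {rank} did not reach an excluded state in time")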
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16279/33/display/redirect |
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Co-authored-by: Dalton Bohning <dalton.bohning@hpe.com> Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test-tag: hw,medium,dmg,control,engine_auto_restart pr Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Should wait_for_resp be true now, in order to ensure that the event processing doesn't race with the SIGKILL? I don't really see a downside...
    rank := req.rank
    instance := req.instance

    mgr.log.Debugf("processing restart request for rank %d", rank)

    canRestart, delay := mgr.canRestartNow(rank)
    if !canRestart {
        mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
            rank, delay.Round(time.Second))

        // Schedule deferred restart
        timer := time.AfterFunc(delay, func() {
            mgr.log.Noticef("deferred restart triggered for rank %d after rate-limit delay", rank)
            mgr.performRestart(ctx, rank, instance)
        })

        // Overwrite any existing pending restart
        mgr.setPendingRestart(rank, timer)
        return
    }

    // Can restart immediately
    mgr.performRestart(ctx, rank, instance)
There is still a race here, I think. If multiple restart requests come in at the same time, there's a window where the manager goroutine and the AfterFunc goroutine could both wind up calling performRestart().
If you rewrite this method to make the decision under the lock, I think you eliminate the race and also easily deal with floods of restart requests:
    rank, instance := req.rank, req.instance

    mgr.mu.Lock()
    if last, ok := mgr.lastRestart[rank]; ok {
        if elapsed := time.Since(last); elapsed < mgr.getMinDelay() {
            // Fast debounce for subsequent requests inside of the delay window
            if _, pending := mgr.pendingRestart[rank]; pending {
                mgr.mu.Unlock()
                mgr.log.Debugf("rank %d already has a deferred restart pending; dropping",
                    rank)
                return
            }
            // First restart request inside of the delay window claims it
            remaining := mgr.getMinDelay() - elapsed
            mgr.pendingRestart[rank] = time.AfterFunc(remaining, func() {
                mgr.requestRestart(rank, instance)
            })
            mgr.mu.Unlock()
            mgr.log.Noticef("rank %d restart rate limited: will restart in %s",
                rank, remaining.Round(time.Second))
            return
        }
    }
    // If this is the first restart or it's outside of the delay window, start the process (over)
    mgr.lastRestart[rank] = time.Now()
    delete(mgr.pendingRestart, rank)
    mgr.mu.Unlock()

    if err := waitForEngineStopped(ctx, []Engine{instance}); err != nil {
        mgr.log.Errorf("rank %d did not stop before restart: %s", rank, err)
        return
    }

    mgr.log.Noticef("restart manager is restarting rank %d", rank)
    instance.requestStart(ctx)

bba7c65 to 8c14077
When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.
The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once an engine rejoins, its
rank will still need to be reintegrated into pools as a separate
follow-up step.
Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
Functional tests for the automatic engine restart feature are included,
with cases verifying disabling, rate-limiting and configuration support.
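As an illustration of the shape of one such case (a sketch only; exclude_rank_and_wait_restart and get_rank_incarnation are the helpers discussed in the review above):

    def test_engine_auto_restart(self):
        """Exclude a rank and verify the engine restarts under a new incarnation."""
        rank = self.get_all_ranks()[-1]
        before = self.get_rank_incarnation(rank)
        self.exclude_rank_and_wait_restart(rank)
        after = self.get_rank_incarnation(rank)
        if after <= before:
            self.fail(f"rank {rank} incarnation did not advance after auto restart")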
Features: pool control