Add e2e test for kernel recovery after external kill by rodrigosf672 · Pull Request #13694 · posit-dev/positron

rodrigosf672 · 2026-05-21T18:51:42Z

Summary

Adds a Playwright e2e test that kills a Python kernel process with SIGKILL (simulating OOM) and verifies Positron detects the death and allows restart
Confirms the kernel supervisor correctly propagates unexpected process termination to the UI

Plan

Test passes locally (ran twice, 8.1s and 8.4s)
Explored adding this as unit test instead (not feasible)
CI green on Linux/Electron (only, to prevent excessive CI workload)

@:positron-notebooks

Verifies that Positron detects an unexpected kernel death (simulating OOM kill via SIGKILL) and allows the user to restart and resume work. Closes #12869

github-actions · 2026-05-21T18:52:01Z

E2E Tests 🚀
This PR will run tests tagged with: @:critical @:positron-notebooks

^readme ^{valid tags}

jonvanausdeln · 2026-05-21T19:30:23Z

I wonder if this might be too much of an edge case right now for an e2e test? Anyone else have thoughts?

testlabauto · 2026-05-21T19:37:12Z

I wonder if this might be too much of an edge case right now for an e2e test? Anyone else have thoughts?

I would agree. Calling kill from a test just seems rather extreme.

rodrigosf672 · 2026-05-21T19:59:16Z

The test here is trying to mimick a realistic failure when the kernel is unexpectedly terminated e.g., because it was OOM-killed (for notebooks, that could happen due to large datasets, long running processes, etc.).

In that situation, I was thinking how we could best and simply test Positron’s recovery behavior (and calling kill is not the test itself, just a means of triggering the recovery behavior). The question this test tries to answer is: does Positron detect that the kernel died, leave the notebook in a clear state, and allow the user to restart and continue working?

I thought this was an important notebook coverage gap, especially for users working with large datasets or long-running notebook sessions. Do we currently cover unexpected kernel termination and recovery anywhere else?

testlabauto · 2026-05-21T21:00:17Z

The test here is trying to mimick a realistic failure when the kernel is unexpectedly terminated e.g., because it was OOM-killed (for notebooks, that could happen due to large datasets, long running processes, etc.).

In that situation, I was thinking how we could best and simply test Positron’s recovery behavior (and calling kill is not the test itself, just a means of triggering the recovery behavior). The question this test tries to answer is: does Positron detect that the kernel died, leave the notebook in a clear state, and allow the user to restart and continue working?

I thought this was an important notebook coverage gap, especially for users working with large datasets or long-running notebook sessions. Do we currently cover unexpected kernel termination and recovery anywhere else?

I guess it is technically not covered yet. If you feel strongly that it is worthwhile, please run a very complete full suite run (covering all OSs and types) and make sure it is not an issue.

juliasilge · 2026-05-21T21:12:11Z

Thanks for digging into this, Rodrigo! I do want to push back on the placement of this test and suggest a better home. Some of the concerns with this here are:

This doesn't exercise the scenario from Explore what happens when kernel(s) shut(-s) down while port-forwarding via remote ssh #12869 that is linked. That issue is about kernel death over a port-forwarded remote SSH session. This test runs everything locally, so the transport-layer failure mode (forwarded port collapses while the supervisor stays on the host) isn't covered. The PR title says OOM, but a local SIGKILL is not really an OOM simulation, and OOM-over-remote behaves differently again.
UI-level duplication of notebook-kernel-behavior.test.ts. That test already covers shutdown -> disconnected -> restart -> idle. From the kernel UI's point of view, the only delta here is the trigger (external SIGKILL vs. the Shutdown menu item). The supervisor's mapping of process-exit to the exited/disconnected state doesn't branch on cause at this UI layer.
This is platform-narrow and brittle in ways that add CI cost. execSync('kill -9 ...') is Linux/macOS only, hence the Linux/Electron restriction. The PID is parsed out of cell output with parseInt(outputText!.trim(), 10), which couples a kernel-lifecycle assertion to notebook output formatting. The 30s timeout on disconnected detection is a likely flake source.

Instead, we could take a different approach! The behavior actually being validated here is that Kallichore detects an unexpected child-process death and propagates Exited to its client. That logic is in kcserver/src/kernel_session/process.rs::run_child, which calls child.wait().await and then emits KernelMessage::Exited(code) over the WebSocket. Kallichore is also where the test gap is more real; over there, none of the existing tests SIGKILL the kernel process IIUC. test_kernel_session_shutdown covers the graceful path via shutdown_request, but there's no companion test for unsolicited death.

A Kallichore-side test that:

spawns a real Python kernel (the test harness in python_kernel_tests.rs already does this),
captures the kernel PID,
sends SIGKILL to that PID,
asserts the KernelMessage::Exited WebSocket event fires and session status transitions to Exited,

would cover the same intent at a fraction of the runtime, with deterministic event-based assertions instead of UI text parsing, and on the layer where a regression would actually originate. It would also fit naturally next to test_kernel_session_shutdown as its asymmetric counterpart!

For the remote SSH angle from the original issue, that genuinely does want an integration-level test, but it'd need to involve a forwarded port and either a real or stubbed SSH transport, which is a different scope from this PR.

How about we set up some plans for a more appropriate test approach here?

Add e2e test for kernel recovery after external kill

88ab398

Verifies that Positron detects an unexpected kernel death (simulating OOM kill via SIGKILL) and allows the user to restart and resume work. Closes #12869

rodrigosf672 added area: notebooks-jupyter Issues related to Jupyter .ipynb support area: remote host labels May 21, 2026

rodrigosf672 mentioned this pull request May 21, 2026

Explore what happens when kernel(s) shut(-s) down while port-forwarding via remote ssh #12869

Closed

rodrigosf672 requested review from jonvanausdeln, midleman and testlabauto May 21, 2026 19:01

rodrigosf672 mentioned this pull request May 21, 2026

e2e test coverage for unexpected kernel process termination #13695

Open

rodrigosf672 marked this pull request as draft May 28, 2026 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add e2e test for kernel recovery after external kill#13694

Add e2e test for kernel recovery after external kill#13694
rodrigosf672 wants to merge 1 commit into
mainfrom
test/notebook-kernel-external-kill

rodrigosf672 commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026 •

edited

Loading

Uh oh!

jonvanausdeln commented May 21, 2026

Uh oh!

testlabauto commented May 21, 2026

Uh oh!

rodrigosf672 commented May 21, 2026

Uh oh!

testlabauto commented May 21, 2026

Uh oh!

juliasilge commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

rodrigosf672 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Plan

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonvanausdeln commented May 21, 2026

Uh oh!

testlabauto commented May 21, 2026

Uh oh!

rodrigosf672 commented May 21, 2026

Uh oh!

testlabauto commented May 21, 2026

Uh oh!

juliasilge commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rodrigosf672 commented May 21, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading