Skip to content

Add e2e test for kernel recovery after external kill#13694

Draft
rodrigosf672 wants to merge 1 commit into
mainfrom
test/notebook-kernel-external-kill
Draft

Add e2e test for kernel recovery after external kill#13694
rodrigosf672 wants to merge 1 commit into
mainfrom
test/notebook-kernel-external-kill

Conversation

@rodrigosf672
Copy link
Copy Markdown
Member

@rodrigosf672 rodrigosf672 commented May 21, 2026

Summary

  • Adds a Playwright e2e test that kills a Python kernel process with SIGKILL (simulating OOM) and verifies Positron detects the death and allows restart
  • Confirms the kernel supervisor correctly propagates unexpected process termination to the UI

Plan

  • Test passes locally (ran twice, 8.1s and 8.4s)
  • Explored adding this as unit test instead (not feasible)
  • CI green on Linux/Electron (only, to prevent excessive CI workload)

@:positron-notebooks

Verifies that Positron detects an unexpected kernel death (simulating
OOM kill via SIGKILL) and allows the user to restart and resume work.

Closes #12869
@rodrigosf672 rodrigosf672 added area: notebooks-jupyter Issues related to Jupyter .ipynb support area: remote host labels May 21, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

E2E Tests 🚀
This PR will run tests tagged with: @:critical @:positron-notebooks

readme  valid tags

@jonvanausdeln
Copy link
Copy Markdown
Contributor

I wonder if this might be too much of an edge case right now for an e2e test? Anyone else have thoughts?

@testlabauto
Copy link
Copy Markdown
Contributor

I wonder if this might be too much of an edge case right now for an e2e test? Anyone else have thoughts?

I would agree. Calling kill from a test just seems rather extreme.

@rodrigosf672
Copy link
Copy Markdown
Member Author

The test here is trying to mimick a realistic failure when the kernel is unexpectedly terminated e.g., because it was OOM-killed (for notebooks, that could happen due to large datasets, long running processes, etc.).

In that situation, I was thinking how we could best and simply test Positron’s recovery behavior (and calling kill is not the test itself, just a means of triggering the recovery behavior). The question this test tries to answer is: does Positron detect that the kernel died, leave the notebook in a clear state, and allow the user to restart and continue working?

I thought this was an important notebook coverage gap, especially for users working with large datasets or long-running notebook sessions. Do we currently cover unexpected kernel termination and recovery anywhere else?

@testlabauto
Copy link
Copy Markdown
Contributor

The test here is trying to mimick a realistic failure when the kernel is unexpectedly terminated e.g., because it was OOM-killed (for notebooks, that could happen due to large datasets, long running processes, etc.).

In that situation, I was thinking how we could best and simply test Positron’s recovery behavior (and calling kill is not the test itself, just a means of triggering the recovery behavior). The question this test tries to answer is: does Positron detect that the kernel died, leave the notebook in a clear state, and allow the user to restart and continue working?

I thought this was an important notebook coverage gap, especially for users working with large datasets or long-running notebook sessions. Do we currently cover unexpected kernel termination and recovery anywhere else?

I guess it is technically not covered yet. If you feel strongly that it is worthwhile, please run a very complete full suite run (covering all OSs and types) and make sure it is not an issue.

@juliasilge
Copy link
Copy Markdown
Member

Thanks for digging into this, Rodrigo! I do want to push back on the placement of this test and suggest a better home. Some of the concerns with this here are:

  1. This doesn't exercise the scenario from Explore what happens when kernel(s) shut(-s) down while port-forwarding via remote ssh #12869 that is linked. That issue is about kernel death over a port-forwarded remote SSH session. This test runs everything locally, so the transport-layer failure mode (forwarded port collapses while the supervisor stays on the host) isn't covered. The PR title says OOM, but a local SIGKILL is not really an OOM simulation, and OOM-over-remote behaves differently again.

  2. UI-level duplication of notebook-kernel-behavior.test.ts. That test already covers shutdown -> disconnected -> restart -> idle. From the kernel UI's point of view, the only delta here is the trigger (external SIGKILL vs. the Shutdown menu item). The supervisor's mapping of process-exit to the exited/disconnected state doesn't branch on cause at this UI layer.

  3. This is platform-narrow and brittle in ways that add CI cost. execSync('kill -9 ...') is Linux/macOS only, hence the Linux/Electron restriction. The PID is parsed out of cell output with parseInt(outputText!.trim(), 10), which couples a kernel-lifecycle assertion to notebook output formatting. The 30s timeout on disconnected detection is a likely flake source.

Instead, we could take a different approach! The behavior actually being validated here is that Kallichore detects an unexpected child-process death and propagates Exited to its client. That logic is in kcserver/src/kernel_session/process.rs::run_child, which calls child.wait().await and then emits KernelMessage::Exited(code) over the WebSocket. Kallichore is also where the test gap is more real; over there, none of the existing tests SIGKILL the kernel process IIUC. test_kernel_session_shutdown covers the graceful path via shutdown_request, but there's no companion test for unsolicited death.

A Kallichore-side test that:

  • spawns a real Python kernel (the test harness in python_kernel_tests.rs already does this),
  • captures the kernel PID,
  • sends SIGKILL to that PID,
  • asserts the KernelMessage::Exited WebSocket event fires and session status transitions to Exited,

would cover the same intent at a fraction of the runtime, with deterministic event-based assertions instead of UI text parsing, and on the layer where a regression would actually originate. It would also fit naturally next to test_kernel_session_shutdown as its asymmetric counterpart!

For the remote SSH angle from the original issue, that genuinely does want an integration-level test, but it'd need to involve a forwarded port and either a real or stubbed SSH transport, which is a different scope from this PR.

How about we set up some plans for a more appropriate test approach here?

@rodrigosf672 rodrigosf672 marked this pull request as draft May 28, 2026 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: notebooks-jupyter Issues related to Jupyter .ipynb support area: remote host

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants