Make sure we still do writeback on test failure#10097

Open
smklein wants to merge 8 commits into main from baby-write-back

Conversation

@smklein
Collaborator

@smklein smklein commented Mar 19, 2026

Uses SIGTERM for CockroachInstances that are dropped without being cleaned up, so that in-flight writes are propagated back to durable storage for inspection.

Additionally, adds a test which validates this behavior (torn writes should no longer appear with the new SIGTERM-on-drop implementation).

Fixes #10085

// process has exited. This lets us poll with a
// timeout and fall back to SIGKILL.
let deadline = std::time::Instant::now()
    + std::time::Duration::from_secs(5);
Collaborator Author

Open for discussion on how long this should plausibly be. Could be forever, if we don't want to have the SIGKILL fallback below!

Collaborator Author

Ehh, I just removed this for now. One less diverging path: opting not to SIGKILL after SIGTERM.

@smklein smklein marked this pull request as ready for review March 19, 2026 17:12
//
// See: https://github.com/oxidecomputer/omicron/issues/10085
#[tokio::test]
async fn test_data_survives_drop_without_cleanup() {
Collaborator Author

This test reliably fails without the changes to drop.

(If you want to verify this: check out this PR, undo the changes to drop, and it'll fail. This was a prerequisite before I started my implementation -- I wanted to reliably hit the issue @davepacheco saw, and have confidence the SIGTERM approach would fix it.)

@smklein smklein enabled auto-merge (squash) March 19, 2026 23:13
@sunshowers
Contributor

sunshowers commented Mar 19, 2026

You may want to wait for Dave to accept this too, but I'm fine with this.

More background. There are three options:

  1. flush WAL buffer on write, then fsync it
  2. flush WAL buffer on write, but do not fsync
  3. do not flush buffers

1 protects against both crdb and the OS crashing, 2 protects against crdb suddenly terminating but not the OS crashing, and 3 does not protect against either.

With crdb, the only options currently available are 1 and 3. 2 is ideal but would require a functional change to CockroachDB (I believe). What we're doing with this change is option 3, plus not suddenly terminating the cockroach process on drop. We would still be susceptible to issues from either the cockroachdb process segfaulting or something else sending it a SIGKILL, though.
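The three durability options above can be illustrated with a minimal, std-only Rust analogy. This is not CockroachDB code: a `BufWriter` stands in for Pebble's userland WAL buffer, `flush` stands in for handing bytes to the kernel, and `sync_all` stands in for fsync.

```rust
use std::fs::{self, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // A userland buffer standing in for Pebble's WAL write buffer.
    let path = std::env::temp_dir()
        .join(format!("wal-demo-{}", std::process::id()));
    let mut w = BufWriter::new(File::create(&path)?);
    w.write_all(b"canary")?;

    // Option 3: nothing flushed. A SIGKILL here loses the bytes,
    // because they exist only in this process's memory.
    assert_eq!(fs::read(&path)?.len(), 0);

    // Option 2: flushed to the kernel. The bytes now survive this
    // process dying, but not the OS crashing.
    w.flush()?;
    assert_eq!(fs::read(&path)?, b"canary".to_vec());

    // Option 1: fsync'd. The bytes survive an OS crash too.
    w.get_ref().sync_all()?;

    fs::remove_file(&path)?;
    Ok(())
}
```

In these terms, the PR keeps option 3 during the test but stops killing the process abruptly, giving CockroachDB the chance to reach durable state itself during graceful shutdown.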

@smklein smklein disabled auto-merge March 19, 2026 23:14
Comment on lines +70 to +73
# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
# but it's future-proofing. omicron-test-utils has conditions where it uses
# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
# so it can wait "as long as the test runner decides is reasonable".
Collaborator

I found this confusing only because the first sentence seems parenthetical, not the main thrust of the comment. Take it or leave it (or rewrite it):

Suggested change
# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
# but it's future-proofing. omicron-test-utils has conditions where it uses
# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
# so it can wait "as long as the test runner decides is reasonable".
# `omicron-test-utils` is included in this block that uses a longer-than-normal timeout because under some conditions it waits for CockroachDB to shut down gracefully. Note that omicron-test-utils currently depends on omicron-nexus, so it's redundant to include both, but we do it for futureproofing.

Collaborator Author

Re-worded in d659777!

Comment on lines +769 to +781

// Poll the child_process until it exits.
//
// Note that nextest has a timeout for tests
// with the filter 'rdeps(omicron-test-utils)'.
//
// As a consequence, we can loop "as long as the test runner
// lets us" waiting for CockroachDB to gracefully terminate. In
// reality, this is typically on the order of ~one second after
// a test failure.
while let Ok(None) = child_process.try_wait() {
std::thread::sleep(std::time::Duration::from_millis(50));
}
Collaborator

Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.

(All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)

Collaborator Author

Good question:

Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.

  • In the SIGKILL approach, in the "successful cleanup case", we have nothing to wait for - we can kill and move on without any worry.
  • However, in the "failure case", where we are trying to get CRDB to flush state to storage, we're really waiting for "the on-disk state of the database to match what was written in the test".

So, if we don't wait:

  • We'll print an error message of "here's how to access the database" while data could still be propagating. In CI, without waiting, I fear that this could mean "we upload a partially written database" to buildomat.
  • It's harder to test this behavior. As currently written, "drop" returning means "SIGTERM was processed". If we removed that, returning from drop would mean "SIGTERM was sent, but you don't know if/when it finished".

By default, I think nextest uses: https://nexte.st/docs/features/slow-tests/#how-nextest-terminates-tests

On timing out, nextest attempts a graceful shutdown: it first sends the SIGTERM signal
to the process group, then waits for a grace period (by default 10 seconds) for the test
to shut down. If the test doesn't shut itself down within that time, nextest sends
SIGKILL (kill -9) to the process group to terminate it immediately.

So: if we send SIGTERM and exit the test immediately:

  • We'd print the error message here
  • Nextest would send SIGTERM to the whole process group (including Cockroach).
  • The test process would probably return quickly, even if the CockroachDB process is still taking a while...
  • ... so the test could "complete" while CockroachDB is still busy writing data.

(All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)

I think this is a fair concern, but if it's any consolation - I think this would actually only be a problem if "CockroachDB terminating via SIGTERM" was blocked on "the test process running async work". It might cause the async runtime of the test runner to stutter, but it shouldn't hang forever in this case, as long as the "SIGTERM" means "CRDB will eventually shut down".
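The mechanism under discussion (send SIGTERM on drop, then poll `try_wait` until the child actually exits) can be sketched in isolation. This `GracefulChild` wrapper is hypothetical, not the omicron implementation, and it shells out to the `kill` binary rather than taking a libc dependency:

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical wrapper: on drop, ask the child to shut down gracefully
// with SIGTERM, then poll until it has actually exited. In CI, the
// nextest timeout bounds how long this loop can spin.
struct GracefulChild(Child);

impl Drop for GracefulChild {
    fn drop(&mut self) {
        let _ = Command::new("kill")
            .arg("-TERM")
            .arg(self.0.id().to_string())
            .status();
        // Returning from drop means "the child has exited", not merely
        // "the signal was sent".
        while let Ok(None) = self.0.try_wait() {
            sleep(Duration::from_millis(50));
        }
    }
}

fn main() {
    // `sleep 30` exits promptly on SIGTERM, so drop returns quickly.
    let child = Command::new("sleep").arg("30").spawn().unwrap();
    let start = Instant::now();
    drop(GracefulChild(child));
    assert!(start.elapsed() < Duration::from_secs(10));
}
```

Without a SIGKILL fallback, a child that ignores SIGTERM would pin this loop until the test runner's own grace period expires, which is exactly the trade-off debated above.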

Collaborator

For what it's worth, SIGKILL is also not processed synchronously and a process can even hang trying to exit, so it was already true that we may print the message and exit the test while the process was still running. (I hope Cockroach has safeguards against starting a second database while a previous one is running but I don't actually know.) SIGTERM certainly increases the expected length of this period.

The double SIGTERM could be problematic ... I believe some software will abandon the graceful exit if you hit it with SIGTERM a second time.

Fair enough to go with waiting here and hoping for the best. I think the risk is basically a hang, delay, CI runtime impact, or an explicit (if confusing) error message, not like data corruption or a false test success or anything like that.

Comment on lines +763 to +764
// write buffer. SIGTERM triggers Pebble's DB.Close() which flushes
// and fsyncs the WAL before exiting.
Collaborator

Do you have a source reference for this we could add to the comment? I was under the impression that this was our hope, but wasn't guaranteed by any interface or documentation. It's good to know if we've at least verified by code inspection that it's true.

Collaborator Author

(reply with source references, not captured in this export)
Collaborator

Cool... that's a bit much to put into the comment but I think it'd be worth providing some reference. Maybe just the first one?

Collaborator Author

Sure, added a link.

Comment on lines +1965 to +1967
// Phase 1: Write enough data to overflow Pebble's internal WAL
// write buffer, ensuring it gets write()-ed to the kernel and
// is durable on disk.
Collaborator

Why does the amount of data matter at this step?

Collaborator Author

I just want to make sure something is stored in memory that needs to be sync'd. But it's true that the volume actually doesn't matter - I can still get a reliable failure with SIGKILL that doesn't happen with SIGTERM.

Updated in c078b96 to be much simpler.

Collaborator

Okay, if that still fails without the fix then that looks good. (It made sense to me before that you did a ton of work in phase 2 and then wrote the canary. I was only confused why phase 1 also needed to do a lot of work if phase 2 was already doing that.)

Collaborator Author

Yup, still fails before / passes after the drop changes.

# so it can wait "as long as the test runner decides is reasonable".
# `omicron-test-utils` is included in this block that uses a longer-than-normal
# timeout because under some conditions it waits for CockroachDB to shut down
# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
Collaborator

Suggested change
# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
# gracefully. Note that `omicron-test-utils` depends on `omicron-nexus`, so this

Collaborator Author

Updated



Development

Successfully merging this pull request may close these issues.

CockroachDB's disable_synchronization_unsafe in tests causes unsynchronized writes

3 participants