Make sure we still do writeback on test failure#10097

Open
smklein wants to merge 8 commits into main from baby-write-back

Conversation

@smklein
Collaborator

@smklein smklein commented Mar 19, 2026

Uses SIGTERM for CockroachInstances that are dropped without being cleaned up, so that in-flight writes are propagated back to durable storage for inspection.

Additionally, adds a test which validates this behavior (torn writes should no longer appear with the new SIGTERM-on-drop implementation).

Fixes #10085

// process has exited. This lets us poll with a
// timeout and fall back to SIGKILL.
let deadline = std::time::Instant::now()
    + std::time::Duration::from_secs(5);
Collaborator Author

Open for discussion on how long this should plausibly be. Could be forever, if we don't want to have the SIGKILL fallback below!

Collaborator Author

Ehh, I just removed this for now. One less diverging path: opting not to SIGKILL after SIGTERM.

@smklein smklein marked this pull request as ready for review March 19, 2026 17:12
//
// See: https://github.com/oxidecomputer/omicron/issues/10085
#[tokio::test]
async fn test_data_survives_drop_without_cleanup() {
Collaborator Author

This test reliably fails without the changes to drop.

(If you want to verify this: check out this PR, undo the changes to drop, and it'll fail. This was a prerequisite before I started my implementation -- I wanted to reliably hit the issue @davepacheco saw, and have confidence the SIGTERM approach would fix it.)

@smklein smklein enabled auto-merge (squash) March 19, 2026 23:13
@sunshowers
Contributor

sunshowers commented Mar 19, 2026

You may want to wait for Dave to accept this too, but I'm fine with this.

More background. There are three options:

  1. flush WAL buffer on write, then fsync it
  2. flush WAL buffer on write, but do not fsync
  3. do not flush buffers

1 protects against both crdb and the OS crashing, 2 protects against crdb suddenly terminating but not the OS crashing, and 3 does not protect against either.

With crdb, the only options currently available are 1 and 3. 2 is ideal but would require a functional change to CockroachDB (I believe). What we're doing with this change is option 3, plus not suddenly terminating the cockroach process on drop. We would still be susceptible to issues from either the cockroachdb process segfaulting or something else sending it a SIGKILL, though.
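The three durability options above can be illustrated with a minimal, std-only Rust analogy. This is not CockroachDB code: a `BufWriter` stands in for Pebble's userland WAL buffer, `flush` stands in for handing bytes to the kernel, and `sync_all` stands in for fsync.

```rust
use std::fs::{self, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // A userland buffer standing in for Pebble's WAL write buffer.
    let path = std::env::temp_dir()
        .join(format!("wal-demo-{}", std::process::id()));
    let mut w = BufWriter::new(File::create(&path)?);
    w.write_all(b"canary")?;

    // Option 3: nothing flushed. A SIGKILL here loses the bytes,
    // because they exist only in this process's memory.
    assert_eq!(fs::read(&path)?.len(), 0);

    // Option 2: flushed to the kernel. The bytes now survive this
    // process dying, but not the OS crashing.
    w.flush()?;
    assert_eq!(fs::read(&path)?, b"canary".to_vec());

    // Option 1: fsync'd. The bytes survive an OS crash too.
    w.get_ref().sync_all()?;

    fs::remove_file(&path)?;
    Ok(())
}
```

In these terms, the PR keeps option 3 during the test but stops killing the process abruptly, giving CockroachDB the chance to reach durable state itself during graceful shutdown.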

@smklein smklein disabled auto-merge March 19, 2026 23:14
Comment on lines +70 to +73
# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
# but it's future-proofing. omicron-test-utils has conditions where it uses
# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
# so it can wait "as long as the test runner decides is reasonable".
Collaborator

I found this confusing only because the first sentence seems parenthetical, not the main thrust of the comment. Take it or leave it (or rewrite it):

Suggested change
# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
# but it's future-proofing. omicron-test-utils has conditions where it uses
# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
# so it can wait "as long as the test runner decides is reasonable".
# `omicron-test-utils` is included in this block that uses a longer-than-normal timeout because under some conditions it waits for CockroachDB to shut down gracefully. Note that omicron-test-utils currently depends on omicron-nexus, so it's redundant to include both, but we do it for futureproofing.

Collaborator Author

Re-worded in d659777!

Comment on lines +769 to +781

// Poll the child_process until it exits.
//
// Note that nextest has a timeout for tests
// with the filter 'rdeps(omicron-test-utils)'.
//
// As a consequence, we can loop "as long as the test runner
// lets us" waiting for CockroachDB to gracefully terminate. In
// reality, this is typically on the order of ~one second after
// a test failure.
while let Ok(None) = child_process.try_wait() {
std::thread::sleep(std::time::Duration::from_millis(50));
}
Collaborator

Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.

(All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)

Collaborator Author

Good question:

Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.

  • In the SIGKILL approach, in the "successful cleanup case", we have nothing to wait for - we can kill and move on without any worry.
  • However, in the "failure case", where we are trying to get CRDB to flush state to storage, we're really waiting for "the on-disk state of the database to match what was written in the test".

So, if we don't wait:

  • We'll print an error message of "here's how to access the database" while data could still be propagating. In CI, without waiting, I fear that this could mean "we upload a partially written database" to buildomat.
  • It's harder to test this behavior. As currently written, "drop" returning means "SIGTERM was processed". If we removed that, returning from drop would mean "SIGTERM was sent, but you don't know if/when it finished".

By default, I think nextest uses: https://nexte.st/docs/features/slow-tests/#how-nextest-terminates-tests

On timing out, nextest attempts a graceful shutdown: it first sends the SIGTERM signal
to the process group, then waits for a grace period (by default 10 seconds) for the test
to shut down. If the test doesn't shut itself down within that time, nextest sends
SIGKILL (kill -9) to the process group to terminate it immediately.

So: if we send SIGTERM and exit the test immediately:

  • We'd print the error message here
  • Nextest would send SIGTERM to the whole process group (including Cockroach).
  • The test process would probably return quickly, even if the CockroachDB process is still taking a while...
  • ... so the test could "complete" while CockroachDB is still busy writing data.

(All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)

I think this is a fair concern, but if it's any consolation - I think this would actually only be a problem if "CockroachDB terminating via SIGTERM" was blocked on "the test process running async work". It might cause the async runtime of the test runner to stutter, but it shouldn't hang forever in this case, as long as the "SIGTERM" means "CRDB will eventually shut down".
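The mechanism under discussion (send SIGTERM on drop, then poll `try_wait` until the child actually exits) can be sketched in isolation. This `GracefulChild` wrapper is hypothetical, not the omicron implementation, and it shells out to the `kill` binary rather than taking a libc dependency:

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical wrapper: on drop, ask the child to shut down gracefully
// with SIGTERM, then poll until it has actually exited. In CI, the
// nextest timeout bounds how long this loop can spin.
struct GracefulChild(Child);

impl Drop for GracefulChild {
    fn drop(&mut self) {
        let _ = Command::new("kill")
            .arg("-TERM")
            .arg(self.0.id().to_string())
            .status();
        // Returning from drop means "the child has exited", not merely
        // "the signal was sent".
        while let Ok(None) = self.0.try_wait() {
            sleep(Duration::from_millis(50));
        }
    }
}

fn main() {
    // `sleep 30` exits promptly on SIGTERM, so drop returns quickly.
    let child = Command::new("sleep").arg("30").spawn().unwrap();
    let start = Instant::now();
    drop(GracefulChild(child));
    assert!(start.elapsed() < Duration::from_secs(10));
}
```

Without a SIGKILL fallback, a child that ignores SIGTERM would pin this loop until the test runner's own grace period expires, which is exactly the trade-off debated above.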

Collaborator

For what it's worth, SIGKILL is also not processed synchronously and a process can even hang trying to exit, so it was already true that we may print the message and exit the test while the process was still running. (I hope Cockroach has safeguards against starting a second database while a previous one is running but I don't actually know.) SIGTERM certainly increases the expected length of this period.

The double SIGTERM could be problematic ... I believe some software will abandon the graceful exit if you hit it with SIGTERM a second time.

Fair enough to go with waiting here and hoping for the best. I think the risk is basically a hang, delay, CI runtime impact, or an explicit (if confusing) error message, not like data corruption or a false test success or anything like that.

Comment on lines +763 to +764
// write buffer. SIGTERM triggers Pebble's DB.Close() which flushes
// and fsyncs the WAL before exiting.
Collaborator

Do you have a source reference for this we could add to the comment? I was under the impression that this was our hope, but wasn't guaranteed by any interface or documentation. It's good to know if we've at least verified by code inspection that it's true.

Collaborator Author

(reply with source references, not captured in this export)
Collaborator

Cool... that's a bit much to put into the comment but I think it'd be worth providing some reference. Maybe just the first one?

Collaborator Author

Sure, added a link.

Comment on lines +1965 to +1967
// Phase 1: Write enough data to overflow Pebble's internal WAL
// write buffer, ensuring it gets write()-ed to the kernel and
// is durable on disk.
Collaborator

Why does the amount of data matter at this step?

Collaborator Author

I just want to make sure something is stored in memory that needs to be sync'd. But it's true that the volume actually doesn't matter - I can still get a reliable failure with SIGKILL that doesn't happen with SIGTERM.

Updated in c078b96 to be much simpler.

Collaborator

Okay, if that still fails without the fix then that looks good. (It made sense to me before that you did a ton of work in phase 2 and then wrote the canary. I was only confused why phase 1 also needed to do a lot of work if phase 2 was already doing that.)

Collaborator Author

Yup, still fails before / passes after the drop changes.

# so it can wait "as long as the test runner decides is reasonable".
# `omicron-test-utils` is included in this block that uses a longer-than-normal
# timeout because under some conditions it waits for CockroachDB to shut down
# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
Collaborator

Suggested change
# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
# gracefully. Note that `omicron-test-utils` depends on `omicron-nexus`, so this

Collaborator Author

Updated



Development

Successfully merging this pull request may close these issues.

CockroachDB's disable_synchronization_unsafe in tests causes unsynchronized writes

3 participants