Make sure we still do writeback on test failure #10097
test-utils/src/dev/db.rs
Outdated
```rust
// process has exited. This lets us poll with a
// timeout and fall back to SIGKILL.
let deadline = std::time::Instant::now()
    + std::time::Duration::from_secs(5);
```
Open for discussion on how long this should plausibly be. Could be forever, if we don't want to have the SIGKILL fallback below!
ehhh I just removed this for now. One less diverging path, opting to not SIGKILL after SIGTERM.
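To make the final shape concrete, here is a minimal sketch of a `Drop` impl that sends SIGTERM and then waits, with no SIGKILL fallback. The struct layout and use of the `nix` crate are assumptions for illustration, not the exact code in this PR:

```rust
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use std::process::Child;

// Hypothetical stand-in for the real type in test-utils/src/dev/db.rs.
struct CockroachInstance {
    child_process: Option<Child>,
}

impl Drop for CockroachInstance {
    fn drop(&mut self) {
        if let Some(child) = self.child_process.as_mut() {
            // Ask CockroachDB to drain and flush gracefully via SIGTERM.
            let _ = kill(Pid::from_raw(child.id() as i32), Signal::SIGTERM);
            // No SIGKILL fallback: block until the process exits, for as
            // long as the test runner lets us.
            let _ = child.wait();
        }
    }
}
```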
```rust
//
// See: https://github.com/oxidecomputer/omicron/issues/10085
#[tokio::test]
async fn test_data_survives_drop_without_cleanup() {
```
This test reliably fails without the changes to drop.
(If you want to verify this: check out this PR, undo the changes to drop, it'll fail. This was a pre-requisite before I started my implementation -- I wanted to reliably hit the issue @davepacheco saw, and have confidence the SIGTERM approach would fix it)
You may want to wait for Dave to accept this too, but I'm fine with this. More background: there are three options.
Option 1 protects against both crdb and the OS crashing, option 2 protects against crdb suddenly terminating but not against the OS crashing, and option 3 protects against neither. With crdb, the only options currently available are 1 and 3. Option 2 is ideal but would require a functional change to CockroachDB (I believe). What we're doing with this change is option 3, plus not suddenly terminating the cockroach process on drop. We would still be susceptible to issues from either the cockroachdb process segfaulting or something else sending it a SIGKILL, though.
.config/nextest.toml
Outdated
```toml
# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
# but it's future-proofing. omicron-test-utils has conditions where it uses
# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
# so it can wait "as long as the test runner decides is reasonable".
```
I found this confusing only because the first sentence seems parenthetical, not the main thrust of the comment. Take it or leave it (or rewrite it):
```diff
-# omicron-nexus depends on omicron-test-utils, so this is a bit redundant,
-# but it's future-proofing. omicron-test-utils has conditions where it uses
-# SIGTERM to await CockroachDB graceful shutdown - we use a nextest timeout
-# so it can wait "as long as the test runner decides is reasonable".
+# `omicron-test-utils` is included in this block that uses a longer-than-normal
+# timeout because under some conditions it waits for CockroachDB to shut down
+# gracefully. Note that omicron-test-utils currently depends on omicron-nexus,
+# so it's redundant to include both, but we do it for futureproofing.
```
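For context, a nextest override of this shape might look like the following. The exact durations here are assumptions for illustration, not necessarily what the PR's `.config/nextest.toml` contains (only the `rdeps(omicron-test-utils)` filter is mentioned in the thread):

```toml
# Hypothetical example of a per-test-group timeout override; real values
# in the PR may differ.
[[profile.default.overrides]]
filter = 'rdeps(omicron-test-utils)'
# Mark tests slow after 60s; after 3 slow periods, nextest sends SIGTERM,
# then SIGKILL once the grace period expires.
slow-timeout = { period = "60s", terminate-after = 3 }
```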
```rust
// Poll the child_process until it exits.
//
// Note that nextest has a timeout for tests
// with the filter 'rdeps(omicron-test-utils)'.
//
// As a consequence, we can loop "as long as the test runner
// lets us" waiting for CockroachDB to gracefully terminate. In
// reality, this is typically on the order of ~one second after
// a test failure.
while let Ok(None) = child_process.try_wait() {
    std::thread::sleep(std::time::Duration::from_millis(50));
}
```
Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.
(All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)
Good question:

> Why add the wait? We weren't waiting before when we hit it with SIGKILL, even though it doesn't exit synchronously -- or even necessarily at all.
- In the SIGKILL approach, in the "successful cleanup case", we have nothing to wait for - we can kill and move on without any worry.
- However, in the "failure case", where we are trying to get CRDB to flush state to storage, we're really waiting for "the on-disk state of the database to match what was written in the test".
So, if we don't wait:
- We'll print an error message of "here's how to access the database" while data could still be propagating. In CI, without waiting, I fear that this could mean "we upload a partially written database" to buildomat.
- It's harder to test this behavior. As currently written, "drop" returning means "SIGTERM was processed". If we removed that, returning from drop would mean "SIGTERM was sent, but you don't know if/when it finished."
By default, I think nextest uses: https://nexte.st/docs/features/slow-tests/#how-nextest-terminates-tests
> On timing out, nextest attempts a graceful shutdown: it first sends the SIGTERM signal
> to the process group, then waits for a grace period (by default 10 seconds) for the test
> to shut down. If the test doesn't shut itself down within that time, nextest sends
> SIGKILL (kill -9) to the process group to terminate it immediately.
So: if we send SIGTERM and exit the test immediately:
- We'd print the error message here
- Nextest would send SIGTERM to the whole process group (including Cockroach).
- The test process would probably return quickly, even if CockroachDB process is still taking a while...
- ... so the test could "complete" while CockroachDB is still busy writing data.
> (All things being equal, this seems good, but it also seems dicey to do from a Drop handler, especially when run on an async worker thread. I'm worried it would break things -- maybe in some futurelock-type way. I guess we could try it and flush out any such issues. But it seems unrelated to the change here.)
I think this is a fair concern, but if it's any consolation - I think this would actually only be a problem if "CockroachDB terminating via SIGTERM" was blocked on "the test process running async work". It might cause the async runtime of the test runner to stutter, but it shouldn't hang forever in this case, as long as the "SIGTERM" means "CRDB will eventually shut down".
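As a hypothetical mitigation for the async-worker concern (not something this PR does): if drop can run on a Tokio multi-thread worker, the blocking poll could be wrapped in `tokio::task::block_in_place` so the runtime can shift other tasks off that worker first. A sketch, assuming a multi-threaded runtime:

```rust
use std::process::Child;
use std::time::Duration;

// Sketch only: poll `child` to completion, yielding the current Tokio
// worker to other tasks first if we happen to be on a runtime thread.
// Assumes the multi-thread runtime; `block_in_place` panics on a
// current-thread runtime.
fn wait_for_exit(child: &mut Child) {
    let mut poll = || {
        while let Ok(None) = child.try_wait() {
            std::thread::sleep(Duration::from_millis(50));
        }
    };
    if tokio::runtime::Handle::try_current().is_ok() {
        tokio::task::block_in_place(poll);
    } else {
        poll();
    }
}
```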
For what it's worth, SIGKILL is also not processed synchronously and a process can even hang trying to exit, so it was already true that we may print the message and exit the test while the process was still running. (I hope Cockroach has safeguards against starting a second database while a previous one is running but I don't actually know.) SIGTERM certainly increases the expected length of this period.
The double SIGTERM could be problematic ... I believe some software will abandon the graceful exit if you hit it with SIGTERM a second time.
Fair enough to go with waiting here and hoping for the best. I think the risk is basically a hang, delay, CI runtime impact, or an explicit (if confusing) error message, not like data corruption or a false test success or anything like that.
```rust
// write buffer. SIGTERM triggers Pebble's DB.Close() which flushes
// and fsyncs the WAL before exiting.
```
Do you have a source reference for this we could add to the comment? I was under the impression that this was our hope, but wasn't guaranteed by any interface or documentation. It's good to know if we've at least verified by code inspection that it's true.
- SIGTERM signal starts drain: https://github.com/oxidecomputer/cockroach/blob/release-22.1-oxide/pkg/cli/start_unix.go#L37, which eventually calls `stopper.Stop()`
- The "Stopper" closes the storage engine: https://github.com/oxidecomputer/cockroach/blob/release-22.1-oxide/pkg/server/server.go#L205
- Which calls `db.Close()` on Pebble: https://github.com/oxidecomputer/cockroach/blob/release-22.1-oxide/pkg/storage/pebble.go#L1018
- Pebble's `DB.Close()` closes the WAL: https://github.com/oxidecomputer/pebble/blob/06125d28ca5b/db.go#L1121 + https://github.com/oxidecomputer/pebble/blob/06125d28ca5b/db.go#L1157
- `LogWriter.Close()` flushes + fsyncs the WAL: https://github.com/oxidecomputer/pebble/blob/06125d28ca5b/record/log_writer.go#L534

That calls "fsync" if `w.s` is not nil, which is asserted in https://github.com/oxidecomputer/pebble/blob/06125d28ca5b/record/log_writer.go#L292, when creating the LogWriter.
Cool... that's a bit much to put into the comment but I think it'd be worth providing some reference. Maybe just the first one?
Sure, added a link.
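The resulting comment presumably reads something like this sketch; the thread doesn't show the final wording, and the choice of link is an assumption based on the discussion above:

```rust
// write buffer. SIGTERM triggers Pebble's DB.Close() which flushes
// and fsyncs the WAL before exiting. See:
// https://github.com/oxidecomputer/cockroach/blob/release-22.1-oxide/pkg/cli/start_unix.go#L37
```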
test-utils/src/dev/db.rs
Outdated
```rust
// Phase 1: Write enough data to overflow Pebble's internal WAL
// write buffer, ensuring it gets write()-ed to the kernel and
// is durable on disk.
```
Why does the amount of data matter at this step?
I just want to make sure something is stored in memory that needs to be sync'd. But it's true that volume actually doesn't matter - I can still get a reliable failure with SIGKILL that doesn't happen with SIGTERM.
Updated in c078b96 to be much simpler.
Okay, if that still fails without the fix then that looks good. (It made sense to me before that you did a ton of work in phase 2 and then wrote the canary. I was only confused why phase 1 also needed to do a lot of work if phase 2 was already doing that.)
Yup, still fails before / passes after the drop changes
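To make the shape of this test concrete, here is a rough sketch based on the thread's description. The helpers `start_cockroach` and `connect` are hypothetical stand-ins, not the actual test-utils API:

```rust
// Sketch of the test's structure; `start_cockroach` and `connect` are
// hypothetical stand-ins for the real test-utils helpers.
#[tokio::test]
async fn test_data_survives_drop_without_cleanup() {
    let dir = tempfile::tempdir().unwrap();

    // Phase 1: start CockroachDB, write a canary row, then drop the
    // instance *without* calling cleanup(). Drop sends SIGTERM, which
    // should flush and fsync the WAL before the process exits.
    {
        let db = start_cockroach(dir.path()).await;
        let client = connect(&db).await;
        client
            .batch_execute(
                "CREATE TABLE canary (v TEXT); \
                 INSERT INTO canary VALUES ('alive')",
            )
            .await
            .unwrap();
        // `db` dropped here without cleanup.
    }

    // Phase 2: restart against the same data directory and verify the
    // canary row survived (per the thread, it reliably does not with
    // the old SIGKILL-on-drop behavior).
    let db = start_cockroach(dir.path()).await;
    let client = connect(&db).await;
    let rows = client.query("SELECT v FROM canary", &[]).await.unwrap();
    assert_eq!(rows.len(), 1);
}
```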
.config/nextest.toml
Outdated
```toml
# so it can wait "as long as the test runner decides is reasonable".
# `omicron-test-utils` is included in this block that uses a longer-than-normal
# timeout because under some conditions it waits for CockroachDB to shut down
# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
```
```diff
-# gracefully. Not that `omicron-test-utils` depends on `omicron-nexus`, so this
+# gracefully. Note that `omicron-test-utils` depends on `omicron-nexus`, so this
```
Uses SIGTERM for `CockroachInstance`s that are dropped without being cleaned, to propagate ongoing writes back to durable storage for inspection.
Additionally, adds a test which validates this behavior (torn writes should no longer appear with the new SIGTERM-on-drop implementation).
Fixes #10085