
Reconfigurator: Add live test that executes expunged zones that were never in service (#10081)

Open
jgallagher wants to merge 14 commits into main from john/execute-expunged-live-test

Conversation

@jgallagher (Contributor)

This is an attempt to catch other issues like #10025; it implements the reproduction steps described there as a live test, applied to most zone types rather than just Nexus. (We skip multi-node ClickHouse, because it isn't deployed by default, and internal DNS, because the planner can't replace it without execution running first anyway.)

This took a bunch of tries to get passing, and I'd be very unsurprised if there are other kinds of flakes still lurking here. We're not running live tests as a part of CI, so I'm not sure how worried to be about this.

Includes a few bits of live test housekeeping (updates to the README and racklette serial number sets).

@jgallagher (Contributor, Author)

Running this test on london against a branch that did not have the fix for #10025, we see the failure we'd expect: the test times out waiting for blueprint execution to succeed:

  stderr ---
    log file: /var/tmp/test_execute_expunged_zone-78ccd669b96cb1ec-test_execute_expunged_zone.11905.0.log
    note: configured to log to "/var/tmp/test_execute_expunged_zone-78ccd669b96cb1ec-test_execute_expunged_zone.11905.0.log"
    note: using DNS from system config (typically /etc/resolv.conf)

    thread 'test_execute_expunged_zone' (2) panicked at live-tests/tests/test_execute_expunged_zone.rs:367:6:
    waited for successful execution: TimedOut(180.373621863s)
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and the log file shows the exact error we saw in #10025:

17:21:17.796Z WARN test_execute_expunged_zone: execution had an error
    error = step failed: Ensure external networking resources: Internal Error: unexpected database error: Record not found

Running against a branch that does have that fix (as well as #10072, which fell out of developing this test), we get a pass but it takes a while; most of this time is waiting for cockroach to be healthy again after expunging one of its nodes:

    Starting 1 test across 3 binaries (2 tests skipped)
        SLOW [> 60.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>120.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>180.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>240.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>300.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>360.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>420.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>480.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>540.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>600.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>660.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>720.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>780.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>840.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>900.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        PASS [ 949.973s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
------------
     Summary [ 949.974s] 1 test run: 1 passed (1 slow), 2 skipped

) -> anyhow::Result<(OpContext, Arc<DataStore>)> {
let log = &self.logctx.log;
let datastore = create_datastore(log, &self.resolver).await?;
let opctx = OpContext::for_tests(log.clone(), datastore.clone());
Collaborator:

I was surprised this returned a new OpContext, too. Is that because the older one is still connected to the previous datastore which is still trying to talk to now-missing Cockroach zones?


Ensure the system's target blueprint is enabled. The live tests require this to avoid a case where they generate blueprints based on a stale target blueprint and then make a bunch of changes to the system unrelated to the tests.

On a fresh system, you will have to enable the target blueprint yourself:
Collaborator:

I take it that you don't need to enable the target blueprint yourself any more?

blueprint_edit_current_target_impl(log, nexus, true, edit_fn).await
}

/// Modify the system by editing the current target blueprint
Collaborator:

Suggested change:
- /// Modify the system by editing the current target blueprint
+ /// Modify the system by editing the current target blueprint, verifying that the current target is disabled

Ok(blueprint)
}

/// Modify the system by editing the current target blueprint
Collaborator:

Suggested change:
+ /// Modify the system by editing the current target blueprint, verifying that the current target is enabled

use update_engine::events::EventReport;
use update_engine::events::StepOutcome;

#[live_test]
Collaborator:

Maybe document the scope of what this test is trying to cover?

Maybe something like:

For each type of zone, test the behavior of executing a blueprint with that zone expunged when we never executed a blueprint with that zone in-service. This covers edge cases during zone cleanup (where we might try to clean up resources that were never actually set up).

Comment on lines +51 to +73
    // Safety check: If running this test multiple times, we may leave behind
    // underreplicated cockroach ranges. We shouldn't attempt to proceed if
    // those haven't been repaired yet. Get the latest inventory collection and
    // check.
    match datastore.inventory_get_latest_collection(opctx).await {
        Ok(Some(collection)) => {
            if let Err(err) =
                validate_cockroach_is_healthy_according_to_inventory(
                    &collection,
                )
            {
                panic!("refusing to run live test: {err:#}");
            }
        }
        Ok(None) => panic!(
            "refusing to run live test: no inventory collections exist yet"
        ),
        Err(err) => panic!(
            "refusing to run live test: \
             error fetching inventory collection: {}",
            InlineErrorChain::new(&err),
        ),
    }
Collaborator:

I wonder if this should be a helper in LiveTestContext (or elsewhere) that gets invoked in all tests. (Basically, what if some other test runs after this one and runs into this?)

Also, should this be in a wait_for_condition to avoid flaking?
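The `wait_for_condition` suggestion boils down to a poll-until-ready loop around the health check. Here is a minimal, self-contained sketch of that pattern; the `wait_for` helper and this two-variant `CondCheckError` are illustrative stand-ins, not omicron's actual API:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Illustrative stand-in for the NotYet/Failed split used by the test's
// condition checks; not omicron's real `CondCheckError`.
#[derive(Debug, PartialEq)]
enum CondCheckError {
    NotYet,
    Failed(String),
}

// Poll `check` until it succeeds, fails permanently, or the timeout expires.
fn wait_for<F>(
    mut check: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<(), String>
where
    F: FnMut() -> Result<(), CondCheckError>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match check() {
            Ok(()) => return Ok(()),
            Err(CondCheckError::Failed(msg)) => return Err(msg),
            Err(CondCheckError::NotYet) => {
                if Instant::now() >= deadline {
                    return Err("timed out".to_string());
                }
                sleep(poll_interval);
            }
        }
    }
}

fn main() {
    // A check that only passes on the third poll, standing in for
    // "inventory says the cockroach ranges are fully replicated again".
    let mut attempts = 0;
    let result = wait_for(
        || {
            attempts += 1;
            if attempts >= 3 { Ok(()) } else { Err(CondCheckError::NotYet) }
        },
        Duration::from_millis(1),
        Duration::from_secs(1),
    );
    assert!(result.is_ok());
    assert_eq!(attempts, 3);

    // A permanent failure short-circuits instead of polling to timeout.
    let failed = wait_for(
        || Err(CondCheckError::Failed("unrepairable".to_string())),
        Duration::from_millis(1),
        Duration::from_secs(1),
    );
    assert_eq!(failed, Err("unrepairable".to_string()));
    println!("ok");
}
```

Wrapped this way, the safety check would return `NotYet` while ranges are still underreplicated rather than panicking on a transiently stale inventory collection.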

Comment on lines +83 to +87
// Skip internal DNS for now - our test relies on the planner
// immediately replacing a zone we expunge. The planner (correctly!)
// does not do this for internal DNS: it has to wait for inventory
// to reflect that the zone has been expunged, which can't happen
// during our test because we've disabled execution.
Collaborator:

Does the planner avoid placing a new internal DNS zone until the old one is definitely gone to avoid its IP address being in two places at once?

// exists and should be put into service

// Disable planning and execution.
disable_blueprint_planning(nexus, log).await;
Collaborator:

You updated the README to say that the user should have already done this. Does that mean we can skip this here? (Do we also check that it is disabled when we start the live test context? Feels like we should.)

Comment on lines +276 to +367
    let status = match nexus.bgtask_view("blueprint_executor").await {
        Ok(task_view) => task_view.into_inner().last,
        Err(err) => {
            // We don't generally expect this to fail, but it could if
            // we're testing Nexus or Cockroach (or recently did), since
            // either of those being expunged may cause transient errors
            // attempting to fetch task status from an arbitrary Nexus
            // present in DNS.
            warn!(
                log,
                "failed to get blueprint_executor status from Nexus";
                InlineErrorChain::new(&err),
            );
            return Err(CondCheckError::NotYet);
        }
    };

    let details = match status {
        LastResult::NeverCompleted => {
            info!(
                log,
                "still waiting for execution: task never completed"
            );
            return Err(CondCheckError::NotYet);
        }
        LastResult::Completed(completed) => completed.details,
    };

    #[derive(Deserialize)]
    struct BlueprintExecutorStatus {
        target_id: BlueprintUuid,
        execution_error: Option<NestedError>,
        event_report: Result<EventReport<NestedSpec>, NestedError>,
    }

    // We won't be able to parse the `details` we expect if the
    // `bgtask_view()` we just executed is still returning the status
    // from a previous execution attempt of a disabled blueprint, so
    // return `NotYet` until we can parse them.
    let details: BlueprintExecutorStatus = match serde_json::from_value(
        details,
    ) {
        Ok(details) => details,
        Err(err) => {
            info!(
                log,
                "still waiting for execution: failed to parse details";
                InlineErrorChain::new(&err),
            );
            return Err(CondCheckError::NotYet);
        }
    };

    // Make sure we're looking at the execution of the blueprint we care
    // about; otherwise, keep waiting.
    if details.target_id != planned_bp_2.id {
        warn!(
            log,
            "still waiting for execution: executed blueprint ID \
             does not match ID we're waiting for";
            "executed_id" => %details.target_id,
            "waiting_for_id" => %planned_bp_2.id,
        );
        return Err(CondCheckError::NotYet);
    }

    // Check that execution completed cleanly.
    if let Some(err) = details.execution_error {
        warn!(
            log, "execution had an error";
            InlineErrorChain::new(&err),
        );
        return Err(CondCheckError::NotYet);
    }

    match details.event_report {
        Ok(event_report) => {
            // If the event report indicates any problems, we'll try
            // again - we expect to see some transient problems and
            // warnings, but also expect them to clear up on their own
            // pretty quickly.
            if event_report_has_problems(event_report, log) {
                Err(CondCheckError::NotYet)
            } else {
                Ok(())
            }
        }
        Err(err) => Err(CondCheckError::Failed(format!(
            "no available event report: {}",
            InlineErrorChain::new(&err),
        ))),
    }
Collaborator:

I usually steer people away from depending on specific executions of a background task and instead towards whatever side effect they care about. In this case I feel like you could wait for inventory collections to reflect that the blueprint zone configs have been propagated to all sleds. (I think we have a helper for that?)

But I guess you're specifically trying to see if execution generated any problems, even if the configs did get propagated? It still feels unfortunate you have to do all this work here (though it was built so that you could do this rigorously, so I guess that's something).

Maybe add a comment then about why we're doing it this way and please don't copy/paste it elsewhere?

Comment on lines +615 to +616
// Wait until the inventory collection is sufficiently new.
if collection.time_started < wait_start_time {
Collaborator:

In principle, this could still flake if systems' clocks are not totally in sync. I'm trying to think if there's some datum in inventory that's more causally related to what you care about. What you're trying to work out here is that the inventory doesn't predate the expungement you did, right? Could you use blueprint_wait_sled_configs_propagated() here? I'm thinking if those have been propagated, then the old cockroach isn't running... but I guess you still don't know for sure that cockroach is reporting itself as unhealthy immediately. This seems tricky.
