
Reconfigurator: Add live test that executes expunged zones that were never in service (#10081)

Open
jgallagher wants to merge 14 commits into main from john/execute-expunged-live-test

Conversation

@jgallagher (Contributor)

This is an attempt to catch other issues like #10025; it implements the reproduction steps described there as a live test, applied to most zone types rather than just Nexus. (We skip multi-node ClickHouse, because it isn't deployed by default, and internal DNS, because the planner can't replace it without execution running first anyway.)

This took a bunch of tries to get passing, and I'd be very unsurprised if there are other kinds of flakes still lurking here. We're not running live tests as a part of CI, so I'm not sure how worried to be about this.

Includes a few bits of live test housekeeping (updates to the README and racklette serial number sets).

@jgallagher (Contributor, Author)

Running this test on london against a branch that did not have the fix for #10025, we see the failure we'd expect: the test times out waiting for blueprint execution to succeed:

  stderr ---
    log file: /var/tmp/test_execute_expunged_zone-78ccd669b96cb1ec-test_execute_expunged_zone.11905.0.log
    note: configured to log to "/var/tmp/test_execute_expunged_zone-78ccd669b96cb1ec-test_execute_expunged_zone.11905.0.log"
    note: using DNS from system config (typically /etc/resolv.conf)

    thread 'test_execute_expunged_zone' (2) panicked at live-tests/tests/test_execute_expunged_zone.rs:367:6:
    waited for successful execution: TimedOut(180.373621863s)
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and the log file shows the exact error we saw in #10025:

17:21:17.796Z WARN test_execute_expunged_zone: execution had an error
    error = step failed: Ensure external networking resources: Internal Error: unexpected database error: Record not found

Running against a branch that does have that fix (as well as #10072, which fell out of developing this test), we get a pass but it takes a while; most of this time is waiting for cockroach to be healthy again after expunging one of its nodes:

    Starting 1 test across 3 binaries (2 tests skipped)
        SLOW [> 60.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>120.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>180.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>240.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>300.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>360.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>420.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>480.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>540.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>600.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>660.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>720.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>780.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>840.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        SLOW [>900.000s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
        PASS [ 949.973s] omicron-live-tests::test_execute_expunged_zone test_execute_expunged_zone
------------
     Summary [ 949.974s] 1 test run: 1 passed (1 slow), 2 skipped

) -> anyhow::Result<(OpContext, Arc<DataStore>)> {
let log = &self.logctx.log;
let datastore = create_datastore(log, &self.resolver).await?;
let opctx = OpContext::for_tests(log.clone(), datastore.clone());
Collaborator:

I was surprised this returned a new OpContext, too. Is that because the older one is still connected to the previous datastore which is still trying to talk to now-missing Cockroach zones?


Ensure the system's target blueprint is enabled. The live tests require this to avoid a case where they generate blueprints based on a stale target blueprint and then make a bunch of changes to the system unrelated to the tests.

On a fresh system, you will have to enable the target blueprint yourself:
Collaborator:

I take it that you don't need to enable the target blueprint yourself any more?

blueprint_edit_current_target_impl(log, nexus, true, edit_fn).await
}

/// Modify the system by editing the current target blueprint
Collaborator:

Suggested change:
- /// Modify the system by editing the current target blueprint
+ /// Modify the system by editing the current target blueprint, verifying that the current target is disabled

Ok(blueprint)
}

/// Modify the system by editing the current target blueprint
Collaborator:

Suggested change:
+ /// Modify the system by editing the current target blueprint, verifying that the current target is enabled

use update_engine::events::EventReport;
use update_engine::events::StepOutcome;

#[live_test]
Collaborator:

Maybe document the scope of what this test is trying to cover?

Maybe something like:

For each type of zone, test the behavior of executing a blueprint with that zone expunged when we never executed a blueprint with that zone in-service. This covers edge cases during zone cleanup (where we might try to clean up resources that were never actually set up).

Comment on lines +51 to +73
    // Safety check: If running this test multiple times, we may leave behind
    // underreplicated cockroach ranges. We shouldn't attempt to proceed if
    // those haven't been repaired yet. Get the latest inventory collection and
    // check.
    match datastore.inventory_get_latest_collection(opctx).await {
        Ok(Some(collection)) => {
            if let Err(err) =
                validate_cockroach_is_healthy_according_to_inventory(
                    &collection,
                )
            {
                panic!("refusing to run live test: {err:#}");
            }
        }
        Ok(None) => panic!(
            "refusing to run live test: no inventory collections exist yet"
        ),
        Err(err) => panic!(
            "refusing to run live test: \
             error fetching inventory collection: {}",
            InlineErrorChain::new(&err),
        ),
    }
Collaborator:

I wonder if this should be a helper in LiveTestContext (or elsewhere) that gets invoked in all tests. (Basically, what if some other test runs after this one and runs into this?)

Also, should this be in a wait_for_condition to avoid flaking?
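The `wait_for_condition` suggestion boils down to a poll-until-ready loop around the health check. Here is a minimal, self-contained sketch of that pattern; the `wait_for` helper and this two-variant `CondCheckError` are illustrative stand-ins, not omicron's actual API:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Illustrative stand-in for the NotYet/Failed split used by the test's
// condition checks; not omicron's real `CondCheckError`.
#[derive(Debug, PartialEq)]
enum CondCheckError {
    NotYet,
    Failed(String),
}

// Poll `check` until it succeeds, fails permanently, or the timeout expires.
fn wait_for<F>(
    mut check: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<(), String>
where
    F: FnMut() -> Result<(), CondCheckError>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match check() {
            Ok(()) => return Ok(()),
            Err(CondCheckError::Failed(msg)) => return Err(msg),
            Err(CondCheckError::NotYet) => {
                if Instant::now() >= deadline {
                    return Err("timed out".to_string());
                }
                sleep(poll_interval);
            }
        }
    }
}

fn main() {
    // A check that only passes on the third poll, standing in for
    // "inventory says the cockroach ranges are fully replicated again".
    let mut attempts = 0;
    let result = wait_for(
        || {
            attempts += 1;
            if attempts >= 3 { Ok(()) } else { Err(CondCheckError::NotYet) }
        },
        Duration::from_millis(1),
        Duration::from_secs(1),
    );
    assert!(result.is_ok());
    assert_eq!(attempts, 3);

    // A permanent failure short-circuits instead of polling to timeout.
    let failed = wait_for(
        || Err(CondCheckError::Failed("unrepairable".to_string())),
        Duration::from_millis(1),
        Duration::from_secs(1),
    );
    assert_eq!(failed, Err("unrepairable".to_string()));
    println!("ok");
}
```

Wrapped this way, the safety check would return `NotYet` while ranges are still underreplicated rather than panicking on a transiently stale inventory collection.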

Comment on lines +83 to +87
// Skip internal DNS for now - our test relies on the planner
// immediately replacing a zone we expunge. The planner (correctly!)
// does not do this for internal DNS: it has to wait for inventory
// to reflect that the zone has been expunged, which can't happen
// during our test because we've disabled execution.
Collaborator:

Does the planner avoid placing a new internal DNS zone until the old one is definitely gone to avoid its IP address being in two places at once?

// exists and should be put into service

// Disable planning and execution.
disable_blueprint_planning(nexus, log).await;
Collaborator:

You updated the README to say that the user should have already done this. Does that mean we can skip this here? (Do we also check that it is disabled when we start the live test context? Feels like we should.)

Comment on lines +276 to +367
    let status = match nexus.bgtask_view("blueprint_executor").await {
        Ok(task_view) => task_view.into_inner().last,
        Err(err) => {
            // We don't generally expect this to fail, but it could if
            // we're testing Nexus or Cockroach (or recently did), since
            // either of those being expunged may cause transient errors
            // attempting to fetch task status from an arbitrary Nexus
            // present in DNS.
            warn!(
                log,
                "failed to get blueprint_executor status from Nexus";
                InlineErrorChain::new(&err),
            );
            return Err(CondCheckError::NotYet);
        }
    };

    let details = match status {
        LastResult::NeverCompleted => {
            info!(
                log,
                "still waiting for execution: task never completed"
            );
            return Err(CondCheckError::NotYet);
        }
        LastResult::Completed(completed) => completed.details,
    };

    #[derive(Deserialize)]
    struct BlueprintExecutorStatus {
        target_id: BlueprintUuid,
        execution_error: Option<NestedError>,
        event_report: Result<EventReport<NestedSpec>, NestedError>,
    }

    // We won't be able to parse the `details` we expect if the
    // `bgtask_view()` we just executed is still returning the status
    // from a previous execution attempt of a disabled blueprint, so
    // return `NotYet` until we can parse them.
    let details: BlueprintExecutorStatus = match serde_json::from_value(
        details,
    ) {
        Ok(details) => details,
        Err(err) => {
            info!(
                log,
                "still waiting for execution: failed to parse details";
                InlineErrorChain::new(&err),
            );
            return Err(CondCheckError::NotYet);
        }
    };

    // Make sure we're looking at the execution of the blueprint we care
    // about; otherwise, keep waiting.
    if details.target_id != planned_bp_2.id {
        warn!(
            log,
            "still waiting for execution: executed blueprint ID \
             does not match ID we're waiting for";
            "executed_id" => %details.target_id,
            "waiting_for_id" => %planned_bp_2.id,
        );
        return Err(CondCheckError::NotYet);
    }

    // Check that execution completed cleanly.
    if let Some(err) = details.execution_error {
        warn!(
            log, "execution had an error";
            InlineErrorChain::new(&err),
        );
        return Err(CondCheckError::NotYet);
    }

    match details.event_report {
        Ok(event_report) => {
            // If the event report indicates any problems, we'll try
            // again - we expect to see some transient problems and
            // warnings, but also expect them to clear up on their own
            // pretty quickly.
            if event_report_has_problems(event_report, log) {
                Err(CondCheckError::NotYet)
            } else {
                Ok(())
            }
        }
        Err(err) => Err(CondCheckError::Failed(format!(
            "no available event report: {}",
            InlineErrorChain::new(&err),
        ))),
    }
Collaborator:

I usually steer people away from depending on specific executions of a background task and instead towards whatever side effect they care about. In this case I feel like you could wait for inventory collections to reflect that the blueprint zone configs have been propagated to all sleds. (I think we have a helper for that?)

But I guess you're specifically trying to see if execution generated any problems, even if the configs did get propagated? It still feels unfortunate you have to do all this work here (though it was built so that you could do this rigorously, so I guess that's something).

Maybe add a comment then about why we're doing it this way and please don't copy/paste it elsewhere?

Comment on lines +615 to +616
// Wait until the inventory collection is sufficiently new.
if collection.time_started < wait_start_time {
Collaborator:

In principle, this could still flake if systems' clocks are not totally in sync. I'm trying to think if there's some datum in inventory that's more causally related to what you care about. What you're trying to work out here is that the inventory doesn't predate the expungement you did, right? Could you use blueprint_wait_sled_configs_propagated() here? I'm thinking if those have been propagated, then the old cockroach isn't running... but I guess you still don't know for sure that cockroach is reporting itself as unhealthy immediately. This seems tricky.
