Reconfigurator: Add live test that executes expunged zones that were never in service #10081

jgallagher wants to merge 14 commits into main from
Conversation
* no need to enable execution by default
* DO need to disable planning by default
Running this test, the log file shows the exact error we saw in #10025. Running against a branch that does have that fix (as well as #10072, which fell out of developing this test), we get a pass, but it takes a while; most of this time is waiting for cockroach to be healthy again after expunging one of its nodes.
```rust
) -> anyhow::Result<(OpContext, Arc<DataStore>)> {
    let log = &self.logctx.log;
    let datastore = create_datastore(log, &self.resolver).await?;
    let opctx = OpContext::for_tests(log.clone(), datastore.clone());
```
I was surprised this returned a new OpContext, too. Is that because the older one is still connected to the previous datastore which is still trying to talk to now-missing Cockroach zones?
> Ensure the system's target blueprint is enabled. The live tests require this to avoid a case where they generate blueprints based on a target blueprint that is not current, and then make a bunch of changes to the system unrelated to the tests.
> On a fresh system, you will have to enable the target blueprint yourself:
I take it that you don't need to enable the target blueprint yourself any more?
```rust
    blueprint_edit_current_target_impl(log, nexus, true, edit_fn).await
}
```
```rust
/// Modify the system by editing the current target blueprint
```

Suggested change:

```rust
/// Modify the system by editing the current target blueprint, verifying that the current target is disabled
```
```rust
    Ok(blueprint)
}
```
```rust
/// Modify the system by editing the current target blueprint
```

Suggested change:

```rust
/// Modify the system by editing the current target blueprint, verifying that the current target is enabled
```
```rust
use update_engine::events::EventReport;
use update_engine::events::StepOutcome;
```
```rust
#[live_test]
```
Maybe document the scope of what this test is trying to cover?
Maybe something like:
> For each type of zone, test the behavior of executing a blueprint with that zone expunged when we never executed a blueprint with that zone in-service. This covers edge cases during zone cleanup (where we might try to clean up resources that were never actually set up).
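To make the suggested scope concrete, here is a minimal self-contained sketch of the "iterate over zone kinds, skipping the ones the test can't cover" shape. The `ZoneKind` enum and `should_test` helper are illustrative names, not the real omicron types:

```rust
// Illustrative only: a stand-in for omicron's zone-kind enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ZoneKind {
    Nexus,
    CockroachDb,
    Crucible,
    InternalDns,
    ClickhouseServer,
}

// Hypothetical filter mirroring the skips described in the PR.
fn should_test(kind: ZoneKind) -> bool {
    match kind {
        // Internal DNS: the planner won't replace it until inventory
        // reflects the expungement, which requires execution to run.
        ZoneKind::InternalDns => false,
        // Multinode clickhouse isn't deployed by default.
        ZoneKind::ClickhouseServer => false,
        _ => true,
    }
}

fn main() {
    let all = [
        ZoneKind::Nexus,
        ZoneKind::CockroachDb,
        ZoneKind::Crucible,
        ZoneKind::InternalDns,
        ZoneKind::ClickhouseServer,
    ];
    let tested: Vec<ZoneKind> =
        all.iter().copied().filter(|k| should_test(*k)).collect();
    // Each remaining kind gets the expunge-then-execute treatment.
    println!("{} kinds covered in this sketch", tested.len());
}
```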
```rust
// Safety check: If running this test multiple times, we may leave behind
// underreplicated cockroach ranges. We shouldn't attempt to proceed if
// those haven't been repaired yet. Get the latest inventory collection and
// check.
match datastore.inventory_get_latest_collection(opctx).await {
    Ok(Some(collection)) => {
        if let Err(err) =
            validate_cockroach_is_healthy_according_to_inventory(
                &collection,
            )
        {
            panic!("refusing to run live test: {err:#}");
        }
    }
    Ok(None) => panic!(
        "refusing to run live test: no inventory collections exist yet"
    ),
    Err(err) => panic!(
        "refusing to run live test: \
         error fetching inventory collection: {}",
        InlineErrorChain::new(&err),
    ),
}
```
I wonder if this should be a helper in LiveTestContext (or elsewhere) that gets invoked in all tests. (Basically, what if some other test runs after this one and runs into this?)
Also, should this be in a wait_for_condition to avoid flaking?
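For the `wait_for_condition` suggestion, a self-contained sketch of the retry shape (the real omicron helper is async and takes poll/timeout durations; this synchronous version with an invented `wait_for` name is just to illustrate why polling avoids the flake):

```rust
use std::time::{Duration, Instant};

// Illustrative stand-in for a wait_for_condition-style helper:
// re-run a check until it passes or a deadline expires, instead of
// panicking on the first unhealthy observation.
fn wait_for<F>(
    mut check: F,
    poll: Duration,
    timeout: Duration,
) -> Result<(), &'static str>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for condition");
        }
        std::thread::sleep(poll);
    }
}

fn main() {
    // Toy condition: "cockroach looks healthy" only after a few polls,
    // standing in for ranges finishing their repair.
    let mut polls = 0;
    let result = wait_for(
        || {
            polls += 1;
            polls >= 3
        },
        Duration::from_millis(1),
        Duration::from_secs(1),
    );
    assert!(result.is_ok());
    println!("healthy after {polls} polls");
}
```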
```rust
// Skip internal DNS for now - our test relies on the planner
// immediately replacing a zone we expunge. The planner (correctly!)
// does not do this for internal DNS: it has to wait for inventory
// to reflect that the zone has been expunged, which can't happen
// during our test because we've disabled execution.
```
Does the planner avoid placing a new internal DNS zone until the old one is definitely gone to avoid its IP address being in two places at once?
```rust
// exists and should be put into service
```

```rust
// Disable planning and execution.
disable_blueprint_planning(nexus, log).await;
```
You updated the README to say that the user should have already done this. Does that mean we can skip this here? (Do we also check that it is disabled when we start the live test context? Feels like we should.)
```rust
let status = match nexus.bgtask_view("blueprint_executor").await {
    Ok(task_view) => task_view.into_inner().last,
    Err(err) => {
        // We don't generally expect this to fail, but it could if
        // we're testing Nexus or Cockroach (or recently did), since
        // either of those being expunged may cause transient errors
        // attempting to fetch task status from an arbitrary Nexus
        // present in DNS.
        warn!(
            log,
            "failed to get blueprint_executor status from Nexus";
            InlineErrorChain::new(&err),
        );
        return Err(CondCheckError::NotYet);
    }
};

let details = match status {
    LastResult::NeverCompleted => {
        info!(
            log,
            "still waiting for execution: task never completed"
        );
        return Err(CondCheckError::NotYet);
    }
    LastResult::Completed(completed) => completed.details,
};

#[derive(Deserialize)]
struct BlueprintExecutorStatus {
    target_id: BlueprintUuid,
    execution_error: Option<NestedError>,
    event_report: Result<EventReport<NestedSpec>, NestedError>,
}

// We won't be able to parse the `details` we expect if the
// `bgtask_view()` we just executed is still returning the status
// from a previous execution attempt of a disabled blueprint, so
// return `NotYet` until we can parse them.
let details: BlueprintExecutorStatus = match serde_json::from_value(
    details,
) {
    Ok(details) => details,
    Err(err) => {
        info!(
            log,
            "still waiting for execution: failed to parse details";
            InlineErrorChain::new(&err),
        );
        return Err(CondCheckError::NotYet);
    }
};

// Make sure we're looking at the execution of the blueprint we care
// about; otherwise, keep waiting.
if details.target_id != planned_bp_2.id {
    warn!(
        log,
        "still waiting for execution: executed blueprint ID \
         does not match ID we're waiting for";
        "executed_id" => %details.target_id,
        "waiting_for_id" => %planned_bp_2.id,
    );
    return Err(CondCheckError::NotYet);
}

// Check that execution completed cleanly.
if let Some(err) = details.execution_error {
    warn!(
        log, "execution had an error";
        InlineErrorChain::new(&err),
    );
    return Err(CondCheckError::NotYet);
}

match details.event_report {
    Ok(event_report) => {
        // If the event report indicates any problems, we'll try
        // again - we expect to see some transient problems and
        // warnings, but also expect them to clear up on their own
        // pretty quickly.
        if event_report_has_problems(event_report, log) {
            Err(CondCheckError::NotYet)
        } else {
            Ok(())
        }
    }
    Err(err) => Err(CondCheckError::Failed(format!(
        "no available event report: {}",
        InlineErrorChain::new(&err),
    ))),
}
```
I usually steer people away from depending on specific executions of a background task and instead towards whatever side effect they care about. In this case I feel like you could wait for inventory collections to reflect that the blueprint zone configs have been propagated to all sleds. (I think we have a helper for that?)
But I guess you're specifically trying to see if execution generated any problems, even if the configs did get propagated? It still feels unfortunate you have to do all this work here (though it was built so that you could do this rigorously, so I guess that's something).
Maybe add a comment, then, about why we're doing it this way, with a note asking people not to copy/paste it elsewhere?
```rust
// Wait until the inventory collection is sufficiently new.
if collection.time_started < wait_start_time {
```
In principle, this could still flake if systems' clocks are not totally in sync. I'm trying to think if there's some datum in inventory that's more causally related to what you care about. What you're trying to work out here is that the inventory doesn't predate the expungement you did, right? Could you use blueprint_wait_sled_configs_propagated() here? I'm thinking if those have been propagated, then the old cockroach isn't running... but I guess you still don't know for sure that cockroach is reporting itself as unhealthy immediately. This seems tricky.
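To spell out the skew concern with toy numbers (all names and values here are illustrative, not real omicron code): if the collector's clock runs behind the clock that recorded `wait_start_time`, a collection that genuinely started after the expungement can still be stamped with an "older" timestamp and fail the freshness check.

```rust
// Mirrors the `collection.time_started < wait_start_time` comparison.
fn looks_stale(collection_time_started: u64, wait_start_time: u64) -> bool {
    collection_time_started < wait_start_time
}

fn main() {
    // Milliseconds since some shared epoch, for illustration only.
    let local_wait_start: u64 = 10_000;
    // Suppose the collector's clock runs 500ms behind the local clock.
    let collector_skew: u64 = 500;
    // A collection that really started 200ms *after* we began waiting...
    let real_start = local_wait_start + 200;
    // ...gets stamped with the collector's skewed clock:
    let stamped_start = real_start - collector_skew;
    // The freshness check wrongly classifies the fresh collection as
    // stale, which is the flake: the test keeps waiting on (or worse,
    // trusts) the wrong collection.
    assert!(looks_stale(stamped_start, local_wait_start));
    println!("fresh collection misclassified as stale");
}
```

This is why a causal marker (e.g. an inventory datum that can only exist after the sled configs propagated) is more robust than a wall-clock comparison.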
This is an attempt to catch other issues like #10025, and implements the reproduction steps described there as a live test, but applied to most zone types and not just Nexus. (We skip multinode clickhouse, because those aren't deployed by default, and internal DNS, because the planner can't replace it without execution running first anyway).
This took a bunch of tries to get passing, and I'd be very unsurprised if there are other kinds of flakes still lurking here. We're not running live tests as a part of CI, so I'm not sure how worried to be about this.
Includes a few bits of live test housekeeping (updates to the README and racklette serial number sets).