Skip to content

schema + db-model: support bundle request data selection tables#10111

Open
mergeconflict wants to merge 3 commits intomainfrom
mergeconflict/support-bundle-request-schema-and-model
Open

schema + db-model: support bundle request data selection tables#10111
mergeconflict wants to merge 3 commits intomainfrom
mergeconflict/support-bundle-request-schema-and-model

Conversation

@mergeconflict
Copy link
Contributor

@mergeconflict mergeconflict commented Mar 20, 2026

Adds two tables:

  • fm_support_bundle_request (the primary table representing nexus_types::fm::case::SupportBundleRequest
  • fm_sb_req_data_selection (the join table representing the nexus_types::support_bundle::BundleDataSelections in the support bundle request).

... and their corresponding models.

This is for persisting fault-management-initiated support bundle requests as part of a sitrep. Each request belongs to a case and tracks which sitrep originally requested it. Context: #10062.

This supersedes #10091 and #10095, in which the schema had a different table for each BundleDataSelection variant. That got hairy because we were joining 6 tables together and Diesel got mad about it. (Then I got mad.)

@mergeconflict mergeconflict changed the base branch from main to mergeconflict/builder-side-quest March 20, 2026 16:56
@mergeconflict mergeconflict added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Mar 20, 2026
@mergeconflict mergeconflict self-assigned this Mar 20, 2026
@mergeconflict mergeconflict requested review from hawkw, jgallagher and smklein and removed request for hawkw and smklein March 20, 2026 17:03
@mergeconflict mergeconflict force-pushed the mergeconflict/support-bundle-request-schema-and-model branch 2 times, most recently from 476acc0 to 88e7b32 Compare March 20, 2026 17:32
BundleDataCategory::SledCubbyInfo => BundleData::SledCubbyInfo,
BundleDataCategory::SpDumps => BundleData::SpDumps,
BundleDataCategory::HostInfo => {
if self.all_sleds.expect(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be okay, but since the consequence of being wrong is crashing Nexus, we usually make these runtime errors; e.g.,

(DbBpPhysicalDiskDisposition::InService, Some(_))
| (DbBpPhysicalDiskDisposition::Expunged, None) => Err(anyhow!(
"illegal database state (CHECK constraint broken?!): \
disposition {:?}, disposition_expunged_as_of_generation {:?}",
value.disposition,
value.expunged_as_of_generation,
)),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ef06169. Question about this, though: should we basically never use expect in production code? In the past, I tend to use Result-like things only for expected errors, and prefer to panic for cases that we think really shouldn't happen (M-PANIC-ON-BUG). Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally think "expect/unwrap" is fine if you're writing code that is truly impossible to trigger, under any circumstances. My metric for this is basically self-consistency: within the Nexus binary, is this always impossible?

This does not pass the bar - we mention the CHECK constraint here, but CockroachDB is an external process, which could be modified independently of Nexus. It could return whatever data, the version could skew, etc. Under that lens, data returned here should be treated as "arbitrary input".

With that lens: It's totally fine to throw an error if the data from CockroachDB looks unexpected, but that seems very different to me from "panic the whole process".

Copy link
Contributor Author

@mergeconflict mergeconflict Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smklein Makes sense, that's a good explanation. Mismatched expectations due to version skew or whatever are super subtle and I've definitely been burned by them within the last year.

To be clear: when you say "throw an error" you mean "return an Err result", yeah?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: when you say "throw an error" you mean "return an Err result", yeah?

Yeah, exactly - we could treat this as a serialization error ("I refuse to read this from the database!") or as a parsing error ("I can read the raw data from the database, but I'm preventing us from transforming this into a usable type"), or something else -- as long as it doesn't crash the Nexus process

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ef06169. Question about this, though: should we basically never use expect in production code? In the past, I tend to use Result-like things only for expected errors, and prefer to panic for cases that we think really shouldn't happen (M-PANIC-ON-BUG). Thoughts?

+1 on all of Sean's comments - sorry, I should have gone into more details. I agree with panicking on bugs. Some cases that I know we use expect() that I think are fine are:

  • Serializing a type to JSON in memory (this can only panic if the type isn't representable in JSON, which would be a pretty severe bug if we expect it to be JSON-able)
  • Pulling items out of maps by keys that we know exist, because we locally created the map or already checked that the key exists or somesuch
  • Calling fallible functions with data that obviously can't fail (e.g., NonZeroU8::new(123).unwrap(), because NonZeroU8::new() only fails if passed 0)

Some cases we've used expect() in the past (and maybe still do) but shouldn't:

  • Reading from the DB, as in this case for the reasons Sean noted
  • Pulling items out of maps by keys that we're pretty sure ought to exist (Nexus panic inside sync_switch_configuration background task #8579 is a specific example of this: we run two different but closely related db queries in separate transactions and assume they have the same contents - that's almost always true, unless someone happens to have changed them in between the two queries)
  • Assuming some invariant is always upheld when it's only upheld sometimes (Large silo quota amount could crash nexus #5732 is a specific example of this, where we reasonably assumed the ByteCount newtype could only have valid contents, but because we #[derive(Deserialize)]'d it, it could be constructed via deserialize without upholding the invariants). This case is a little fiddly - I think more often than not the expect() is fine and we need to ensure the invariants are always upheld (e.g. in fixing Large silo quota amount could crash nexus #5732, we manually implemented Deserialize to enforce the invariant).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with all of the above. I've filed #10118 on an idea we've discussed before to be able to report operational errors that we believe should be impossible.

@mergeconflict mergeconflict force-pushed the mergeconflict/builder-side-quest branch 2 times, most recently from 770be0e to 8153e6f Compare March 20, 2026 17:57
@mergeconflict mergeconflict force-pushed the mergeconflict/support-bundle-request-schema-and-model branch from 88e7b32 to d28f22b Compare March 20, 2026 17:57
Adds fm_support_bundle_request and fm_sb_req_data_selection tables
for persisting FM-initiated support bundle requests as part of a
sitrep. Each request belongs to a case and tracks which sitrep
originally requested it. The data selection table uses a
bundle_data_category enum and category-specific nullable columns
with CHECK constraints to store per-category parameters (sled
selection for host_info, time/serial/class filters for ereports).
@mergeconflict mergeconflict force-pushed the mergeconflict/builder-side-quest branch from 8153e6f to c171063 Compare March 20, 2026 18:08
@mergeconflict mergeconflict force-pushed the mergeconflict/support-bundle-request-schema-and-model branch from d28f22b to 0c999b9 Compare March 20, 2026 18:38
…anicking

- Replace `array_length(sled_ids, 1) IS NULL` with `cardinality(sled_ids) = 0`
  in the CHECK constraint, avoiding a counterintuitive PostgreSQL/CRDB quirk
  where `array_length` of an empty array returns NULL.

- Replace `expect()` calls in `SbReqDataSelection::into_bundle_data` with
  `anyhow::Context`-based error returns, so a broken CHECK constraint
  produces a runtime error rather than crashing Nexus.
Base automatically changed from mergeconflict/builder-side-quest to main March 20, 2026 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants