schema + db-model: support bundle request data selection tables#10111
schema + db-model: support bundle request data selection tables#10111mergeconflict wants to merge 3 commits intomainfrom
Conversation
476acc0 to
88e7b32
Compare
| BundleDataCategory::SledCubbyInfo => BundleData::SledCubbyInfo, | ||
| BundleDataCategory::SpDumps => BundleData::SpDumps, | ||
| BundleDataCategory::HostInfo => { | ||
| if self.all_sleds.expect( |
There was a problem hiding this comment.
This should be okay, but since the consequence of being wrong is crashing Nexus, we usually make these runtime errors; e.g.,
omicron/nexus/db-model/src/deployment.rs
Lines 382 to 388 in 72b5433
There was a problem hiding this comment.
Fixed in ef06169. Question about this, though: should we basically never use expect in production code? In the past, I tend to use Result-like things only for expected errors, and prefer to panic for cases that we think really shouldn't happen (M-PANIC-ON-BUG). Thoughts?
There was a problem hiding this comment.
I generally think "expect/unwrap" is fine if you're writing code that is truly impossible to trigger, under any circumstances. My metric for this is basically self-consistency: within the Nexus binary, is this always impossible?
This does not pass the bar - we mention the CHECK constraint here, but CockroachDB is an external process, which could be modified independently of Nexus. It could return whatever data, the version could skew, etc. Under that lens, data returned here should be treated as "arbitrary input".
With that lens: It's totally fine to throw an error if the data from CockroachDB looks unexpected, but that seems very different to me from "panic the whole process".
There was a problem hiding this comment.
@smklein Makes sense, that's a good explanation. Mismatched expectations due to version skew or whatever are super subtle and I've definitely been burned by them within the last year.
To be clear: when you say "throw an error" you mean "return an Err result", yeah?
There was a problem hiding this comment.
To be clear: when you say "throw an error" you mean "return an Err result", yeah?
Yeah, exactly - we could treat this as a serialization error ("I refuse to read this from the database!") or as a parsing error ("I can read the raw data from the database, but I'm preventing us from transforming this into a usable type"), or something else -- as long as it doesn't crash the Nexus process
There was a problem hiding this comment.
Fixed in ef06169. Question about this, though: should we basically never use
expectin production code? In the past, I tend to useResult-like things only for expected errors, and prefer to panic for cases that we think really shouldn't happen (M-PANIC-ON-BUG). Thoughts?
+1 on all of Sean's comments - sorry, I should have gone into more details. I agree with panicking on bugs. Some cases that I know we use expect() that I think are fine are:
- Serializing a type to JSON in memory (this can only panic if the type isn't representable in JSON, which would be a pretty severe bug if we expect it to be JSON-able)
- Pulling items out of maps by keys that we know exist, because we locally created the map or already checked that the key exists or somesuch
- Calling fallible functions with data that obviously can't fail (e.g.,
NonZeroU8::new(123).unwrap(), becauseNonZeroU8::new()only fails if passed 0)
Some cases we've used expect() in the past (and maybe still do) but shouldn't:
- Reading from the DB, as in this case for the reasons Sean noted
- Pulling items out of maps by keys that we're pretty sure ought to exist (Nexus panic inside
sync_switch_configurationbackground task #8579 is a specific example of this: we run two different but closely related db queries in separate transactions and assume they have the same contents - that's almost always true, unless someone happens to have changed them in between the two queries) - Assuming some invariant is always upheld when it's only upheld sometimes (Large silo quota amount could crash nexus #5732 is a specific example of this, where we reasonably assumed the
ByteCountnewtype could only have valid contents, but because we#[derive(Deserialize)]'d it, it could be constructed via deserialize without upholding the invariants). This case is a little fiddly - I think more often than not theexpect()is fine and we need to ensure the invariants are always upheld (e.g. in fixing Large silo quota amount could crash nexus #5732, we manually implementedDeserializeto enforce the invariant).
There was a problem hiding this comment.
Agree with all of the above. I've filed #10118 on an idea we've discussed before to be able to report operational errors that we believe should be impossible.
770be0e to
8153e6f
Compare
88e7b32 to
d28f22b
Compare
Adds fm_support_bundle_request and fm_sb_req_data_selection tables for persisting FM-initiated support bundle requests as part of a sitrep. Each request belongs to a case and tracks which sitrep originally requested it. The data selection table uses a bundle_data_category enum and category-specific nullable columns with CHECK constraints to store per-category parameters (sled selection for host_info, time/serial/class filters for ereports).
8153e6f to
c171063
Compare
d28f22b to
0c999b9
Compare
…anicking - Replace `array_length(sled_ids, 1) IS NULL` with `cardinality(sled_ids) = 0` in the CHECK constraint, avoiding a counterintuitive PostgreSQL/CRDB quirk where `array_length` of an empty array returns NULL. - Replace `expect()` calls in `SbReqDataSelection::into_bundle_data` with `anyhow::Context`-based error returns, so a broken CHECK constraint produces a runtime error rather than crashing Nexus.
Adds two tables:
fm_support_bundle_request(the primary table representingnexus_types::fm::case::SupportBundleRequestfm_sb_req_data_selection(the join table representing thenexus_types::support_bundle::BundleDataSelections in the support bundle request).... and their corresponding models.
This is for persisting fault-management-initiated support bundle requests as part of a sitrep. Each request belongs to a case and tracks which sitrep originally requested it. Context: #10062.
This supersedes #10091 and #10095, in which the schema had a different table for each
BundleDataSelectionvariant. That got hairy because we were joining 6 tables together and Diesel got mad about it. (Then I got mad.)