[fm] always record the physical location of an ereport reporter by hawkw · Pull Request #10096 · oxidecomputer/omicron

hawkw · 2026-03-18T23:05:56Z

This is a large, and somewhat brutish, migration which attempts to
correct my past lack of forethought in designing the
omicron.public.ereport schema. In particular, a younger, dumber
version of Eliza foolishly chose to make the physical location of the
reporter in the rack (the sp_type and sp_slot columns) nullable, and
only incldue them when the reporter is a SP, and not when it's a
sled's host OS. This made a lot of people¹ very unhappy, and is
widely regarded as a bad move.

While SP reporters are uniquely indexed by the sp_type and sp_slot
(as they are the keys Nexus uses to request SP ereports from MGS), host
OS ereports are identified by the sled UUID (as it's the primary key of
the entry in the sled table through which Nexus will discover the
address of the sled-agent that it asks for the sled's ereports). At
the time, I thought that we would only need to hang onto the sled UUID,
as we could always get the physical slot of the sled by going and doing
some JOINs to look that up by sled UUID. However, it's much less
pleasant to do that than I had anticipated, as turning a sled UUID into
a slot requires looking up the hw_baseboard_id for the sled in the
inventory's inv_sled_agent table, and then using the hw_baseboard_id
in the inv_service_processor table, which actually knows the slot.
This is a bit of a pain to do, and because old inventory collections and
sled entries are deleted, we may no longer be able to find the slot
for a sled UUID that references a sled that no longer exists. Thus, we
really should have been recording the physical location in the ereport
table if we want to be able to have it for historic ereports.

This PR rights these wrongs by replacing the nullable sp_type and
sp_slot columns in the ereport table with non-null slot_type and
slot columns (renamed to reflect that they are no longer specifically
for SPs), and changing the CHECK constraints to permit host OS
ereports to also have those columns. We attempt to backfill the slot for
host OS ereports using the nasty join chain I described above. If we are
unable to do this for a host OS ereport because it refers to a sled UUID
that no longer exists in the inventory, we just delete it. This feels
quite icky, but it's worth noting that, at time of writing, we simply
don't have any code for collecting ereports from the host OS into CRDB
anyway, so there aren't actually going to be any actual ereports getting
dropped here --- making an attempt to backfill them is really just an
intellectual exercise, but it made me feel better.

Well...mostly just me. ↩

smklein · 2026-03-19T22:59:28Z

schema/crdb/ereport-slots-non-null/up04.sql

+        ON isa.hw_baseboard_id = isp.hw_baseboard_id
+        AND isa.inv_collection_id = isp.inv_collection_id
+    WHERE isa.hw_baseboard_id IS NOT NULL
+    ORDER BY isa.sled_id, isa.inv_collection_id DESC


You're ordering by UUIDs descending here , rather than like, a generation number of time collected or something. Is that intentional?

smklein · 2026-03-19T23:00:17Z

nexus/tests/integration_tests/schema.rs

 const PORT_SETTINGS_ID_165_2: &str = "8b777d9b-62a3-4c4d-b0b7-314315c2a7fc";
 const PORT_SETTINGS_ID_165_3: &str = "7c675e89-74b1-45da-9577-cf75f028107a";
-const PORT_SETTINGS_ID_165_4: &str = "e2413d63-9307-4918-b9c4-bce959c63042";
+const PORT_SETTINGS_ID_165_4: &str = "e2423d63-9307-4918-b9c4-bce959c63042";


This seems like it might be a bad merge? Why are we changing this UUID?

argh, i have no idea how that happened, i think the cat stepped on my keyboard or something. will fix

smklein · 2026-03-19T23:10:11Z

schema/crdb/dbinit.sql


+    /* physical slot location of the reporter. */
+    slot_type omicron.public.sp_type NOT NULL,
+    slot INT4 NOT NULL,


The implication here is that "slot will always be known to the host OS, for all ereports it ever generates", right?

I'm totally on-board with backfilling this value - preferring it to be non-null - but I want to make sure I understand the implications of this constraint that it cannot be NULL. We'll always have the slot, under all host OS ereports we care about?

That was kind of the entire intent of the change, yes. Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?

(for the record, the idea was that the location would be coming from Nexus when it's writing what it collected from the sled, not from the sled itself)

Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?

Should - yes. Guaranteed - no. Inventory is "best effort" collection (so can be pretty lossy), and is very non-atomic (collected over a period of several 10s of seconds, with several minutes between collections, in general). A couple examples where sled-agent could talk to Nexus but not be present in inventory:

The Nexus that collected the most recent inventory collection(s) was (or still is!) partitioned off from the sled, but the sled can find and send ereports to a different Nexus.

The sled was off or not present during the last inventory collection and sends ereports before a new collection has been made that sees it.

Yeah, this was why I asked - to me, this PR does two things:

It allows slot information to be present for sled-based ereports (we're removing the constraint that "slot be null for host-based ereports")

It also mandates that it exist (we're making slot "always not null")

I was poking at this edge case because I want to make sure "the set of ereports we can collect is the same if we do (1) vs (1 + 2)". Sounds like we might not have slot info in all cases?

Sounds like we might not have slot info in all cases?

I think this is true for hosts, yeah. For SPs we always know which slot a given message came from, because the act of receiving a message involves slot information (messages are received on specific network interfaces, which uniquely identify the physical slot that sent the message). But for hosts I don't think we have any such thing; they don't know their own slot, and I don't think we have a way to guarantee Nexus does either.

So, hmm. The intent behind this change was to make the slot information mandatory so that it could be used for the reporter restart ordering table I described in #10073 (comment). If we can't rely on a physical location being known for sleds that exist in the sled table, I think I'm gonna have to rethink how we want to approach the reporter restart ordering. I suppose we could just treat anything where we don't know its slot as not being part of a restart order, and allow collecting ereports from stuff that isn't understood to be part of that order.

Okay, so I talked through this a bit with @smklein, and I think I know how I want to move forwards with here. This is all based on the fact that, for SP ereports, we will always know the slot (because it's how we talk to the SP through MGS), but may not know the VPD (in the event of a VPD read error), while for host OS ereports, we will always know the VPD (as it's necessary to join the trust quorum), but may not always know the location as discussed here. Therefore, I think I want to do the following:

Leave the slot field nullable, but change the check constraint so that it is allowed to be non-null for host OS ereports.

We could make slot_type non-null, since for host OS reporters, it will always be a sled by definition.

Continue to move forwards with using (non-null) slots as the key for reporter restart ordering. If we receive ereports from a sled-agent before we know its slot, we just don't add that reporter restart ID to the restart order yet.

If we eventually learn the location of that reporter, it can be added to the restart order for that location

We can still track which ereports we have seen from a reporter that is not part of a slot's restart order, we just can't also use it to say "also don't look for new ereports from these reporters as they represent previous restarts of this thing.". This is actually not that bad.

Basically the idea here is that the restart ordering table is essentially saying "this is the current reporter occupying this physical location." If we also have ereports for which we don't know a physical location, that's okay, they just can't participate in that ordering. If all our code approaches things from that perspective, this basically all seems fine.

smklein · 2026-03-19T23:11:48Z

schema/crdb/dbinit.sql

    sled_id UUID,

+    /* physical slot location of the reporter. */
+    slot_type omicron.public.sp_type NOT NULL,


Should this type also be renamed to "slot_type"?

It's funny for a host OS to reference the "sp_type::sled", because we really are referring to "the sled", rather than "the sled's SP" in that case.

Also I respect that this might be a pain in the ass from a schema point-of-view, so, push back if this sucks too much

unfortunately, uh, this enum is used all over the place and i didn't really think it was prudent to do the amount of deleting and recreating columns it would take to rename the type...?

fair enough, that's why I gave the caveat. I can live with us saying "the host sled is a service processor of type = sled!", even though that's kinda untrue

hawkw added 10 commits March 18, 2026 16:04

huge ereport table slot refactor

cfe914a

fix ordering

d0283b9

fix other ordering

79ee34f

you have to remember to run rustfmt

346d653

tidy up migration

536cfb6

add migration readme

c787811

blurgh

c58575c

Merge branch 'main' into eliza/ereport-refactor

58f2fe0

rename the migration

dcb9459

post-merge fixup test util

15b9b4f

hawkw changed the title ~~[wip] always record the physical location of an ereport reporter~~ [fm] always record the physical location of an ereport reporter Mar 19, 2026

hawkw marked this pull request as ready for review March 19, 2026 17:20

hawkw requested review from mergeconflict and smklein March 19, 2026 17:21

hawkw self-assigned this Mar 19, 2026

hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Mar 19, 2026

post-merge fixup migration version

fe5a6bf

smklein reviewed Mar 19, 2026

View reviewed changes

Conversation

hawkw commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Footnotes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hawkw commented Mar 18, 2026 •

edited

Loading