@pierugo-dfinity pierugo-dfinity commented Nov 4, 2025

If the orchestrator breaks after an upgrade, it may not be possible to provision readonly SSH keys to recover the subnet. In that case, there is no easy way to learn the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's own node is up to date. However, the state manager regularly logs the latest computed state hash, so a recovery operator could inspect the logs of all nodes and use this information to build a recovery CUP.
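To illustrate how a recovery operator might combine the hashes seen in the logs of several nodes, here is a minimal sketch. It assumes the `(height, hash)` pairs have already been extracted from each node's logs; the function name `pick_recovery_state`, the node identifiers, and the rule of refusing to continue when nodes disagree at the highest height are all hypothetical and not part of this PR or of any recovery tooling.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def pick_recovery_state(observations: Dict[str, Tuple[int, str]]) -> Tuple[int, str]:
    """Pick the state to put into a recovery CUP from per-node log observations.

    `observations` maps a (hypothetical) node identifier to the latest
    (height, state hash) pair that was seen in that node's logs.
    """
    if not observations:
        raise ValueError("no state hash was observed on any node")

    # Group the observed hashes by height so that disagreements are visible.
    hashes_by_height: Dict[int, Set[str]] = defaultdict(set)
    for height, state_hash in observations.values():
        hashes_by_height[height].add(state_hash)

    # Use the highest height that any node reported.
    top_height = max(hashes_by_height)
    hashes = hashes_by_height[top_height]

    # Assumed policy: if nodes disagree at that height, stop and let a human decide.
    if len(hashes) != 1:
        raise ValueError(f"conflicting state hashes at height {top_height}")

    return top_height, next(iter(hashes))
```

With this approach, a node that lags behind is simply ignored, while conflicting hashes at the highest height are surfaced as an error instead of being resolved silently.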

This PR makes sure that this log is always observed on the log endpoint before an upgrade, and compares the logged hash with the state hash that the node reboots with. That way, we test that the log can be reliably observed before a node reboots, and the test would break if the log were ever removed.
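The comparison described above can be sketched roughly as follows. This is only an illustration, not the test code from this PR: the log line format matched by `HASH_LOG` is an assumption, and the helper names `latest_logged_state_hash` and `cup_matches_latest_log` are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# ASSUMPTION: a hypothetical log line shape such as
#   "... root hash <64 hex chars> of state @<height>"
# The actual state manager log format may differ.
HASH_LOG = re.compile(r"root hash (?P<hash>[0-9a-f]{64}) of state @(?P<height>\d+)")


def latest_logged_state_hash(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return the (height, hash) pair with the highest height seen in the logs."""
    latest: Optional[Tuple[int, str]] = None
    for line in lines:
        match = HASH_LOG.search(line)
        if match is None:
            continue  # unrelated log line
        entry = (int(match.group("height")), match.group("hash"))
        if latest is None or entry[0] >= latest[0]:
            latest = entry
    return latest


def cup_matches_latest_log(pre_upgrade_lines: Iterable[str], cup_state_hash: str) -> bool:
    """Check that the hash the node reboots with equals the last hash logged before the upgrade."""
    latest = latest_logged_state_hash(pre_upgrade_lines)
    # If no hash was logged at all, the check fails; this is what would make
    # a test break if the log line were removed.
    return latest is not None and latest[1] == cup_state_hash
```

Treating a missing log line as a failure, not as a skipped check, is the design choice that guards against the log being removed later.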

Furthermore, the test now always checks that the orchestrator exits gracefully. Previously this was not the case: when the test was added, the relevant log had not been merged into mainnet yet, so we could only check it when downgrading from HEAD to mainnet and not the other way around.

@github-actions github-actions bot added the test label Nov 4, 2025
@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 label Nov 4, 2025
@pierugo-dfinity pierugo-dfinity changed the title from "test(orchestrator): compare CUP state hash post-upgrade and latest computed root hash pre-upgrade" to "test(orchestrator): compare latest computed root hash pre-upgrade with CUP state hash post-upgrade" Nov 19, 2025
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review November 20, 2025 16:12
@pierugo-dfinity pierugo-dfinity requested review from a team as code owners November 20, 2025 16:12
pierugo-dfinity added a commit that referenced this pull request Nov 26, 2025
…it (#7487)

This PR logs more useful information (especially the state hash) about
the local CUP just before persisting it in the orchestrator.

This is useful in cases where the orchestrator breaks after an upgrade,
which would prevent provisioning readonly SSH keys to recover the
subnet. In that case, there is no easy way to learn the latest state
hash to include in the recovery CUP, other than hoping that the recovery
operator's node is up to date. Logging information about the CUP just
before rebooting removes this requirement, as long as the latest logs
are scraped before the node reboots.

Edit: As pointed out in the PR comments, the original solution had a
weakness: if the node reboots too fast, the logs might not be scraped
before the reboot. Since the state manager already logs the state hash
before the CUP is actually created, we can rely on that log instead. The
original twin [PR](#7525), which was intended to test this
functionality, now relies on the log from the state manager and so
prevents that log from being removed in the future. That PR is now also
open, since we do not need to wait for the current PR to be merged to
mainnet NNS. The two PRs are independent.

Still, including the state hash in the orchestrator cannot hurt and this
PR does just that.

As for the original 2-second sleep at the end of the orchestrator to
let Vector scrape late logs: there may be a way to persist logs before
rebooting and have `systemd-journal-gatewayd` serve logs from the
previous boot, but I do not think it is worth the effort (we would need
to change the Vector configs, for example) just to recover a few
missing log lines.
@pierugo-dfinity pierugo-dfinity requested a review from a team as a code owner November 28, 2025 12:58
mraszyk pushed a commit that referenced this pull request Dec 1, 2025
…it (#7487)

@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Dec 2, 2025
Merged via the queue into master with commit ba26afe Dec 2, 2025
66 of 67 checks passed
@pierugo-dfinity pierugo-dfinity deleted the pierugo/orchestrator/test-latest-local-cup-log branch December 2, 2025 07:33