test(orchestrator): compare latest computed root hash pre-upgrade with CUP state hash post-upgrade #7525

pierugo-dfinity · 2025-11-04T09:12:20Z

If the orchestrator breaks after an upgrade, it is no longer possible to provision readonly SSH keys to recover the subnet. In that case, there is no easy way to know the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's node is up to date. However, the state manager regularly logs the latest computed state hash, so a recovery operator could inspect the logs of all nodes and use this information to build a recovery CUP.
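As a rough illustration of how an operator might combine the hashes seen in the logs of all nodes, here is a minimal sketch. It is not taken from this PR: the function names are hypothetical, and it assumes the usual fault bound of `n >= 3f + 1`, accepting a hash only when at least `n - f` nodes report it.

```rust
use std::collections::HashMap;

/// Maximum number of faulty nodes tolerated among `n` nodes,
/// assuming the bound n >= 3f + 1 (an assumption of this sketch).
fn max_faulty(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Returns the hash reported by at least n - f nodes, if there is one.
/// `reported` holds one entry per node; `None` means no hash was found
/// in that node's logs.
fn agreed_state_hash<'a>(reported: &[Option<&'a str>]) -> Option<&'a str> {
    let n = reported.len();
    if n == 0 {
        return None;
    }
    let threshold = n - max_faulty(n);

    // Count how many nodes reported each distinct hash.
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for hash in reported.iter().flatten().copied() {
        *counts.entry(hash).or_insert(0) += 1;
    }

    // The threshold exceeds n / 2, so at most one hash can reach it.
    counts
        .into_iter()
        .find(|&(_, count)| count >= threshold)
        .map(|(hash, _)| hash)
}
```

With four nodes the threshold is three, so one silent or lagging node does not prevent agreement, but two diverging reports do.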

This PR makes sure that this log is always observed from the log endpoint before an upgrade, and compares it with the state hash that the node reboots with. That way, we test that the log can be reliably observed before a node reboots, and the test will break if the log is ever removed.
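The comparison described above can be sketched as follows. This is only an illustration: the marker string `root hash: ` and both function names are assumptions made for the example, and the actual wording of the state manager log and the test code in `rs/tests/consensus/upgrade/common.rs` may differ.

```rust
/// Hypothetical marker; the real state manager log message may be worded differently.
const HASH_MARKER: &str = "root hash: ";

/// Returns the hex-encoded hash that follows the marker in `line`, if any.
fn extract_state_hash(line: &str) -> Option<&str> {
    let start = line.find(HASH_MARKER)? + HASH_MARKER.len();
    let rest = &line[start..];
    // The hash ends at the first character that is not a hex digit.
    let end = rest
        .find(|c: char| !c.is_ascii_hexdigit())
        .unwrap_or(rest.len());
    if end == 0 {
        None
    } else {
        Some(&rest[..end])
    }
}

/// Returns the hash from the most recent log line that contains one.
/// `lines` is expected to be ordered from oldest to newest.
fn latest_state_hash<'a>(lines: &[&'a str]) -> Option<&'a str> {
    lines.iter().rev().find_map(|&line| extract_state_hash(line))
}

/// Checks that the latest logged hash equals the CUP state hash (hex, case-insensitive).
fn cup_matches_latest_log(lines: &[&str], cup_state_hash_hex: &str) -> bool {
    latest_state_hash(lines)
        .map_or(false, |hash| hash.eq_ignore_ascii_case(cup_state_hash_hex))
}
```

Returning `false` when no line carries a hash is deliberate in this sketch: a missing log should fail the check just like a mismatching one.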

Furthermore, the test now always checks that the orchestrator exits gracefully. Previously this was not the case: when the test was added, the relevant log had not yet been merged into mainnet, so we could only check it when downgrading from HEAD to mainnet, not when upgrading from mainnet to HEAD.
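A graceful-exit check of this kind amounts to reading the log stream until a shutdown message shows up. The sketch below illustrates the idea over any iterator of lines. The marker text and the function name are hypothetical, and a bound on the number of lines stands in for whatever timeout the real test uses.

```rust
/// Hypothetical marker; the message the orchestrator logs on shutdown may differ.
const GRACEFUL_EXIT_MARKER: &str = "Orchestrator shut down gracefully";

/// Consumes `lines` until the marker shows up, reading at most `max_lines` entries.
/// Returns the number of lines read up to and including the marker, or `None`
/// if the stream ended or the budget ran out first.
fn lines_until_graceful_exit<I, S>(lines: I, max_lines: usize) -> Option<usize>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    lines
        .into_iter()
        .take(max_lines)
        .position(|line| line.as_ref().contains(GRACEFUL_EXIT_MARKER))
        .map(|index| index + 1)
}
```

Because the function stops at the first match, it never waits on lines that arrive after the shutdown message, which matters when the node is about to reboot.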

…it (#7487)

This PR logs more useful information (especially the state hash) about the local CUP just before the orchestrator persists it. This is useful when the orchestrator breaks after an upgrade, which would prevent provisioning readonly SSH keys to recover the subnet. In that case, there is no easy way to know the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's node is up to date. Logging information about the CUP just before rebooting removes this requirement, as long as the latest logs are scraped before the node reboots.

Edit: Following the PR comments, the original solution had a flaw: if the node reboots too fast, the logs might not be scraped before the reboot. Since the state manager logs the state hash anyway before the CUP is actually created, we can rely on that log instead. The original twin [PR](#7525), intended to test this functionality, now relies on the log from the state manager, which prevents that log from being removed in the future. It is now also open, since we do not need to wait for the current PR to be merged to mainnet NNS. The two PRs are independent. Still, including the state hash in the orchestrator log cannot hurt, and this PR does just that.

About the original sleep of 2 seconds at the end of the orchestrator to let Vector scrape late logs: there may be a way to persist logs before rebooting and ask `systemd-journal-gatewayd` to serve logs from the previous boot, but I do not think it is worth the effort (we would need to change the Vector configs, for example) just to recover a few missing log lines.

pierugo-dfinity added 6 commits October 30, 2025 13:54

chore(orchestrator): add local CUP state hash metric

a9e12a7

Revert "chore(orchestrator): add local CUP state hash metric"

3408fd9

This reverts commit a9e12a7.

chore: log local CUP info

6f87909

style: replace block_on with .await

8e151c1

test: compare CUP state hashes before/after reboot

38a207e

chore: bazel

9da75d4

github-actions bot added the test label Nov 4, 2025

REVERT ME: enable tests

8581809

pierugo-dfinity force-pushed the pierugo/orchestrator/test-latest-local-cup-log branch from 889c068 to 8581809 Compare November 4, 2025 09:14

pierugo-dfinity mentioned this pull request Nov 4, 2025

chore(orchestrator): log more info about local CUP before persisting it #7487

Merged

pierugo-dfinity added 7 commits November 4, 2025 12:19

refactor: always check orchestrator exits gracefully + stream until i…

794a8ca

…t does so

style

92c27b9

refactor: reword old log and move confirmation log

0f99dee

refactor: move sleep at the end of orchestrator stopping

58badb0

Merge branch 'pierugo/orchestrator/local-cup-hash-metric' into pierug…

71afd84

…o/orchestrator/test-latest-local-cup-log

fix: adapt log format

dc1c9fb

REVERT ME: stop gatewayd after orchestrator

aebfb43

pierugo-dfinity added the CI_ALL_BAZEL_TARGETS label (runs all bazel targets and uploads them to S3) Nov 4, 2025

pierugo-dfinity added 12 commits November 4, 2025 17:05

style

2cdeb9c

chore: remove sleep

d0b8963

fix: clippy

3bb735e

Merge branch 'pierugo/orchestrator/local-cup-hash-metric' into pierug…

60f9f5e

…o/orchestrator/test-latest-local-cup-log

Revert "fix: clippy"

5c0a596

This reverts commit 3bb735e.

Revert "chore: remove sleep"

f4ac3b0

This reverts commit d0b8963.

Merge branch 'pierugo/orchestrator/local-cup-hash-metric' into pierug…

ac91351

…o/orchestrator/test-latest-local-cup-log

Reapply "chore: remove sleep"

bf7b931

This reverts commit f4ac3b0.

Reapply "fix: clippy"

4e6dc25

This reverts commit 5c0a596.

Merge branch 'master' into pierugo/orchestrator/local-cup-hash-metric

0dd87af

Merge branch 'pierugo/orchestrator/local-cup-hash-metric' into pierug…

38c8348

…o/orchestrator/test-latest-local-cup-log

Revert "REVERT ME: stop gatewayd after orchestrator"

df69b2b

This reverts commit aebfb43.

fix: read local CUP as soon as possible

1a767a7

pierugo-dfinity changed the title ~~test(orchestrator): compare CUP state hash post-upgrade and latest computed root hash pre-upgrade~~ test(orchestrator): compare latest computed root hash pre-upgrade with CUP state hash post-upgrade Nov 19, 2025

pierugo-dfinity added 2 commits November 20, 2025 08:53

fix: clippy

f64efef

docs

e84a0fe

pierugo-dfinity marked this pull request as ready for review November 20, 2025 16:12

pierugo-dfinity requested review from a team as code owners November 20, 2025 16:12

github-actions bot added @consensus @idx labels Nov 20, 2025

basvandijk approved these changes Nov 20, 2025

kpop-dfinity reviewed Nov 21, 2025

rs/tests/consensus/upgrade/common.rs (3 review threads, outdated and resolved)

pierugo-dfinity added 2 commits November 21, 2025 13:06

docs: add comment to prevent log from being removed

a1be5a0

style: meaningful variable names

ddf7f34

pierugo-dfinity added 2 commits November 28, 2025 12:58

fix: observe n - f logs with same hash

77b1cde

fix: retry reading CUP

b5ae3e7

pierugo-dfinity requested a review from a team as a code owner November 28, 2025 12:58

github-actions bot added the @team-dsm label Nov 28, 2025

Merge branch 'master' into pierugo/orchestrator/test-latest-local-cup…

7ff3422

…-log

pierugo-dfinity requested a review from kpop-dfinity November 28, 2025 13:04

adambratschikaye approved these changes Dec 1, 2025

kpop-dfinity reviewed Dec 1, 2025

rs/tests/consensus/upgrade/common.rs (review thread, outdated and resolved)

fix: do not compare hash with post-reboot CUP

788ac10

kpop-dfinity approved these changes Dec 1, 2025

rs/tests/consensus/upgrade/common.rs (review thread, outdated and resolved)

docs

3be34f4

pierugo-dfinity added this pull request to the merge queue Dec 2, 2025

Merged via the queue into master with commit ba26afe Dec 2, 2025
66 of 67 checks passed

pierugo-dfinity deleted the pierugo/orchestrator/test-latest-local-cup-log branch December 2, 2025 07:33

4 participants