test(orchestrator): compare latest computed root hash pre-upgrade with CUP state hash post-upgrade #7525
Merged
pierugo-dfinity
merged 47 commits into
master
from
pierugo/orchestrator/test-latest-local-cup-log
Dec 2, 2025
This reverts commit a9e12a7.
889c068 to
8581809
Compare
…o/orchestrator/test-latest-local-cup-log
basvandijk
approved these changes
Nov 20, 2025
pierugo-dfinity
added a commit
that referenced
this pull request
Nov 26, 2025
…it (#7487)
This PR logs more useful information (especially the state hash) about the local CUP just before persisting it in the orchestrator. This is useful when the orchestrator breaks after an upgrade, which would prevent provisioning readonly SSH keys to recover the subnet. In that case, there is no easy way to know the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's node is up to date. Logging information about the CUP just before rebooting removes this requirement, as long as the latest logs were scraped before the node reboots.
Edit: Following the PR comments, the original solution had a flaw: if the node reboots too fast, the logs might not be scraped before the reboot. Since the state manager logs the state hash anyway before the CUP is actually created, we can rely on that log instead. The original twin [PR](#7525), intended to test this functionality, now relies on the log from the state manager, which prevents that log from being removed in the future. It is also now open, since we do not need to wait for the current PR to be merged to mainnet NNS. The two PRs are independent. Still, including the state hash in the orchestrator log cannot hurt, and this PR does just that.
About the original sleep of 2 seconds at the end of the orchestrator to let Vector scrape late logs: there may be a way to persist logs before rebooting and ask `systemd-journal-gatewayd` to serve logs from the previous boot, but I do not think it is worth the effort (we would need to change the Vector configs, for example) just to recover a few missing lines of logs.
adambratschikaye
approved these changes
Dec 1, 2025
kpop-dfinity
reviewed
Dec 1, 2025
mraszyk
pushed a commit
that referenced
this pull request
Dec 1, 2025
…it (#7487)
kpop-dfinity
approved these changes
Dec 1, 2025
When the orchestrator breaks after an upgrade, it is not possible to provision readonly SSH keys to recover the subnet. In that case, there is no easy way to know the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's node is up to date. However, the state manager regularly logs the latest computed state hash, so a recovery operator could look at the logs of all nodes and use this information to build a recovery CUP.
This PR makes sure that this log is always observed from the log endpoint before an upgrade, and compares it with the state hash that the node reboots with. That way, we test that we can reliably observe this log before a node reboots, and we make sure that the test would break if that log were ever removed in the future.
Furthermore, the test now always checks that the orchestrator exits gracefully. Previously this was not the case: when the test was added, the relevant log had not yet been merged into mainnet, so we could only check it when downgrading from HEAD to mainnet and not in the opposite direction.
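For readers who want a concrete picture of the comparison described above, here is a minimal sketch in Python. It is not the test added by this PR. The log line shape matched by `ROOT_HASH_RE`, the function names, and the 64-character hex encoding of the hash are all assumptions made for illustration; the actual state manager message and the way the test reads the log endpoint may differ.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical log line shape (an assumption, not the real state manager format):
#   "... root hash <64 hex chars> of state @<height>"
ROOT_HASH_RE = re.compile(
    r"root hash (?P<hash>[0-9a-fA-F]{64}) of state @(?P<height>\d+)"
)


def latest_logged_root_hash(log_lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (height, hash) for the highest height found in the logs, or None."""
    latest: Optional[Tuple[int, str]] = None
    for line in log_lines:
        match = ROOT_HASH_RE.search(line)
        if match is None:
            continue  # unrelated log line
        entry = (int(match.group("height")), match.group("hash").lower())
        # Keep the entry with the greatest height; on ties, the later line wins.
        if latest is None or entry[0] >= latest[0]:
            latest = entry
    return latest


def hash_matches_cup(log_lines: Iterable[str], cup_state_hash: str) -> bool:
    """True if the latest hash logged pre-upgrade equals the post-upgrade CUP state hash."""
    latest = latest_logged_root_hash(log_lines)
    return latest is not None and latest[1] == cup_state_hash.lower()
```

A check built this way fails both when the hashes differ and when no matching log line is found at all, which mirrors the stated goal that the test should break if the log is ever removed.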