Skip to content

fix: backward-forward-let#1570

Open
arnaubennassar wants to merge 9 commits intodevelopfrom
feat/imporve-bw-fw-lettool
Open

fix: backward-forward-let#1570
arnaubennassar wants to merge 9 commits intodevelopfrom
feat/imporve-bw-fw-lettool

Conversation

@arnaubennassar
Copy link
Copy Markdown
Collaborator

@arnaubennassar arnaubennassar commented Apr 1, 2026

🔄 Changes Summary

  • Improve aggsender startup recovery so stale local certificates no longer block progress: when local state is ahead of or conflicts with AggLayer, recovery now treats AggLayer as the source of truth, deletes stale local certs, and reinserts the correct settled/latest cert state when needed.
  • Extend backward-forward-let with staging-drill tooling:
    • add a new craft-cert command to build and sign malicious certificates for controlled staging exercises,
    • allow send-cert to run with --no-db so a cert can be sent to AggLayer without persisting it in the aggsender DB,
    • improve craft-cert fallback behavior so it can reconstruct historical leaves from aggsender RPC, aggsender DB, bridge service, and certificate-exit override data.
  • Update the backward/forward LET documentation to match the implemented operator workflow and staging validation learnings, including the manual fallback procedure when aggsender cannot provide certificate bridge exits.

⚠️ Breaking Changes

None

📋 Config Updates

None

✅ Testing

  • 🤖 Automatic:
    • go test ./aggsender/statuschecker
    • go test ./tools/backward_forward_let
  • 🖱️ Manual:
    • Reviewed the develop...feat/imporve-bw-fw-lettool diff against the updated docs/runbook to ensure the documented recovery and staging-drill
      flows match the implemented CLI behavior.
    • Verified the PR now covers more than docs only: aggsender recovery behavior, BFL CLI additions, fallback handling, and docs.
    • Verified on a staging network

🐞 Issues

  • Closes #none

🔗 Related PRs

  • None

📝 Notes

  • The aggsender recovery change is intentionally more permissive on startup: conflicting/stale local certs are deleted instead of causing recovery to fail, with AggLayer treated as authoritative.
  • craft-cert is explicitly gated behind --staging-only and is intended for staging drills, not production operation.
  • The missing-certificate override flow remains documented because recovery still depends on operator-supplied bridge exits when aggsender RPC/DB cannot provide them.

@arnaubennassar arnaubennassar marked this pull request as draft April 14, 2026 13:25
@arnaubennassar arnaubennassar changed the title docs: update backward-forward-let runbook fix: backward-forward-let Apr 16, 2026
@agglayer agglayer deleted a comment from claude bot Apr 16, 2026
@arnaubennassar arnaubennassar marked this pull request as ready for review April 16, 2026 14:51
@sonarqubecloud
Copy link
Copy Markdown

@joanestebanr joanestebanr requested a review from Copilot April 16, 2026 15:37
@joanestebanr
Copy link
Copy Markdown
Collaborator

@claude

@claude
Copy link
Copy Markdown

claude bot commented Apr 16, 2026

Claude finished @joanestebanr's task in 2m 5s —— View job


Code Review for PR #1570

🚀 What's New

  • Improved aggsender recovery startup logic: when local state conflicts with AggLayer, the tool now treats AggLayer as authoritative and automatically deletes stale local certificates instead of blocking recovery
  • Extended backward-forward-let CLI with new staging-drill capabilities:
    • craft-cert command to build and sign malicious certificates for controlled testing
    • --no-db flag for send-cert to send certificates without persisting to aggsender DB
    • Enhanced fallback reconstruction from aggsender RPC, aggsender DB, bridge service, and override files
  • Comprehensive documentation update: converted large runbook into focused operator workflow, added detailed README and recovery procedure

🐛 Bug Fixes

None

📋 Config Updates

None

⚠️ Breaking Changes

None


Review Findings

🔴 Required Issues

1. Security: Missing rate limiting validation in craft-cert RPC retry logic
tools/backward_forward_let/craft_cert.go:453-505

The retry logic for fetching certificate data has exponential backoff but continues retrying for up to 6 attempts even when receiving non-retryable errors. After the first non-retryable error check at line 466, the code falls through to call GetCertificateBridgeExits which can also fail with non-retryable errors, but then the retry loop continues anyway.

if !isRetryableCraftCertFetchError(err) && !isRetryableCraftCertFetchError(lastErr) {
    return nil, lastErr
}

This condition at line 491 checks both errors, but if headerErr was non-retryable and we continued past line 478, we lose that information and may retry unnecessarily.

Recommendation: Store whether a non-retryable error was encountered and break immediately rather than continuing the loop.

2. Potential data loss in recovery: Delete operation uses MaybeDelete
aggsender/statuschecker/cert_status_checker.go:284

if err := c.storage.DeleteCertificate(nil, action.height, db.MaybeDelete); err != nil {

The DeleteCertificate call uses db.MaybeDelete mode, which may silently skip the deletion if the certificate doesn't exist. This could mask issues where the recovery logic expects a certificate to be present but it's already gone, potentially leaving the system in an inconsistent state.

Recommendation: Consider using a stricter delete mode that returns an error if the certificate doesn't exist, or add explicit logging when MaybeDelete skips a deletion.


🟡 Suggested Improvements

3. Code Quality: Inconsistent error handling in getStoredBridgeExitsForHeight
tools/backward_forward_let/craft_cert.go:452-508

The function has complex nested retry logic with multiple fallback paths. The flow between attempting GetCertificateHeaderPerHeight (with retries) and falling back to GetCertificateBridgeExits (also with retries) is hard to follow. Both paths share the same backoff timer and attempt counter.

Recommendation: Refactor into separate functions for each fallback source (RPC header, RPC exits) with their own retry logic. This would improve maintainability and make the fallback chain more explicit.

4. Performance: Redundant L2 bridge calls in craft-cert
tools/backward_forward_let/craft_cert.go:387-409

The currentBridgeMatchesSettled function makes two on-chain calls (GetRoot and DepositCount) just to check if we can use bridge service as a shortcut. If the check fails, we then proceed to make additional RPC calls. For networks with high certificate volumes, this adds latency.

Recommendation: Consider caching the bridge state or adding a CLI flag to skip this optimization check when it's known the bridge has diverged.

5. Test Coverage: Missing test for InError certificate with existing InError certs
aggsender/statuschecker/cert_status_checker.go:163-172

The new code at lines 163-172 handles the case where there are no pending certs but InError certs exist that need retry. However, cert_status_checker_test.go doesn't have a specific test case for this scenario.

Recommendation: Add a test case that verifies the behavior when checkPendingCertificatesStatus finds no pending certs but discovers InError certs that should trigger a retry.

6. Documentation: craft-cert command security warning could be more prominent
tools/backward_forward_let/craft_cert.go:59-61

The --staging-only acknowledgement requirement is good, but the security implications of crafting malicious certificates could be stated more explicitly in both the code comment and the README.

Recommendation: Add a prominent warning comment block above RunCraftCert explaining why this command is dangerous and should only be used in controlled staging environments.


🟢 Optional Enhancements

7. Code Style: Magic number for L1InfoTreeLeafCount default
tools/backward_forward_let/craft_cert.go:31

const defaultL1InfoTreeLeafCount uint32 = 1

This constant is used as a fallback when the L1InfoTreeLeafCount cannot be determined from existing certificates. A comment explaining why 1 is the appropriate default would be helpful.

8. Naming: naAgglayerHeader variable naming
aggsender/statuschecker/cert_status_checker.go:17,332

var naAgglayerHeader = "na/agglayer header"

The variable name naAgglayerHeader is cryptic. Consider renaming to something more descriptive like agglayerSourcedCertificatePlaceholder or missingSignedCertificateMarker.

9. Error messages: Add context about AggLayer as source of truth
aggsender/statuschecker/initial_state.go:219-232

The warning messages correctly state "agglayer is the source of truth," but the regular message doesn't explain why we're deleting the local cert. Consider enhancing the non-warning message to clarify this is expected recovery behavior.


Test Coverage Analysis

Excellent test coverage for the new functionality:

  • initial_state_test.go: Comprehensive test suite covering 17+ recovery scenarios including inconsistencies, edge cases, and multi-certificate jumps
  • craft_cert_test.go: 14 test cases covering certificate crafting, fallback mechanisms, retries, and RPC timeout handling
  • send_cert_test.go: 13 test cases covering send certificate logic, error paths, and DB storage

The tests use proper mocking and cover both happy paths and error scenarios. SonarQube reports 86.3% coverage on new code, which is excellent.

Minor gap: The additional InError certificate detection logic (lines 163-172 in cert_status_checker.go) could use a dedicated test case.


Documentation Quality

Significantly improved documentation structure:

  • Old runbook (918 lines) reduced to focused operator workflow (147 lines)
  • New comprehensive README.md (313 lines) with clear command documentation
  • Detailed RECOVERY_PROCEDURE.md (271 lines) for fallback scenarios
  • Documentation matches the implemented behavior

The new structure separates concerns better: the runbook focuses on "when and why," the README on "what and how," and the recovery procedure on "manual fallbacks."


Security Considerations

🔒 The craft-cert command is properly gated behind --staging-only acknowledgement flag, but consider:

  1. Adding rate limiting or additional safeguards to prevent accidental production use
  2. Logging all craft-cert invocations to an audit trail
  3. Adding a config-level feature flag to disable craft-cert entirely in production builds

Architecture Notes

The recovery logic changes represent a philosophical shift: AggLayer is now the authoritative source of truth. This is the correct approach and aligns with the system's intended design. The old behavior (failing recovery on conflict) was too conservative and left operators in a stuck state.

The new "delete stale local certs" behavior is more operationally friendly but does mean that local state can be silently overwritten. This is acceptable given that:

  1. Warnings are logged when this happens
  2. AggLayer state is what determines settlement on L1
  3. The alternative (manual intervention) was error-prone

Summary

This is a well-implemented PR with strong test coverage and significantly improved documentation. The core functionality correctly implements the "AggLayer as source of truth" recovery model. The main concerns are:

  1. 🔴 Retry logic in craft-cert could be tightened
  2. 🔴 Delete mode in recovery should be validated more strictly
  3. 🟡 Refactoring opportunities in the fallback chain
  4. 🟡 Minor test gaps for edge cases

The changes are production-ready with the recommended fixes for the 🔴 items.


@joanestebanr
Copy link
Copy Markdown
Collaborator

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens AggSender startup recovery against stale/conflicting local certificate state (treating AggLayer as authoritative), and extends the backward-forward-let tooling/runbooks to support staging drills (including crafting and sending certificates without persisting them locally).

Changes:

  • AggSender startup recovery now deletes/replaces conflicting local cert entries instead of failing, syncing local state back to AggLayer.
  • backward-forward-let adds craft-cert (staging-gated) and enhances send-cert with --no-db; diagnosis/reporting logic is refined for partial results.
  • Documentation/runbooks are updated to reflect the operator workflow and fallback procedures when aggsender exit data is unavailable.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tools/backward_forward_let/types.go Adds a helper for “complete + no divergence” diagnosis checks.
tools/backward_forward_let/send_cert.go Adds --no-db path to send certs without persisting in aggsender DB.
tools/backward_forward_let/send_cert_test.go Adds coverage for --no-db and CLI flag wiring.
tools/backward_forward_let/run.go Uses the new “complete no divergence” helper to avoid false “nothing to do”.
tools/backward_forward_let/diagnosis.go Adjusts printing flow and changes bridge-service NotFound handling to fail fast.
tools/backward_forward_let/diagnosis_test.go Adds tests for partial results and updates bridge-service NotFound expectations.
tools/backward_forward_let/craft_cert.go Introduces craft-cert command implementation and multi-source exit reconstruction logic.
tools/backward_forward_let/craft_cert_test.go Adds comprehensive tests for crafted cert behavior and fallback reconstruction paths.
tools/backward_forward_let/config.go Extends tool config to reuse AggSender.AggsenderPrivateKey for craft-cert signing.
tools/backward_forward_let/cmd/main.go Registers craft-cert and adds --no-db to send-cert.
tools/backward_forward_let/RECOVERY_PROCEDURE.md Documents admin API auth header option and operational staging notes.
tools/backward_forward_let/README.md Adds full tool/command documentation including staging drill flows.
docs/backward_forward_let_runbook.md Rewrites runbook to a tool-driven operator workflow and fallback steps.
aggsender/statuschecker/initial_state.go Adds delete-local-cert recovery actions and makes reconciliation permissive.
aggsender/statuschecker/initial_state_test.go Updates expectations to validate new recovery actions (including delete/insert sequencing).
aggsender/statuschecker/cert_status_checker.go Executes the new delete-local-cert action and surfaces warnings.
aggsender/statuschecker/cert_status_checker_test.go Adds coverage for executing delete-local-cert actions.
.gitignore Ignores debug/ and bin/ directories.
Comments suppressed due to low confidence (1)

tools/backward_forward_let/cmd/main.go:104

  • The send-cert command Usage string still says it records the cert in the aggsender DB, but this PR makes DB storage optional via --no-db. Update the help text to reflect the new behavior, and consider validating flag combinations (e.g., require exactly one of --db-path or --no-db, and error if both are set) to avoid surprising no-op/ignored flags.
			Name:  "send-cert",
			Usage: "Send a certificate to the agglayer and record it in the aggsender DB",
			Flags: []cli.Flag{
				&cli.StringFlag{
					Name:  "cert-json",
					Usage: "Certificate JSON string (mutually exclusive with --cert-file)",

return nil, fmt.Errorf("parse --amount %q as decimal big.Int", c.String("amount"))
}

originTokenAddr := common.HexToAddress(c.String("origin-token-address"))
Comment on lines +551 to +553
func makeFakeBridgeExit(opts *craftCertOptions, exitIndex int) *agglayertypes.BridgeExit {
addrBytes := crypto.Keccak256(append(append([]byte(nil), opts.nonce...), byte(exitIndex)))
return &agglayertypes.BridgeExit{
return nil, err
}
if cert == nil {
return nil, fmt.Errorf("certificate not found")
Comment on lines +510 to +528
func callCraftCertRPCWithTimeout[T any](fn func() (T, error)) (T, error) {
type result struct {
value T
err error
}

resultCh := make(chan result, 1)
go func() {
value, err := fn()
resultCh <- result{value: value, err: err}
}()

select {
case result := <-resultCh:
return result.value, result.err
case <-time.After(craftCertRPCRequestTimeout):
var zero T
return zero, fmt.Errorf("aggsender RPC request timed out after %s", craftCertRPCRequestTimeout)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants