
Conversation

@pthmas (Contributor) commented Jan 16, 2026

Overview

  • Adds a pruning mechanism to the store:
    • Prunes /h/{height}, /d/{height}, /c/{height}, /i/{hash} up to a target height.
    • Leaves /s/{height} (state) and /t (current height) untouched.
    • Tracks progress via a metadata key last-pruned-block-height so pruning is monotonic and idempotent.
  • Integrates pruning into block production:
    • After a block is committed, at configurable intervals, the executor computes a prune target and calls store.PruneBlocks synchronously.
  • Introduces a Pruner interface (PruneBlocks).
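
For orientation, a minimal sketch of what that interface likely looks like, inferred from the PruneBlocks(ctx, height) call sites quoted in the review below; the authoritative definition is in pkg/store/types.go:

// Pruner removes historical block data (headers, data, signatures, hash index)
// up to and including a target height. State snapshots (/s/{height}) and the
// current height key (/t) are not touched.
type Pruner interface {
    PruneBlocks(ctx context.Context, height uint64) error
}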

Config

New node config fields / flags:

  • --evnode.node.pruning_enabled
  • --evnode.node.pruning_keep_recent
  • --evnode.node.pruning_interval

Pruning actually runs only when all three are set to non-trivial values (enabled, keep_recent > 0, interval > 0).
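
A minimal sketch of that gate plus the prune-target computation described in the Overview; pruneTarget is a hypothetical helper written to mirror the flags above, not the actual executor code:

// pruneTarget returns the height to prune up to, or false when this block
// height should not trigger pruning.
func pruneTarget(enabled bool, keepRecent, interval, height uint64) (uint64, bool) {
    if !enabled || keepRecent == 0 || interval == 0 {
        return 0, false // pruning effectively disabled
    }
    if height%interval != 0 || height <= keepRecent {
        return 0, false // not a pruning height, or nothing old enough yet
    }
    return height - keepRecent, true
}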

Design trade-offs

  • Runs in the critical path:
    PruneBlocks is called synchronously in produceBlock. On pruning heights, block production incurs extra latency that grows with the amount of data pruned and depends on underlying storage performance.

  • Best-effort pruning:
    If pruning fails, we log the error but do not fail the block. This avoids halting the node but means disk usage can grow silently if pruning keeps failing; operators will need to watch logs/metrics.

@github-actions bot (Contributor)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Jan 16, 2026, 9:38 AM

@claude bot (Contributor) commented Jan 16, 2026

Claude finished @pthmas's task in 2m 5s.


Comprehensive Code Review

I've completed a detailed review of the block pruning implementation. Overall, this is a well-structured implementation with good test coverage. Below are my findings organized by category, with critical issues highlighted first.


Critical Issues

1. Race Condition Risk in PruneBlocks (store.go:283-354)

Severity: High
Location: pkg/store/store.go:283-354

The PruneBlocks method runs synchronously in the critical path during block production (executor.go:443), but there's no explicit locking to prevent concurrent reads while pruning is happening. If another goroutine attempts to read block data that's currently being pruned, it could get inconsistent results.

Issue:

  • PruneBlocks creates a batch and deletes multiple keys (lines 309-341)
  • Between batch creation and commit, other goroutines could still read the data
  • The datastore's Batch interface doesn't guarantee isolation until commit

Recommendation:
Add documentation clarifying the concurrency assumptions, or consider adding a read-write lock if concurrent access is possible. At minimum, add a comment explaining that the caller must ensure no concurrent reads to heights being pruned.

// PruneBlocks removes block data (header, data, signature, and hash index)
// up to and including the given height from the store.
//
// CONCURRENCY: This method assumes exclusive access to the store during pruning.
// Callers must ensure no concurrent reads/writes occur for heights being pruned.
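
If the lock route is taken instead of (or alongside) the documentation, one way to picture it is a thin wrapper around the store. A sketch only; the type and method names here are hypothetical, and the real fix would more likely live inside the store itself:

// blockPruner mirrors the PruneBlocks call sites quoted in this review.
type blockPruner interface {
    PruneBlocks(ctx context.Context, height uint64) error
}

// lockedStore serializes pruning against reads with a sync.RWMutex instead of
// relying on a documented exclusivity assumption. Illustrative only.
type lockedStore struct {
    mtx   sync.RWMutex
    inner blockPruner
}

func (s *lockedStore) PruneBlocks(ctx context.Context, height uint64) error {
    s.mtx.Lock() // exclusive: readers cannot observe a half-pruned height
    defer s.mtx.Unlock()
    return s.inner.PruneBlocks(ctx, height)
}

// readLocked is how read paths would take the shared lock in this sketch.
func (s *lockedStore) readLocked(fn func() error) error {
    s.mtx.RLock()
    defer s.mtx.RUnlock()
    return fn()
}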

High-Priority Issues

2. Silent Failure in Best-Effort Pruning (executor.go:443-446)

Severity: High
Location: block/internal/executing/executor.go:443-446

The PR description states: "If pruning fails, we log the error but do not fail the block." This is a double-edged sword:

Concerns:

  • Repeated pruning failures will cause unbounded disk growth
  • Only logged at ERROR level with no metrics/alerting
  • No circuit breaker to disable pruning after repeated failures
  • Operators may not notice until disk is full

Recommendation:

  1. Add a metric to track pruning failures (e.g., pruning_failures_total)
  2. Consider adding a circuit breaker after N consecutive failures
  3. Add periodic warnings in logs if pruning has been failing for extended time
if err := e.store.PruneBlocks(e.ctx, targetHeight); err != nil {
    e.logger.Error().Err(err).Uint64("target_height", targetHeight).Msg("failed to prune old block data")
    e.metrics.PruningFailures.Inc() // Add this metric
    // Consider: disable pruning after N consecutive failures?
}

3. Missing Validation for Pruning Configuration (config.go:251-292)

Severity: Medium-High
Location: pkg/config/config.go:251-292

The Validate() method doesn't validate the pruning configuration parameters. Several edge cases are not caught:

Missing Validations:

  • PruningInterval can be 0 even when PruningEnabled is true
  • PruningKeepRecent of 1 is dangerous (would prune almost everything)
  • No validation that PruningKeepRecent is reasonable relative to DA submission delays

Recommendation:
Add validation in config.go:251-292:

// Validate pruning configuration
if c.Node.PruningEnabled {
    if c.Node.PruningInterval == 0 {
        return fmt.Errorf("pruning_interval must be >= 1 when pruning is enabled")
    }
    if c.Node.PruningKeepRecent == 0 {
        return fmt.Errorf("pruning_keep_recent must be > 0 when pruning is enabled (or disable pruning)")
    }
    if c.Node.PruningKeepRecent < 10 {
        return fmt.Errorf("pruning_keep_recent must be >= 10 to avoid premature data loss")
    }
}

4. Incomplete State Pruning Documentation (types.go:85-94)

Severity: Medium
Location: pkg/store/types.go:85-94

The Pruner interface documentation states it doesn't modify state snapshots, but this creates a significant disk usage issue that's not clearly communicated:

Issue:

  • Block data (headers, data, signatures) gets pruned
  • State snapshots at /s/{height} are NOT pruned
  • Over time, state snapshots will dominate disk usage
  • No mechanism exists to prune old state snapshots

Recommendation:

  1. Update documentation to explicitly warn about state snapshot accumulation
  2. Consider future work to add state snapshot pruning (separate from block pruning)
  3. Add comments in executor.go explaining why state is not pruned

Medium-Priority Issues

5. Error Handling Inconsistency in PruneBlocks (store.go:324-341)

Severity: Medium
Location: pkg/store/store.go:324-341

Error handling for batch deletions is inconsistent:

if err := batch.Delete(ctx, ds.NewKey(getDataKey(h))); err != nil {
    if !errors.Is(err, ds.ErrNotFound) {  // Ignores NotFound
        return fmt.Errorf("failed to delete data at height %d during pruning: %w", h, err)
    }
}

Issue:

  • Header deletion (line 320) returns the error immediately, without checking for ErrNotFound
  • Data, signature, and index deletions (lines 324-341) ignore ErrNotFound
  • This inconsistency suggests the header is assumed to always exist, but that assumption isn't documented

Recommendation:
Make error handling consistent and document assumptions:

// Get header blob to compute the hash index key. If header is already
// missing (e.g. due to previous partial pruning), just skip this height.
headerBlob, err := s.db.Get(ctx, ds.NewKey(getHeaderKey(h)))
if err != nil {
    if errors.Is(err, ds.ErrNotFound) {
        continue  // Already pruned or never existed
    }
    return fmt.Errorf("failed to get header at height %d during pruning: %w", h, err)
}

6. Synchronous Pruning Adds Latency to Block Production (executor.go:433-448)

Severity: Medium
Location: block/internal/executing/executor.go:433-448

The PR description acknowledges this: "PruneBlocks is called synchronously in produceBlock. On pruning heights, block production will incur extra latency."

Quantification:

  • Pruning 1000 blocks with 4 keys each (header, data, signature, index) means 4000 deletions
  • Even at 1 ms per deletion, that's 4+ seconds of latency
  • This could cause block time violations and consensus issues

Recommendation:

  1. Add timing metrics specifically for pruning operations
  2. Consider making pruning asynchronous with a background goroutine
  3. Add configuration to limit batch size (prune at most N blocks per interval)
// Limit pruning batch size to avoid blocking too long
const maxPrunePerInterval = 100
if newHeight > e.config.Node.PruningKeepRecent {
    targetHeight := newHeight - e.config.Node.PruningKeepRecent
    
    // Get last pruned height
    lastPruned := e.getLastPrunedHeight(ctx)
    
    // Limit pruning to avoid long blocking
    if targetHeight - lastPruned > maxPrunePerInterval {
        targetHeight = lastPruned + maxPrunePerInterval
    }
    
    if err := e.store.PruneBlocks(e.ctx, targetHeight); err != nil {
        e.logger.Error().Err(err).Uint64("target_height", targetHeight).Msg("failed to prune old block data")
    }
}

7. Missing Test for Partial Pruning Failures (store_test.go)

Severity: Medium
Location: pkg/store/store_test.go

The test TestPruneBlocks_RemovesOldBlockDataOnly (line 737) covers the happy path, but doesn't test:

  • What happens if batch.Commit() fails mid-way?
  • What if some deletions succeed but others fail?
  • What if the metadata update fails but deletions succeed?

Recommendation:
Add a test case using the mock datastore to simulate partial failures:

func TestPruneBlocks_PartialFailure(t *testing.T) {
    // Test that pruning is idempotent when partially completed
    // Simulate failure after some deletions but before commit
}

Low-Priority Issues

8. Magic Number for Height Length (store.go:356)

Severity: Low
Location: pkg/store/store.go:356

const heightLength = 8

This is fine, but could benefit from a comment explaining why 8 (uint64 encoding).

Recommendation:

const heightLength = 8 // size of uint64 in bytes

9. Typo in PR Title

Severity: Low
Location: PR Title

"Block Prunning" should be "Block Pruning" (single 'n')


Code Quality & Architecture

Positive Aspects ✅

  1. Clean Interface Design: The Pruner interface is well-defined and properly separated from Reader and Rollback
  2. Good Idempotency: The implementation properly tracks LastPrunedBlockHeightKey to make pruning idempotent
  3. Proper Metadata Tracking: Using metadata keys for tracking pruning progress is the right approach
  4. Comprehensive Tests: Good test coverage including edge cases for rollback with DA height validation

Architecture Concerns

Monotonic Pruning Assumption:
The design assumes pruning always moves forward (monotonic). This is correct, but consider documenting what happens if an operator wants to "undo" pruning (e.g., restore from backup with higher blocks). The metadata would need manual cleanup.
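
If that manual cleanup is ever needed, it amounts to rewinding the last-pruned watermark. A sketch only: it assumes the store exposes a SetMetadata(ctx, key, value) helper and that the height is stored as an 8-byte big-endian value under the last-pruned-block-height key from the PR description; both are assumptions to verify against pkg/store before use.

// metadataSetter is the minimal capability this sketch assumes the store has;
// check the real store's metadata API before relying on anything like this.
type metadataSetter interface {
    SetMetadata(ctx context.Context, key string, value []byte) error
}

// resetLastPruned rewinds the pruning watermark after restoring from a backup
// that contains blocks above the previously pruned height. The key name and
// the 8-byte big-endian encoding (cf. heightLength = 8) are assumptions here.
// Requires the encoding/binary import.
func resetLastPruned(ctx context.Context, s metadataSetter, height uint64) error {
    buf := make([]byte, 8)
    binary.BigEndian.PutUint64(buf, height)
    return s.SetMetadata(ctx, "last-pruned-block-height", buf)
}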


Performance & Resource Efficiency

Batch Operation Efficiency

The implementation batches all deletions for a height range, which is good. However:

Observation:

  • Large pruning ranges (e.g., initial pruning of 10,000 blocks) create very large batches
  • Some datastore implementations may have batch size limits
  • No backpressure mechanism if batch becomes too large

Recommendation:
Consider chunking large pruning operations:

const maxPruneBlocksPerBatch = 1000

for h := lastPruned + 1; h <= height; h++ {
    // ... deletion logic ...
    
    // Commit every N blocks to avoid huge batches
    if (h - lastPruned) % maxPruneBlocksPerBatch == 0 {
        if err := batch.Commit(ctx); err != nil {
            return fmt.Errorf("failed to commit pruning batch: %w", err)
        }
        // Create new batch for next chunk
        batch, err = s.db.Batch(ctx)
        if err != nil {
            return fmt.Errorf("failed to create new batch: %w", err)
        }
    }
}

Security

No Critical Security Issues Found ✅

The pruning implementation doesn't introduce security vulnerabilities:

  • Properly validates height parameters
  • Uses safe metadata keys
  • Doesn't expose sensitive data
  • Correctly preserves state data (not pruned)

Minor Security Note

The silent failure mode (best-effort pruning) could theoretically be exploited if an attacker can cause persistent pruning failures, leading to disk exhaustion DoS. However, this would require the attacker to already have compromised the datastore layer, which is a more fundamental breach.


Testing & Reliability

Test Coverage Assessment

Well-Covered:

  • ✅ Basic pruning removes old blocks
  • ✅ State snapshots are preserved
  • ✅ Idempotency (calling with same height is no-op)
  • ✅ Handling of missing headers during pruning

Missing Coverage:

  1. ❌ Pruning interval logic in executor (lines 437-448)
  2. ❌ Concurrent read during pruning
  3. ❌ Large range pruning (10,000+ blocks)
  4. ❌ Batch commit failures
  5. ❌ Metrics recording for pruning operations

Recommendation:
Add an integration test in test/integration/ that does the following (a sketch of the interval arithmetic to assert follows the list):

  • Produces 1000 blocks
  • Enables pruning with keep_recent=100, interval=50
  • Verifies pruning happens at correct intervals
  • Verifies blocks are actually deleted
  • Measures pruning latency impact
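
The full test needs the node harness, but the keep_recent/interval arithmetic it must assert can be pinned down in isolation first. A minimal, self-contained sketch; pruneTargetSpec is a test-local stand-in that re-states the rule from the PR description, not the executor's actual code:

func pruneTargetSpec(keepRecent, interval, height uint64) (uint64, bool) {
    if keepRecent == 0 || interval == 0 || height%interval != 0 || height <= keepRecent {
        return 0, false
    }
    return height - keepRecent, true
}

func TestPruneTargetSpec(t *testing.T) {
    cases := []struct {
        name                         string
        keepRecent, interval, height uint64
        wantTarget                   uint64
        wantOK                       bool
    }{
        {"not an interval height", 100, 50, 151, 0, false},
        {"interval height, nothing old enough", 100, 50, 100, 0, false},
        {"first pruning height", 100, 50, 150, 50, true},
        {"later pruning height", 100, 50, 1000, 900, true},
    }
    for _, c := range cases {
        got, ok := pruneTargetSpec(c.keepRecent, c.interval, c.height)
        if ok != c.wantOK || got != c.wantTarget {
            t.Errorf("%s: got (%d, %v), want (%d, %v)", c.name, got, ok, c.wantTarget, c.wantOK)
        }
    }
}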

Documentation & Developer Experience

Documentation Gaps

  1. No operator documentation: How should operators choose the pruning_keep_recent value? What are the disk space implications?

  2. No migration guide: What happens when enabling pruning on existing nodes?

  3. No metrics documentation: What metrics should operators monitor?

Recommendation:
Add operator guide section to CLAUDE.md or create docs/pruning.md:

## Block Pruning Configuration

### Choosing pruning_keep_recent

The `pruning_keep_recent` value determines how many recent blocks to keep:

- Minimum: 10 blocks (safety margin)
- Recommended: 1000 blocks for full nodes
- For archive nodes: Disable pruning

Consider:
- DA submission delays (blocks must exist until DA confirmed)
- Peer sync requests (peers may request recent blocks)
- RPC query requirements

### Enabling Pruning on Existing Nodes

1. Stop the node
2. Edit config to set:
   - pruning_enabled: true
   - pruning_keep_recent: 1000
   - pruning_interval: 100
3. Restart node
4. First pruning will happen at next interval

Summary & Recommendations

Must Fix Before Merge

  1. ⚠️ Add pruning configuration validation (Issue #3)
  2. ⚠️ Add pruning failure metrics (Issue #2)
  3. ⚠️ Document concurrency assumptions (Issue #1)

Should Fix Before Merge

  1. Add batch size limiting to prevent huge transactions (Performance section)
  2. Add integration test for pruning interval logic
  3. Fix error handling inconsistency (Issue #5)

Nice to Have

  1. Add operator documentation
  2. Add timing metrics for pruning operations
  3. Consider async pruning in future PR
  4. Fix typo in PR title

Overall Assessment

This is a solid implementation of a much-needed feature. The code quality is good, the interface design is clean, and there's reasonable test coverage. The main concerns are:

  1. Operational safety: Silent failures could lead to disk exhaustion
  2. Performance impact: Synchronous pruning adds latency to block production
  3. Configuration validation: Edge cases not prevented

With the recommended fixes, particularly around validation and metrics, this will be production-ready.

Recommendation: Request changes for critical issues (#1-3), then approve once addressed.


@julienrbrt (Member) commented Jan 16, 2026

Given we have a lot of different stores and keys, are we planning to prune those as well:

(and eventually ev-abci store)

It will be inconsistent if we don't allow height pruning of all those stores (at once).
