
Conversation

@pthmas (Contributor) commented Jan 16, 2026

Overview

  • Adds a pruning mechanism to the store:
    • Prunes /h/{height}, /d/{height}, /c/{height}, /i/{hash} up to a target height.
    • Leaves /s/{height} (state) and /t (current height) untouched.
    • Tracks progress via a metadata key last-pruned-block-height so pruning is monotonic and idempotent.
  • Integrates pruning into block production:
    • After a block is committed, at configurable intervals, the executor computes a prune target and calls store.PruneBlocks synchronously.
  • Introduces a Pruner interface (PruneBlocks).
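
For orientation, a minimal sketch of what that interface likely looks like, inferred from the PruneBlocks(ctx, height) call sites quoted in the review below; the authoritative definition is in pkg/store/types.go:

// Pruner removes historical block data (headers, data, signatures, hash index)
// up to and including a target height. State snapshots (/s/{height}) and the
// current height key (/t) are not touched.
type Pruner interface {
    PruneBlocks(ctx context.Context, height uint64) error
}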

Config

New node config fields / flags:

  • --evnode.node.pruning_enabled
  • --evnode.node.pruning_keep_recent
  • --evnode.node.pruning_interval

Pruning actually runs only when all three are set to non-trivial values (enabled, keep_recent > 0, interval > 0).
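
A minimal sketch of that gate plus the prune-target computation described in the Overview; pruneTarget is a hypothetical helper written to mirror the flags above, not the actual executor code:

// pruneTarget returns the height to prune up to, or false when this block
// height should not trigger pruning.
func pruneTarget(enabled bool, keepRecent, interval, height uint64) (uint64, bool) {
    if !enabled || keepRecent == 0 || interval == 0 {
        return 0, false // pruning effectively disabled
    }
    if height%interval != 0 || height <= keepRecent {
        return 0, false // not a pruning height, or nothing old enough yet
    }
    return height - keepRecent, true
}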

Design trade-offs

  • Runs in the critical path:
    PruneBlocks is called synchronously in produceBlock. On pruning heights, block production incurs extra latency that grows with the amount of data pruned and depends on underlying storage performance.

  • Best-effort pruning:
    If pruning fails, we log the error but do not fail the block. This avoids halting the node but means disk usage can grow silently if pruning keeps failing; operators will need to watch logs/metrics.

@github-actions bot (Contributor)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Jan 16, 2026, 9:38 AM

@claude bot (Contributor) commented Jan 16, 2026

Claude finished @pthmas's task in 2m 5s.


Comprehensive Code Review

I've completed a detailed review of the block pruning implementation. Overall, this is a well-structured implementation with good test coverage. Below are my findings organized by category, with critical issues highlighted first.


Critical Issues

1. Race Condition Risk in PruneBlocks (store.go:283-354)

Severity: High
Location: pkg/store/store.go:283-354

The PruneBlocks method runs synchronously in the critical path during block production (executor.go:443), but there's no explicit locking to prevent concurrent reads while pruning is happening. If another goroutine attempts to read block data that's currently being pruned, it could get inconsistent results.

Issue:

  • PruneBlocks creates a batch and deletes multiple keys (lines 309-341)
  • Between batch creation and commit, other goroutines could still read the data
  • The datastore's Batch interface doesn't guarantee isolation until commit

Recommendation:
Add documentation clarifying the concurrency assumptions, or consider adding a read-write lock if concurrent access is possible. At minimum, add a comment explaining that the caller must ensure no concurrent reads to heights being pruned.

// PruneBlocks removes block data (header, data, signature, and hash index)
// up to and including the given height from the store.
//
// CONCURRENCY: This method assumes exclusive access to the store during pruning.
// Callers must ensure no concurrent reads/writes occur for heights being pruned.
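
If the lock route is taken instead of (or alongside) the documentation, one way to picture it is a thin wrapper around the store. A sketch only; the type and method names here are hypothetical, and the real fix would more likely live inside the store itself:

// blockPruner mirrors the PruneBlocks call sites quoted in this review.
type blockPruner interface {
    PruneBlocks(ctx context.Context, height uint64) error
}

// lockedStore serializes pruning against reads with a sync.RWMutex instead of
// relying on a documented exclusivity assumption. Illustrative only.
type lockedStore struct {
    mtx   sync.RWMutex
    inner blockPruner
}

func (s *lockedStore) PruneBlocks(ctx context.Context, height uint64) error {
    s.mtx.Lock() // exclusive: readers cannot observe a half-pruned height
    defer s.mtx.Unlock()
    return s.inner.PruneBlocks(ctx, height)
}

// readLocked is how read paths would take the shared lock in this sketch.
func (s *lockedStore) readLocked(fn func() error) error {
    s.mtx.RLock()
    defer s.mtx.RUnlock()
    return fn()
}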

High-Priority Issues

2. Silent Failure in Best-Effort Pruning (executor.go:443-446)

Severity: High
Location: block/internal/executing/executor.go:443-446

The PR description states: "If pruning fails, we log the error but do not fail the block." This is a double-edged sword:

Concerns:

  • Repeated pruning failures will cause unbounded disk growth
  • Only logged at ERROR level with no metrics/alerting
  • No circuit breaker to disable pruning after repeated failures
  • Operators may not notice until disk is full

Recommendation:

  1. Add a metric to track pruning failures (e.g., pruning_failures_total)
  2. Consider adding a circuit breaker after N consecutive failures
  3. Add periodic warnings in logs if pruning has been failing for extended time
if err := e.store.PruneBlocks(e.ctx, targetHeight); err != nil {
    e.logger.Error().Err(err).Uint64("target_height", targetHeight).Msg("failed to prune old block data")
    e.metrics.PruningFailures.Inc() // Add this metric
    // Consider: disable pruning after N consecutive failures?
}

3. Missing Validation for Pruning Configuration (config.go:251-292)

Severity: Medium-High
Location: pkg/config/config.go:251-292

The Validate() method doesn't validate the pruning configuration parameters. Several edge cases are not caught:

Missing Validations:

  • PruningInterval can be 0 even when PruningEnabled is true
  • PruningKeepRecent of 1 is dangerous (would prune almost everything)
  • No validation that PruningKeepRecent is reasonable relative to DA submission delays

Recommendation:
Add validation in config.go:251-292:

// Validate pruning configuration
if c.Node.PruningEnabled {
    if c.Node.PruningInterval == 0 {
        return fmt.Errorf("pruning_interval must be >= 1 when pruning is enabled")
    }
    if c.Node.PruningKeepRecent == 0 {
        return fmt.Errorf("pruning_keep_recent must be > 0 when pruning is enabled (or disable pruning)")
    }
    if c.Node.PruningKeepRecent < 10 {
        return fmt.Errorf("pruning_keep_recent must be >= 10 to avoid premature data loss")
    }
}

4. Incomplete State Pruning Documentation (types.go:85-94)

Severity: Medium
Location: pkg/store/types.go:85-94

The Pruner interface documentation states it doesn't modify state snapshots, but this creates a significant disk usage issue that's not clearly communicated:

Issue:

  • Block data (headers, data, signatures) gets pruned
  • State snapshots at /s/{height} are NOT pruned
  • Over time, state snapshots will dominate disk usage
  • No mechanism exists to prune old state snapshots

Recommendation:

  1. Update documentation to explicitly warn about state snapshot accumulation
  2. Consider future work to add state snapshot pruning (separate from block pruning)
  3. Add comments in executor.go explaining why state is not pruned

Medium-Priority Issues

5. Error Handling Inconsistency in PruneBlocks (store.go:324-341)

Severity: Medium
Location: pkg/store/store.go:324-341

Error handling for batch deletions is inconsistent:

if err := batch.Delete(ctx, ds.NewKey(getDataKey(h))); err != nil {
    if !errors.Is(err, ds.ErrNotFound) {  // Ignores NotFound
        return fmt.Errorf("failed to delete data at height %d during pruning: %w", h, err)
    }
}

Issue:

  • Header deletion (line 320) returns the error immediately, without checking for ErrNotFound
  • Data, signature, and index deletions (lines 324-341) ignore ErrNotFound
  • This inconsistency suggests the header is assumed to always exist, but that assumption isn't documented

Recommendation:
Make error handling consistent and document assumptions:

// Get header blob to compute the hash index key. If header is already
// missing (e.g. due to previous partial pruning), just skip this height.
headerBlob, err := s.db.Get(ctx, ds.NewKey(getHeaderKey(h)))
if err != nil {
    if errors.Is(err, ds.ErrNotFound) {
        continue  // Already pruned or never existed
    }
    return fmt.Errorf("failed to get header at height %d during pruning: %w", h, err)
}

6. Synchronous Pruning Adds Latency to Block Production (executor.go:433-448)

Severity: Medium
Location: block/internal/executing/executor.go:433-448

The PR description acknowledges this: "PruneBlocks is called synchronously in produceBlock. On pruning heights, block production will incur extra latency."

Quantification:

  • Pruning 1000 blocks with 4 keys each (header, data, signature, index) means 4000 deletions
  • Even at 1 ms per deletion, that's 4+ seconds of latency
  • This could cause block time violations and consensus issues

Recommendation:

  1. Add timing metrics specifically for pruning operations
  2. Consider making pruning asynchronous with a background goroutine
  3. Add configuration to limit batch size (prune at most N blocks per interval)
// Limit pruning batch size to avoid blocking too long
const maxPrunePerInterval = 100
if newHeight > e.config.Node.PruningKeepRecent {
    targetHeight := newHeight - e.config.Node.PruningKeepRecent
    
    // Get last pruned height
    lastPruned := e.getLastPrunedHeight(ctx)
    
    // Limit pruning to avoid long blocking
    if targetHeight - lastPruned > maxPrunePerInterval {
        targetHeight = lastPruned + maxPrunePerInterval
    }
    
    if err := e.store.PruneBlocks(e.ctx, targetHeight); err != nil {
        e.logger.Error().Err(err).Uint64("target_height", targetHeight).Msg("failed to prune old block data")
    }
}

7. Missing Test for Partial Pruning Failures (store_test.go)

Severity: Medium
Location: pkg/store/store_test.go

The test TestPruneBlocks_RemovesOldBlockDataOnly (line 737) covers the happy path, but doesn't test:

  • What happens if batch.Commit() fails mid-way?
  • What if some deletions succeed but others fail?
  • What if the metadata update fails but deletions succeed?

Recommendation:
Add a test case using the mock datastore to simulate partial failures:

func TestPruneBlocks_PartialFailure(t *testing.T) {
    // Test that pruning is idempotent when partially completed
    // Simulate failure after some deletions but before commit
}

Low-Priority Issues

8. Magic Number for Height Length (store.go:356)

Severity: Low
Location: pkg/store/store.go:356

const heightLength = 8

This is fine, but could benefit from a comment explaining why 8 (uint64 encoding).

Recommendation:

const heightLength = 8 // size of uint64 in bytes

9. Typo in PR Title

Severity: Low
Location: PR Title

"Block Prunning" should be "Block Pruning" (single 'n')


Code Quality & Architecture

Positive Aspects ✅

  1. Clean Interface Design: The Pruner interface is well-defined and properly separated from Reader and Rollback
  2. Good Idempotency: The implementation properly tracks LastPrunedBlockHeightKey to make pruning idempotent
  3. Proper Metadata Tracking: Using metadata keys for tracking pruning progress is the right approach
  4. Comprehensive Tests: Good test coverage including edge cases for rollback with DA height validation

Architecture Concerns

Monotonic Pruning Assumption:
The design assumes pruning always moves forward (monotonic). This is correct, but consider documenting what happens if an operator wants to "undo" pruning (e.g., restore from backup with higher blocks). The metadata would need manual cleanup.
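
If that manual cleanup is ever needed, it amounts to rewinding the last-pruned watermark. A sketch only: it assumes the store exposes a SetMetadata(ctx, key, value) helper and that the height is stored as an 8-byte big-endian value under the last-pruned-block-height key from the PR description; both are assumptions to verify against pkg/store before use.

// metadataSetter is the minimal capability this sketch assumes the store has;
// check the real store's metadata API before relying on anything like this.
type metadataSetter interface {
    SetMetadata(ctx context.Context, key string, value []byte) error
}

// resetLastPruned rewinds the pruning watermark after restoring from a backup
// that contains blocks above the previously pruned height. The key name and
// the 8-byte big-endian encoding (cf. heightLength = 8) are assumptions here.
// Requires the encoding/binary import.
func resetLastPruned(ctx context.Context, s metadataSetter, height uint64) error {
    buf := make([]byte, 8)
    binary.BigEndian.PutUint64(buf, height)
    return s.SetMetadata(ctx, "last-pruned-block-height", buf)
}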


Performance & Resource Efficiency

Batch Operation Efficiency

The implementation batches all deletions for a height range, which is good. However:

Observation:

  • Large pruning ranges (e.g., initial pruning of 10,000 blocks) create very large batches
  • Some datastore implementations may have batch size limits
  • No backpressure mechanism if batch becomes too large

Recommendation:
Consider chunking large pruning operations:

const maxPruneBlocksPerBatch = 1000

for h := lastPruned + 1; h <= height; h++ {
    // ... deletion logic ...
    
    // Commit every N blocks to avoid huge batches
    if (h - lastPruned) % maxPruneBlocksPerBatch == 0 {
        if err := batch.Commit(ctx); err != nil {
            return fmt.Errorf("failed to commit pruning batch: %w", err)
        }
        // Create new batch for next chunk
        batch, err = s.db.Batch(ctx)
        if err != nil {
            return fmt.Errorf("failed to create new batch: %w", err)
        }
    }
}

Security

No Critical Security Issues Found ✅

The pruning implementation doesn't introduce security vulnerabilities:

  • Properly validates height parameters
  • Uses safe metadata keys
  • Doesn't expose sensitive data
  • Correctly preserves state data (not pruned)

Minor Security Note

The silent failure mode (best-effort pruning) could theoretically be exploited if an attacker can cause persistent pruning failures, leading to disk exhaustion DoS. However, this would require the attacker to already have compromised the datastore layer, which is a more fundamental breach.


Testing & Reliability

Test Coverage Assessment

Well-Covered:

  • ✅ Basic pruning removes old blocks
  • ✅ State snapshots are preserved
  • ✅ Idempotency (calling with same height is no-op)
  • ✅ Handling of missing headers during pruning

Missing Coverage:

  1. ❌ Pruning interval logic in executor (lines 437-448)
  2. ❌ Concurrent read during pruning
  3. ❌ Large range pruning (10,000+ blocks)
  4. ❌ Batch commit failures
  5. ❌ Metrics recording for pruning operations

Recommendation:
Add an integration test in test/integration/ that does the following (a sketch of the interval arithmetic to assert follows the list):

  • Produces 1000 blocks
  • Enables pruning with keep_recent=100, interval=50
  • Verifies pruning happens at correct intervals
  • Verifies blocks are actually deleted
  • Measures pruning latency impact
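
The full test needs the node harness, but the keep_recent/interval arithmetic it must assert can be pinned down in isolation first. A minimal, self-contained sketch; pruneTargetSpec is a test-local stand-in that re-states the rule from the PR description, not the executor's actual code:

func pruneTargetSpec(keepRecent, interval, height uint64) (uint64, bool) {
    if keepRecent == 0 || interval == 0 || height%interval != 0 || height <= keepRecent {
        return 0, false
    }
    return height - keepRecent, true
}

func TestPruneTargetSpec(t *testing.T) {
    cases := []struct {
        name                         string
        keepRecent, interval, height uint64
        wantTarget                   uint64
        wantOK                       bool
    }{
        {"not an interval height", 100, 50, 151, 0, false},
        {"interval height, nothing old enough", 100, 50, 100, 0, false},
        {"first pruning height", 100, 50, 150, 50, true},
        {"later pruning height", 100, 50, 1000, 900, true},
    }
    for _, c := range cases {
        got, ok := pruneTargetSpec(c.keepRecent, c.interval, c.height)
        if ok != c.wantOK || got != c.wantTarget {
            t.Errorf("%s: got (%d, %v), want (%d, %v)", c.name, got, ok, c.wantTarget, c.wantOK)
        }
    }
}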

Documentation & Developer Experience

Documentation Gaps

  1. No operator documentation: How should operators choose the pruning_keep_recent value? What are the disk space implications?

  2. No migration guide: What happens when enabling pruning on existing nodes?

  3. No metrics documentation: What metrics should operators monitor?

Recommendation:
Add operator guide section to CLAUDE.md or create docs/pruning.md:

## Block Pruning Configuration

### Choosing pruning_keep_recent

The `pruning_keep_recent` value determines how many recent blocks to keep:

- Minimum: 10 blocks (safety margin)
- Recommended: 1000 blocks for full nodes
- For archive nodes: Disable pruning

Consider:
- DA submission delays (blocks must exist until DA confirmed)
- Peer sync requests (peers may request recent blocks)
- RPC query requirements

### Enabling Pruning on Existing Nodes

1. Stop the node
2. Edit config to set:
   - pruning_enabled: true
   - pruning_keep_recent: 1000
   - pruning_interval: 100
3. Restart node
4. First pruning will happen at next interval

Summary & Recommendations

Must Fix Before Merge

  1. ⚠️ Add pruning configuration validation (Issue #3)
  2. ⚠️ Add pruning failure metrics (Issue #2)
  3. ⚠️ Document concurrency assumptions (Issue #1)

Should Fix Before Merge

  1. Add batch size limiting to prevent huge transactions (Performance section)
  2. Add integration test for pruning interval logic
  3. Fix error handling inconsistency (Issue #5)

Nice to Have

  1. Add operator documentation
  2. Add timing metrics for pruning operations
  3. Consider async pruning in future PR
  4. Fix typo in PR title

Overall Assessment

This is a solid implementation of a much-needed feature. The code quality is good, the interface design is clean, and there's reasonable test coverage. The main concerns are:

  1. Operational safety: Silent failures could lead to disk exhaustion
  2. Performance impact: Synchronous pruning adds latency to block production
  3. Configuration validation: Edge cases not prevented

With the recommended fixes, particularly around validation and metrics, this will be production-ready.

Recommendation: Request changes for critical issues (#1-3), then approve once addressed.


@julienrbrt (Member) commented Jan 16, 2026

Given we have a lot of different stores and keys, are we planning to prune those as well:

(and eventually ev-abci store)

It will be inconsistent if we don't allow height pruning of all those stores (at once).
