
Add periodic fallback pruning for stalled finalization #179

Open
pablodeymo wants to merge 8 commits into main from periodic-fallback-pruning

Conversation

@pablodeymo
Collaborator

Motivation

PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization
advances (finalized.slot > old_finalized_slot in update_checkpoints). If finalization stalls —
due to network issues, low participation, or other consensus failures — the States and
BlockHeaders/BlockBodies/BlockSignatures tables grow unbounded because the pruning condition
is never met.

This is adapted from Zeam's approach, which runs a periodic
fallback every 7200 slots (~8 hours) to prune non-canonical states when finalization is far behind.

Description

Adds a periodic pruning check in on_tick (interval 0) that calls the existing prune_old_states
and prune_old_blocks methods directly, bypassing the finalization-gated path in
update_checkpoints.

Trigger conditions (all must be true):

  1. slot > 0 (skip genesis)
  2. slot is a multiple of PRUNING_FALLBACK_INTERVAL_SLOTS (7200 slots = ~8 hours at 4s/slot)
  3. current_slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS (finalization is truly stalled)

If finalization is healthy, condition 3 is never met, so the normal pruning path in
update_checkpoints handles everything and this code is a no-op (two cheap integer comparisons
per slot-0 tick).
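
The three conditions can be sketched as a pure function. This is an illustrative sketch only: `should_fallback_prune` and the free-function form are hypothetical, only `PRUNING_FALLBACK_INTERVAL_SLOTS` comes from the PR, and the real check lives inline in `on_tick` (using `is_multiple_of` rather than `%`):

```rust
/// Hypothetical standalone form of the three trigger conditions.
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200; // ~8 hours at 4s/slot

fn should_fallback_prune(slot: u64, finalized_slot: u64) -> bool {
    slot > 0                                           // 1. skip genesis
        && slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0 // 2. on the interval
        && slot.saturating_sub(finalized_slot)
            > PRUNING_FALLBACK_INTERVAL_SLOTS          // 3. truly stalled
}
```

When finalization is healthy (lag ≤ 7200 slots), condition 3 fails and the whole check costs only the comparisons above.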

Changes

File Change
crates/storage/src/store.rs Add PRUNING_FALLBACK_INTERVAL_SLOTS constant (7200), add periodic_prune() method, add 2 tests
crates/storage/src/lib.rs Export PRUNING_FALLBACK_INTERVAL_SLOTS
crates/blockchain/src/store.rs Import constant, add periodic pruning check in interval 0 of on_tick

Safety

  • prune_old_states and prune_old_blocks are already idempotent and protect finalized/justified
    roots via the protected_roots parameter
  • periodic_prune passes [latest_finalized.root, latest_justified.root] as protected, matching
    the existing behavior in update_checkpoints
  • The periodic check adds negligible overhead: two integer comparisons per slot-0 tick

How to test

cargo test -p ethlambda-storage -- periodic_prune
cargo test --workspace --release
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check

New tests:

  • periodic_prune_removes_old_states_and_blocks — verifies pruning beyond retention window while
    protecting finalized/justified roots
  • periodic_prune_no_op_within_retention — verifies no pruning when within retention window
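
As a rough mental model of what the two tests check (a toy sketch, not the real `Store` API: the map and the `prune` function stand in for the States/Blocks tables and the `prune_old_states`/`prune_old_blocks` helpers):

```rust
use std::collections::BTreeMap;

type Root = u64;

/// Toy model of retention pruning: drop entries whose slot is below the
/// cutoff unless their root is protected (finalized/justified roots).
/// Returns how many entries were removed.
fn prune(states: &mut BTreeMap<Root, u64>, cutoff_slot: u64, protected: &[Root]) -> usize {
    let before = states.len();
    states.retain(|root, slot| *slot >= cutoff_slot || protected.contains(root));
    before - states.len()
}
```

In this model, `periodic_prune_removes_old_states_and_blocks` corresponds to a cutoff above some stored slots while protecting the finalized/justified roots, and `periodic_prune_no_op_within_retention` corresponds to a cutoff at or below every stored slot, which deletes nothing.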

MegaRedHand and others added 4 commits March 2, 2026 11:17
  When finalization stalls (e.g., network issues, low participation), the
  States and Block tables grow unbounded because retention-based pruning
  only triggers on finalization advancement. This adds a periodic check
  every 7200 slots (~8 hours) that calls prune_old_states and
  prune_old_blocks directly when finalization is far behind, adapted from
  Zeam's approach. The check is cheap (two integer comparisons per slot-0
  tick) and the actual pruning reuses existing idempotent methods.
@github-actions

github-actions bot commented Mar 3, 2026

🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues

  1. Race condition in on_tick (blockchain/src/store.rs:318-329): The check slot.saturating_sub(store.latest_finalized().slot) > PRUNING_FALLBACK_INTERVAL_SLOTS uses the finalized slot from the store, but this could change between the check and the actual pruning operation. This could lead to pruning when finalization has actually advanced.

  2. Missing synchronization in periodic_prune (storage/src/store.rs:430-440): The method reads latest_finalized() and latest_justified() separately, which could lead to inconsistent state if these values change between reads. Should read both atomically.

Security & Correctness Issues

  1. Incorrect pruning logic (storage/src/store.rs:430): The protected roots only include latest_finalized().root and latest_justified().root, but should also include all their ancestor blocks/states within the retention window. The current implementation might prune necessary historical data.

  2. Test setup issue (storage/src/store.rs:1458-1467): In the test periodic_prune_removes_old_states_and_blocks, the finalized checkpoint is set to slot 0 but the justified checkpoint is set to slot 1. This is an invalid state since justified should always be ≥ finalized.

Performance & Best Practices

  1. Redundant check (blockchain/src/store.rs:319): The slot > 0 check is redundant since slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS) will be false for slot 0 anyway.

  2. Logging level inconsistency: The warning log in on_tick uses warn! but the actual pruning in periodic_prune uses info!. For consistency and operator awareness, both should probably be at the same level.

  3. Magic number exposure: PRUNING_FALLBACK_INTERVAL_SLOTS is exported from storage but only used in blockchain. Consider keeping it private to storage and exposing a higher-level configuration method instead.

Suggested Fixes

// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}

// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}

Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

github-actions bot commented Mar 3, 2026

🤖 Codex Code Review

  1. High: periodic_prune can create dangling LiveChain entries by deleting non-finalized headers
  • periodic_prune only calls prune_old_states + prune_old_blocks (store.rs:430).
  • prune_old_blocks deletes from BlockHeaders/BlockBodies/BlockSignatures but not LiveChain (store.rs:654).
  • During finalization stalls, LiveChain is explicitly still the source of fork-choice candidates, so this can leave roots selectable by fork choice but missing headers/state metadata.
  • There are multiple expect(...) paths that assume header availability for key roots, e.g. store.rs:1101, store.rs:1108, blockchain/store.rs:652, blockchain/store.rs:656, blockchain/store.rs:668.
  • Suggested fix: when pruning blocks in stalled-finality mode, either prune corresponding LiveChain entries too, or avoid pruning any root still present in LiveChain.
  2. Medium: off-by-one in fallback trigger delays first prune by one full interval
  • Condition uses slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS (blockchain/store.rs:321).
  • With finalized slot 0, first prune runs at slot 14400 (16h), not 7200 (8h) as documented.
  • If intended behavior is “prune every 7200 slots once lag reaches 7200”, this should be >=.
  3. Medium (performance): full-table scan/sort on tick path
  • Each fallback run does full BlockHeaders scans + sort for both states and blocks (store.rs:569, store.rs:620).
  • Running this synchronously inside on_tick interval 0 (blockchain/store.rs:317) can cause latency spikes under large DBs.
  • Consider incremental/index-based pruning or bounded batch deletion per tick.
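
The off-by-one in finding 2 is easy to verify numerically. This is an illustrative sketch: `first_trigger_slot` is a hypothetical helper that just scans interval multiples for the first slot satisfying the lag condition under either comparison:

```rust
const INTERVAL: u64 = 7200; // stands in for PRUNING_FALLBACK_INTERVAL_SLOTS

/// First interval-multiple slot at which the lag condition holds, under
/// either the strict `>` comparison (current code) or `>=` (suggested fix).
fn first_trigger_slot(finalized_slot: u64, inclusive: bool) -> u64 {
    (1u64..)
        .map(|k| k * INTERVAL)
        .find(|&slot| {
            let lag = slot.saturating_sub(finalized_slot);
            if inclusive { lag >= INTERVAL } else { lag > INTERVAL }
        })
        .expect("unbounded iterator always finds a trigger slot")
}
```

With finalized slot 0, the strict comparison first fires at slot 14400 (~16h) while `>=` fires at 7200 (~8h), matching the documented intent; the one-interval delay recurs whenever finalization stalls exactly at an interval boundary.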

Could not run tests in this environment due to rustup permission errors (/home/runner/.rustup/tmp not writable), so findings are from static review.


Automated review by OpenAI Codex · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: because prune_old_states / prune_old_blocks were only called when finalized.slot advanced, a prolonged consensus stall (network partition, low participation) allowed the States and BlockHeaders/BlockBodies/BlockSignatures tables to grow without bound. The fix wires a periodic fallback check into the existing on_tick interval-0 path and delegates to a new Store::periodic_prune() method that reuses the already-idempotent pruning helpers, protecting both the current finalized and justified roots.

Key points:

  • The three-condition guard (slot > 0, multiple of 7200, stall gap > 7200) is conservative by design: when finalization is healthy the check is a no-op (two comparisons + one metadata read per slot).
  • periodic_prune deliberately matches update_checkpoints's protected-roots set [finalized.root, justified.root], so no live checkpoint can be accidentally removed.
  • Two new unit tests cover the pruning and no-op paths at the Store level; the on_tick trigger conditions are straightforward.
  • One test comment contains misleading phrasing about block insertion counts that should be corrected for clarity.

Confidence Score: 4/5

  • Safe to merge; the periodic pruning implementation is conservative, reuses idempotent helpers, and the guard conditions prevent false positives when finalization is healthy.
  • The PR correctly implements fallback pruning for stalled finalization with conservative guard conditions. The three-part check ensures the code is a no-op when finalization is healthy. Protected-roots handling mirrors the existing update_checkpoints logic, so no valid checkpoint can be accidentally removed. The only finding is a misleading test comment that should be clarified for maintainability.
  • crates/storage/src/store.rs — test comment clarification

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- "No: finalization healthy" --> G
    D -- "Yes: finalization stalled" --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G

Last reviewed commit: d6c7958

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@github-actions

github-actions bot commented Mar 3, 2026

🤖 Claude Code Review

Here is my review of PR #179:


Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:


Notable: periodic_prune is narrower than update_checkpoints

periodic_prune (store.rs:430) only calls prune_old_states + prune_old_blocks. The update_checkpoints path also calls prune_gossip_signatures, prune_attestation_data_by_root, and prune_aggregated_payload_table for both aggregated tables.

GossipSignatures are deleted post-aggregation (per delete_gossip_signatures at blockchain/src/store.rs:206), so they don't accumulate across slots under normal stall conditions. However, if aggregation fails for a given validator (e.g., insufficient co-signers) the signature stays in the table indefinitely during a stall — prune_gossip_signatures(finalized_slot) cannot help because the finalized slot doesn't advance.

This is an existing pre-stall risk being widened by this PR. Whether it's in-scope here is a judgement call, but it's worth documenting as a known gap.


Minor: latest_finalized() is read 3 times in the same tick

In blockchain/src/store.rs interval-0 handler, latest_finalized() is called:

  • Line 321: slot.saturating_sub(store.latest_finalized().slot) — condition check
  • Line 326: finalized_slot = store.latest_finalized().slot — inside warn!
  • Inside periodic_prune() at store.rs:431: self.latest_finalized().root

That's 3 DB reads for the same checkpoint value that doesn't change between calls. The simplest fix is to read the checkpoint once before the if:

let finalized = store.latest_finalized();
if slot > 0
    && slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(finalized.slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    warn!(%slot, finalized_slot = finalized.slot, "Finalization stalled, ...");
    store.periodic_prune();
}

periodic_prune would still do its own reads internally, but eliminating the duplicate in the caller is a clean win.


Minor: No test for the trigger condition

The two new tests exercise periodic_prune() directly. The 3-part guard in blockchain/src/store.rs (slot > 0, multiple-of check, lag check) has no test coverage. A small on_tick test that verifies:

  • pruning fires when slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0 and slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS
  • pruning does NOT fire when finalization is healthy (lag ≤ threshold)

...would close that gap and act as a regression guard if the condition logic is ever refactored.


Positive notes

  • The trigger condition is logically correct. saturating_sub safely handles any edge case where finalized_slot >= slot.
  • is_multiple_of is stable since Rust 1.86.0; the project targets 1.92.0. ✓
  • The correct ordering (prune_old_states before prune_old_blocks) is preserved — states depend on block headers for slot lookup.
  • The interval constant (7200 slots ≈ 8 hours at 4s/slot) is well-chosen and consistent with Zeam's approach.
  • The condition truly is a no-op when finalization is healthy: two integer comparisons per slot-0 tick.
  • Test math for periodic_prune_removes_old_states_and_blocks is correct: 5 excess entries minus 2 protected (finalized + justified) = 3 deleted → STATES_TO_KEEP + 2 remaining.
  • warn! log level is appropriate for a stalled finalization event.

Automated review by Claude (Anthropic) · custom prompt

pablodeymo and others added 3 commits March 3, 2026 12:52
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Two tests that verify the 3-part guard in on_tick interval 0:
  - on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
    fires when slot is a multiple of the interval AND finalization lag
    exceeds the threshold
  - on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
    NOT fire when finalization is recent (lag within threshold)

  Also exports Table, StorageWriteBatch, and StorageReadView from the
  storage crate to enable direct backend access in blockchain tests.
Base automatically changed from prune-states to main March 3, 2026 16:16