
Add periodic fallback pruning for stalled finalization #179

Open
pablodeymo wants to merge 8 commits into main from periodic-fallback-pruning

Conversation

@pablodeymo
Collaborator

Motivation

PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization
advances (finalized.slot > old_finalized_slot in update_checkpoints). If finalization stalls —
due to network issues, low participation, or other consensus failures — the States and
BlockHeaders/BlockBodies/BlockSignatures tables grow unbounded because the pruning condition
is never met.

This is adapted from Zeam's approach, which runs a periodic
fallback every 7200 slots (~8 hours) to prune non-canonical states when finalization is far behind.

Description

Adds a periodic pruning check in on_tick (interval 0) that calls the existing prune_old_states
and prune_old_blocks methods directly, bypassing the finalization-gated path in
update_checkpoints.

Trigger conditions (all must be true):

  1. slot > 0 (skip genesis)
  2. slot is a multiple of PRUNING_FALLBACK_INTERVAL_SLOTS (7200 slots = ~8 hours at 4s/slot)
  3. current_slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS (finalization is truly stalled)

If finalization is healthy, condition 3 is never met, so the normal pruning path in
update_checkpoints handles everything and this code is a no-op (two cheap integer comparisons
per slot-0 tick).
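
The three conditions can be sketched as a pure function. This is an illustrative sketch only: `should_fallback_prune` and the free-function form are hypothetical, only `PRUNING_FALLBACK_INTERVAL_SLOTS` comes from the PR, and the real check lives inline in `on_tick` (using `is_multiple_of` rather than `%`):

```rust
/// Hypothetical standalone form of the three trigger conditions.
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200; // ~8 hours at 4s/slot

fn should_fallback_prune(slot: u64, finalized_slot: u64) -> bool {
    slot > 0                                           // 1. skip genesis
        && slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0 // 2. on the interval
        && slot.saturating_sub(finalized_slot)
            > PRUNING_FALLBACK_INTERVAL_SLOTS          // 3. truly stalled
}
```

When finalization is healthy (lag ≤ 7200 slots), condition 3 fails and the whole check costs only the comparisons above.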

Changes

File Change
crates/storage/src/store.rs Add PRUNING_FALLBACK_INTERVAL_SLOTS constant (7200), add periodic_prune() method, add 2 tests
crates/storage/src/lib.rs Export PRUNING_FALLBACK_INTERVAL_SLOTS
crates/blockchain/src/store.rs Import constant, add periodic pruning check in interval 0 of on_tick

Safety

  • prune_old_states and prune_old_blocks are already idempotent and protect finalized/justified
    roots via the protected_roots parameter
  • periodic_prune passes [latest_finalized.root, latest_justified.root] as protected, matching
    the existing behavior in update_checkpoints
  • The periodic check adds negligible overhead: two integer comparisons per slot-0 tick

How to test

cargo test -p ethlambda-storage -- periodic_prune
cargo test --workspace --release
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all -- --check

New tests:

  • periodic_prune_removes_old_states_and_blocks — verifies pruning beyond retention window while
    protecting finalized/justified roots
  • periodic_prune_no_op_within_retention — verifies no pruning when within retention window
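
As a rough mental model of what the two tests check (a toy sketch, not the real `Store` API: the map and the `prune` function stand in for the States/Blocks tables and the `prune_old_states`/`prune_old_blocks` helpers):

```rust
use std::collections::BTreeMap;

type Root = u64;

/// Toy model of retention pruning: drop entries whose slot is below the
/// cutoff unless their root is protected (finalized/justified roots).
/// Returns how many entries were removed.
fn prune(states: &mut BTreeMap<Root, u64>, cutoff_slot: u64, protected: &[Root]) -> usize {
    let before = states.len();
    states.retain(|root, slot| *slot >= cutoff_slot || protected.contains(root));
    before - states.len()
}
```

In this model, `periodic_prune_removes_old_states_and_blocks` corresponds to a cutoff above some stored slots while protecting the finalized/justified roots, and `periodic_prune_no_op_within_retention` corresponds to a cutoff at or below every stored slot, which deletes nothing.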

MegaRedHand and others added 4 commits March 2, 2026 11:17
  When finalization stalls (e.g., network issues, low participation), the
  States and Block tables grow unbounded because retention-based pruning
  only triggers on finalization advancement. This adds a periodic check
  every 7200 slots (~8 hours) that calls prune_old_states and
  prune_old_blocks directly when finalization is far behind, adapted from
  Zeam's approach. The check is cheap (two integer comparisons per slot-0
  tick) and the actual pruning reuses existing idempotent methods.
@github-actions

github-actions bot commented Mar 3, 2026

🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues

  1. Race condition in on_tick (blockchain/src/store.rs:318-329): The check slot.saturating_sub(store.latest_finalized().slot) > PRUNING_FALLBACK_INTERVAL_SLOTS uses the finalized slot from the store, but this could change between the check and the actual pruning operation. This could lead to pruning when finalization has actually advanced.

  2. Missing synchronization in periodic_prune (storage/src/store.rs:430-440): The method reads latest_finalized() and latest_justified() separately, which could lead to inconsistent state if these values change between reads. Should read both atomically.

Security & Correctness Issues

  1. Incorrect pruning logic (storage/src/store.rs:430): The protected roots only include latest_finalized().root and latest_justified().root, but should also include all their ancestor blocks/states within the retention window. The current implementation might prune necessary historical data.

  2. Test setup issue (storage/src/store.rs:1458-1467): In the test periodic_prune_removes_old_states_and_blocks, the finalized checkpoint is set to slot 0 but the justified checkpoint is set to slot 1. This is an invalid state since justified should always be ≥ finalized.

Performance & Best Practices

  1. Redundant check (blockchain/src/store.rs:319): The slot > 0 check is redundant since slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS) will be false for slot 0 anyway.

  2. Logging level inconsistency: The warning log in on_tick uses warn! but the actual pruning in periodic_prune uses info!. For consistency and operator awareness, both should probably be at the same level.

  3. Magic number exposure: PRUNING_FALLBACK_INTERVAL_SLOTS is exported from storage but only used in blockchain. Consider keeping it private to storage and exposing a higher-level configuration method instead.

Suggested Fixes

// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}

// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}

Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

github-actions bot commented Mar 3, 2026

🤖 Codex Code Review

  1. High: periodic_prune can create dangling LiveChain entries by deleting non-finalized headers
  • periodic_prune only calls prune_old_states + prune_old_blocks (store.rs:430).
  • prune_old_blocks deletes from BlockHeaders/BlockBodies/BlockSignatures but not LiveChain (store.rs:654).
  • During finalization stalls, LiveChain is explicitly still the source of fork-choice candidates, so this can leave roots selectable by fork choice but missing headers/state metadata.
  • There are multiple expect(...) paths that assume header availability for key roots, e.g. store.rs:1101, store.rs:1108, blockchain/store.rs:652, blockchain/store.rs:656, blockchain/store.rs:668.
  • Suggested fix: when pruning blocks in stalled-finality mode, either prune corresponding LiveChain entries too, or avoid pruning any root still present in LiveChain.
  2. Medium: off-by-one in fallback trigger delays first prune by one full interval
  • Condition uses slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS (blockchain/store.rs:321).
  • With finalized slot 0, first prune runs at slot 14400 (16h), not 7200 (8h) as documented.
  • If intended behavior is “prune every 7200 slots once lag reaches 7200”, this should be >=.
  3. Medium (performance): full-table scan/sort on tick path
  • Each fallback run does full BlockHeaders scans + sort for both states and blocks (store.rs:569, store.rs:620).
  • Running this synchronously inside on_tick interval 0 (blockchain/store.rs:317) can cause latency spikes under large DBs.
  • Consider incremental/index-based pruning or bounded batch deletion per tick.
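
The off-by-one in finding 2 is easy to verify numerically. This is an illustrative sketch: `first_trigger_slot` is a hypothetical helper that just scans interval multiples for the first slot satisfying the lag condition under either comparison:

```rust
const INTERVAL: u64 = 7200; // stands in for PRUNING_FALLBACK_INTERVAL_SLOTS

/// First interval-multiple slot at which the lag condition holds, under
/// either the strict `>` comparison (current code) or `>=` (suggested fix).
fn first_trigger_slot(finalized_slot: u64, inclusive: bool) -> u64 {
    (1u64..)
        .map(|k| k * INTERVAL)
        .find(|&slot| {
            let lag = slot.saturating_sub(finalized_slot);
            if inclusive { lag >= INTERVAL } else { lag > INTERVAL }
        })
        .expect("unbounded iterator always finds a trigger slot")
}
```

With finalized slot 0, the strict comparison first fires at slot 14400 (~16h) while `>=` fires at 7200 (~8h), matching the documented intent; the one-interval delay recurs whenever finalization stalls exactly at an interval boundary.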

Could not run tests in this environment due to rustup permission errors (/home/runner/.rustup/tmp not writable), so findings are from static review.


Automated review by OpenAI Codex · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: because prune_old_states / prune_old_blocks were only called when finalized.slot advanced, a prolonged consensus stall (network partition, low participation) allowed the States and BlockHeaders/BlockBodies/BlockSignatures tables to grow without bound. The fix wires a periodic fallback check into the existing on_tick interval-0 path and delegates to a new Store::periodic_prune() method that reuses the already-idempotent pruning helpers, protecting both the current finalized and justified roots.

Key points:

  • The three-condition guard (slot > 0, multiple of 7200, stall gap > 7200) is conservative by design: when finalization is healthy the check is a no-op (two comparisons + one metadata read per slot).
  • periodic_prune deliberately matches update_checkpoints's protected-roots set [finalized.root, justified.root], so no live checkpoint can be accidentally removed.
  • Two new unit tests cover the pruning and no-op paths at the Store level; the on_tick trigger conditions are straightforward.
  • One test comment contains misleading phrasing about block insertion counts that should be corrected for clarity.

Confidence Score: 4/5

  • Safe to merge; the periodic pruning implementation is conservative, reuses idempotent helpers, and the guard conditions prevent false positives when finalization is healthy.
  • The PR correctly implements fallback pruning for stalled finalization with conservative guard conditions. The three-part check ensures the code is a no-op when finalization is healthy. Protected-roots handling mirrors the existing update_checkpoints logic, so no valid checkpoint can be accidentally removed. The only finding is a misleading test comment that should be clarified for maintainability.
  • crates/storage/src/store.rs — test comment clarification

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- "No: finalization healthy" --> G
    D -- "Yes: finalization stalled" --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G

Last reviewed commit: d6c7958

Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@github-actions

github-actions bot commented Mar 3, 2026

🤖 Claude Code Review

Here is my review of PR #179:


Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:


Notable: periodic_prune is narrower than update_checkpoints

periodic_prune (store.rs:430) only calls prune_old_states + prune_old_blocks. The update_checkpoints path also calls prune_gossip_signatures, prune_attestation_data_by_root, and prune_aggregated_payload_table for both aggregated tables.

GossipSignatures are deleted post-aggregation (per delete_gossip_signatures at blockchain/src/store.rs:206), so they don't accumulate across slots under normal stall conditions. However, if aggregation fails for a given validator (e.g., insufficient co-signers) the signature stays in the table indefinitely during a stall — prune_gossip_signatures(finalized_slot) cannot help because the finalized slot doesn't advance.

This is an existing pre-stall risk being widened by this PR. Whether it's in-scope here is a judgement call, but it's worth documenting as a known gap.


Minor: latest_finalized() is read 3 times in the same tick

In blockchain/src/store.rs interval-0 handler, latest_finalized() is called:

  • Line 321: slot.saturating_sub(store.latest_finalized().slot) — condition check
  • Line 326: finalized_slot = store.latest_finalized().slot — inside warn!
  • Inside periodic_prune() at store.rs:431: self.latest_finalized().root

That's 3 DB reads for the same checkpoint value that doesn't change between calls. The simplest fix is to read the checkpoint once before the if:

let finalized = store.latest_finalized();
if slot > 0
    && slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(finalized.slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    warn!(%slot, finalized_slot = finalized.slot, "Finalization stalled, ...");
    store.periodic_prune();
}

periodic_prune would still do its own reads internally, but eliminating the duplicate in the caller is a clean win.


Minor: No test for the trigger condition

The two new tests exercise periodic_prune() directly. The 3-part guard in blockchain/src/store.rs (slot > 0, multiple-of check, lag check) has no test coverage. A small on_tick test that verifies:

  • pruning fires when slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0 and slot - finalized_slot > PRUNING_FALLBACK_INTERVAL_SLOTS
  • pruning does NOT fire when finalization is healthy (lag ≤ threshold)

...would close that gap and act as a regression guard if the condition logic is ever refactored.


Positive notes

  • The trigger condition is logically correct. saturating_sub safely handles any edge case where finalized_slot >= slot.
  • is_multiple_of is stable since Rust 1.86.0; the project targets 1.92.0. ✓
  • The correct ordering (prune_old_states before prune_old_blocks) is preserved — states depend on block headers for slot lookup.
  • The interval constant (7200 slots ≈ 8 hours at 4s/slot) is well-chosen and consistent with Zeam's approach.
  • The condition truly is a no-op when finalization is healthy: two integer comparisons per slot-0 tick.
  • Test math for periodic_prune_removes_old_states_and_blocks is correct: 5 excess entries minus 2 protected (finalized + justified) = 3 deleted → STATES_TO_KEEP + 2 remaining.
  • warn! log level is appropriate for a stalled finalization event.

Automated review by Claude (Anthropic) · custom prompt

pablodeymo and others added 3 commits March 3, 2026 12:52
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Two tests that verify the 3-part guard in on_tick interval 0:
  - on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
    fires when slot is a multiple of the interval AND finalization lag
    exceeds the threshold
  - on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
    NOT fire when finalization is recent (lag within threshold)

  Also exports Table, StorageWriteBatch, and StorageReadView from the
  storage crate to enable direct backend access in blockchain tests.
Base automatically changed from prune-states to main March 3, 2026 16:16