Summarizer stall observability: lag gauge + zero-balance guard#120
Open
AntiD2ta wants to merge 25 commits into
Open
Summarizer stall observability: lag gauge + zero-balance guard#120AntiD2ta wants to merge 25 commits into
AntiD2ta wants to merge 25 commits into
Conversation
Register chaind_summarizer_lag_epochs{pipeline} GaugeVec and refresh it on
every OnFinalityUpdated invocation via a deferred helper, so silent stalls
in the epoch / block / validator pipelines become alertable within minutes
instead of being discoverable only from dashboards.
Key decisions:
- Single lag gauge with pipeline label rather than the originally proposed
(stalled_epoch, stall_ticks, reason) triple — generalises across all stall
shapes and matches the consumer-lag mental model operators already have.
- Lag arithmetic uses float64 to avoid uint64 underflow when metadata is
briefly ahead of finality (would otherwise read ~1.8e19).
- Disabled pipelines skip WithLabelValues so their series are absent from
/metrics rather than reported as zero.
- Two silent-skip sites upgraded to Warn (len(balances)==0 in summarizeEpoch
and !updated in summarizeEpochs) with stable text so the lag-based alert
rule can match annotations exactly.
- updateLagGauges (does I/O) and setLagGauges (pure compute) split as a
testable seam — internal-package tests drive setLagGauges directly with
synthetic metadata.
- main_test.go switched from binary-level env-gating to flag.Set("test.skip",
"TestService") so internal-package unit tests run unconditionally while
the existing env-gated TestService is still skipped.
Files changed:
- services/summarizer/standard/metrics.go
- services/summarizer/standard/handler.go
- services/summarizer/standard/epoch.go
- services/summarizer/standard/main_test.go
- services/summarizer/standard/handler_internal_test.go (new)
- go.mod (kylelemons/godebug indirect via prometheus testutil)
Notes for next iteration:
- Issue 4 (smoke-test verification on Hoodi) and issue 5 (ADR for the
metric-shape decision) are unblocked once this lands and is deployed.
- Pre-existing flaky deadlock-detector failure in services/scheduler/standard
TestMulti is unrelated and reproduces on clean master.
Publish docs/adr/0001-summarizer-observability-shape.md, the first ADR in
this repository, recording the decision to expose summarizer stall state
as a single labelled gauge chaind_summarizer_lag_epochs{pipeline} rather
than the originally-proposed (stalled_epoch, stall_ticks, reason) triple.
Derived from the draft at .scratch/adr-0001-summarizer-observability-shape.md
and polished into the MADR template since the repo had no prior ADR style
to follow.
Key decisions:
- MADR template (Status / Context and Problem Statement / Decision Drivers
/ Considered Options / Decision Outcome with Pros and Cons) chosen as
the recognisable standard. Confirm choice with upstream maintainers in
PR review per the issue guidance.
- ADR documents the decision as shipped in commit 066bb38, not as drafted.
The draft's "Summarizer skipped epoch..." log string never landed; the
shipped strings — "No validator balances available; cannot summarize
epoch (will retry on next finality tick)" and "Not enough data to update
summary; will retry on next finality tick" — are recorded verbatim so
the alert annotation contract stays accurate.
- Implementation notes capture the testable-seam split between
updateLagGauges (does I/O) and setLagGauges (pure compute), the float64
cast that prevents uint64 underflow, and the disabled-pipeline behaviour
(no WithLabelValues call so the series is absent from /metrics).
- Rejected alternatives recorded with their trade-offs: the original
(stalled_epoch, stall_ticks, reason) triple, the hybrid lag-gauge +
per-event counter, and pre-populating the gauge at startup via
eth2Client.Finality().
- Alert-annotation contract location corrected post-review: the in-code
comment lives on the summarizerLagEpochs var declaration in metrics.go,
not at the prometheus.Register call site, so the ADR documents the var
declaration as the reference site.
Files changed:
- docs/adr/0001-summarizer-observability-shape.md (new)
Notes for next iteration:
- This is the first ADR in the repo. PR description should call out that
it bootstraps the docs/adr/ convention and invite upstream maintainers
to confirm the MADR template choice.
- Issue 4 (Hoodi smoke-test verification) remains the only open
PRD-A-related task; it requires real Postgres + beacon-node
infrastructure and operator sign-off, so it is HITL despite the AFK
label.
- docs/prometheus.md does not yet document the new
chaind_summarizer_lag_epochs metric; updating it is out of scope here
but is worth a follow-up issue.
The targetEpoch - md.LastXEpoch formula was structurally incapable of detecting the silent-corruption mode the alert was meant to catch: ValidatorBalancesByEpoch's LEFT JOIN returns one zero-balance row per validator when t_validator_balances is empty for an epoch, so the existing len(balances)==0 guard never trips, the cursor advances unconditionally, and the gauge stays pinned at 0. An end-to-end smoke test on a seeded chaind instance (DELETE all t_validator_balances rows for one finalized epoch and observe) confirmed the failure shape: gauge stayed at 0/0/0 throughout, neither alert-contract Warn log fired, the cursor advanced past the corrupt epoch, and t_epoch_summaries was silently written with f_active_balance=0. See docs/adr/0001-summarizer-observability-shape.md "Amendment 1" for the full rationale and the considered alternatives. Re-instrument the gauge as a per-pipeline cursor diff against each pipeline's direct upstream: pipeline=epoch = validators.LatestBalancesEpoch - summarizer.LastEpoch pipeline=block = summarizer.LastEpoch - summarizer.LastBlockEpoch pipeline=validator = summarizer.LastEpoch - summarizer.LastValidatorEpoch Diffs are clamped to [0, +Inf) via a small helper to absorb the benign race window where a downstream cursor is read ahead of its upstream across two metadata reads. The metric name, label key, label values, and gauge registration are unchanged — operators running the existing alert rule do not need to migrate (threshold semantics shift slightly, documented in the ADR). Add a defensive guard in summarizeEpoch after the active-validator balance accumulation loop and before BeginTx: if ActiveValidators > 0 and ActiveBalance == 0, refuse to write a corrupt summary, reuse the existing alert-contract warn message verbatim, and return (false, nil) so the cursor stays put and the cycle retries on the next finality tick. The pre-existing len(balances)==0 check is retained as the bootstrap path with a comment differentiating it from the new production-state guard. Both branches share one warn-log message so the alert annotation contract is single-sourced. Tests: rewrite TestSetLagGaugesWiring with cursor-diff cases (caught up, upstream ahead, intra-summarizer lag, race-window clamp, disabled pipeline absence) and add TestSummarizeEpochZeroBalanceAssertion that drives summarizeEpoch with 1000 validators returning zero balances and asserts (false, nil), the warn log fires, and no row is written to t_epoch_summaries via a recording chaindb mock. Both tests run unconditionally under gosilent test ./... — no env gating. Files changed: - services/summarizer/standard/handler.go - services/summarizer/standard/epoch.go - services/summarizer/standard/metadata.go - services/summarizer/standard/handler_internal_test.go - docs/adr/0001-summarizer-observability-shape.md (Amendment 1)
Add a chaind_summarizer_* section to docs/prometheus.md. All seven summarizer metrics were undocumented (six pre-date this branch — the gap dates to whenever chaind_summarizer_* was introduced — plus the new lag_epochs from this branch). Documenting the full set in one place catches the doc up with reality and gives operators a complete reference for the new alert. Fix the Help string on summarizerLagEpochs: it claimed the gauge measures lag against "the finality target", which was true under the original formula in commit 066bb38 but no longer matches the cursor- diff formula introduced in the previous commit. Replace with "lags its direct upstream cursor" so /metrics, the docs, and the ADR all agree on the gauge's contract.
`-test.skip` is an unanchored regex, so the previous "TestService" pattern would also skip any future test whose name contains that substring (e.g. a hypothetical "TestServiceConfiguration"). Anchor with ^...$ so only the exact env-gated TestService is suppressed when CHAINDB_URL / ETH2CLIENT_ADDRESS are absent.
Extend the issue-10 zero-balance guard pattern (commit 1edcb45) from the epoch-summarizer to the day-summarizer in addValidatorBalanceSummaries. Both ValidatorBalancesByEpoch call sites at validatorday.go (the startBalances fetch and the endBalances fetch) previously carried the same dead len(...) == 0 guard that the issue-7 audit (.scratch/code-reading-notes-balance-fetcher.md § Q4) flagged as a residual landmine: the LEFT JOIN in ValidatorBalancesByEpoch returns one COALESCE(f_balance, 0) row per validator when t_validator_balances is empty for an epoch, so the naive length check cannot detect missing balance rows in steady state. Currently unreachable because of the upstream invariant chain (issue-10 epoch assertion → md.LastEpoch gating + prune safety clamp) but exposed to a manual DELETE targeting an epoch-summarized-but-not-yet-day-summarized epoch, and to any future refactor that decouples the day cursor from the epoch cursor. Key decisions: - Single predicate validatorBalancesPresent collapses the bootstrap (len == 0) and production-state (all-zero rows) shapes into one trip condition, single-sourcing the alert-annotation contract at each call site. Slightly different factoring than epoch.go's two inline branches (which need the active-balance sum for other reasons) — the day-layer has no such constraint, so the early-exit scan is strictly cheaper and more readable. - Warn-log message text reused verbatim ("No validator balances available; cannot summarize epoch (will retry on next finality tick)") so the lag-based alert rule from PRD-A issue 3 pattern-matches without modification. The "epoch" wording is preserved deliberately and the comment at each call site spells out why; structured Time("start_time", …) / Time("end_time", …) zerolog fields carry the day-layer identity for operator disambiguation. - Both guards return (false, nil), matching the existing function contract — the parent summarizeValidatorsInDay short-circuits before BeginTx, so no row is written to t_validator_day_summaries. - No changes to ValidatorBalancesByEpoch or to retention / pruning logic — the team's option-B rejection from issue 10 stands; the LEFT JOIN exists for a real reason, and consumer-side guards remain the chosen pattern. Tests: TestAddValidatorBalanceSummariesZeroBalanceAssertion drives addValidatorBalanceSummaries directly with two table-driven subtests (startBalances all zero, endBalances all zero), each constructing a 1000-validator slice with Balance=0 / EffectiveBalance=0 to mimic the LEFT JOIN shape. Asserts (false, nil) return, the alert-contract warn log fires with the exact message string and structured start_time / end_time fields, and the SetValidatorDaySummaries recorder stays empty (structural-impossibility witness — the function under test never calls the setter, the parent does after a (true, nil) return, so the recorded slice can only be non-empty if a future refactor breaks the short- circuit contract). Tests run unconditionally under gosilent test ./... — no env gating. Files changed: - services/summarizer/standard/validatorday.go - services/summarizer/standard/validatorday_internal_test.go (new) Notes for next iteration: - The .scratch/code-reading-notes-balance-fetcher.md § Q4 residual- landmine entry has been updated locally to record the fix; the scratch tree is git-excluded and so does not appear in the diff. - ADR 0001 already documents the zero-balance-assertion pattern in its "Amendment 1" section; this commit extends the implementation to a second call site without changing the design, so no ADR amendment is filed. - Issue 9 (PRD-C verdict) and the upstream playbook-driven HITL slices remain the next pickups; this defense-in-depth fix is independent of PRD-C convergence.
The interface and implementation comments for PruneValidatorBalances and PruneValidatorEpochSummaries claimed the cutoff was "up to (but not including) the given epoch", but both SQL statements emit `f_epoch <= $1` — inclusive of the boundary epoch. The caller in services/summarizer/standard/prune.go computes `pruneEpoch` as the last epoch outside the retention window and passes it directly to the pruner; "delete everything ≤ pruneEpoch" is the intended semantics, and the SQL is correct. The doc was wrong in both places (interface declaration in chaindb/service.go and the two postgresql/ impl comments). Flagged in issue 7's code-reading audit (`.scratch/code-reading-notes-balance-fetcher.md` Q4) and surfaced again in issue 9's "Open questions for the verdict author" list as a "minor doc/code inconsistency, not a load-bearing bug." Fixing it now keeps the verdict-author distraction list shorter and removes a future-rot hazard before someone reads the doc, trusts it, and writes a caller that relies on exclusive semantics. Key decisions: - Doc updated to match SQL, not the other way around. Changing `<=` to `<` in the SQL would silently retain one extra epoch of every prune cycle's data — a behavior change with retention/disk implications, not a wording fix. No caller exists that depends on the exclusive reading. - "the given point" in the epoch-summaries doc replaced with "the given epoch" to match the function's `to phase0.Epoch` parameter and the balances-pruner sibling for terminological consistency. - No tests touched — the postgresql package has no unit tests for these pruners (the layer is integration-tested via the broader chaindb setup), and a doc-only fix introduces no behavior to assert. Files changed: - services/chaindb/service.go (interface comments for both pruners) - services/chaindb/postgresql/validators.go (impl comment for balances) - services/chaindb/postgresql/validatorepochsummaries.go (impl comment for epoch summaries) Notes for next iteration: - Issue 9 (PRD-C verdict synthesis) remains the only open issue and is HITL — production chaindb-d04 access plus human classification judgment. No AFK successor is queued in `.scratch/issues/`. - Pre-existing diagnostics surfaced incidentally during this edit (blocks.go:895 nilness, validators.go:523/598 slicessort, validatorepochsummaries.go:221/228/234/251 QF1012) are unrelated to this change and are left for a dedicated cleanup pass. - The other "Open question for the verdict author" from issue 7's audit — the vestigial `MissedEpochs` field at services/validators/standard/metadata.go:28 — is a larger judgement call (schema/JSON-compat with persisted t_metadata rows) and was not attempted in this slice.
The investigation slice on this branch closes with three in-tree
artifacts: an ADR documenting the validator-balance fetcher's
persistence-and-recovery contract; a bug-confirming test that
empirically reproduces the silent-cursor-advance shape that caused the
2026-04-07 chaindb-d04 outage; and deprecation comments on the
vestigial MissedEpochs field across three services.
Production forensic capture against chaindb-d04 reframed the 2026-04-07
incident from "singular event" to "one of 113 historical clusters
totalling 828 corrupted t_epoch_summaries rows across ~9 months of
operation." Cluster sizes range from 1 epoch (~6 min on Hoodi) to 159
epochs (~17 hours); the 2026-04-07 incident is at the small end. The
investigation classified the root cause as a beacon-side miss
(upstream condition: the beacon endpoint occasionally returns an empty
200 OK validators response) combined with a chaind-side amplifier in
onEpochTransitionValidatorBalancesForEpoch that silently advances
md.LatestBalancesEpoch when the beacon returns zero validators or
validators whose balances are all zero.
In-tree changes:
- services/validators/standard/handler_internal_test.go (new) — first
test file in this package. Four table-driven cases exercise
onEpochTransitionValidatorBalancesForEpoch with hand-built stubs
(eth2client.ValidatorsProvider, chaindb.ValidatorsSetter,
chaindb.Service, chaintime.Service) modelled on the precedent set
by services/summarizer/standard/handler_internal_test.go. Cases:
(1) empty 200-OK validators map, (2) all-zero-balance validators
amplified through the "do not store 0 balances" filter at
handler.go:206-208, (3) mixed validators sanity, (4) beacon error
sanity. Cases 1 and 2 currently PASS — confirming the bug
exists. Once the focused fix lands, cases 1 and 2 invert from
"cursor advanced" to "cursor unchanged, error returned" and the
test serves as the post-fix regression boundary. No production
code touched.
- docs/adr/0002-validator-balance-fetcher-recovery-model.md (new) —
second ADR in the repo, MADR template matching ADR 0001's style.
Records the fetcher's persistence-and-recovery contract as
"cursor-trusting recovery with empty-response guarded": atomic
co-commit of cursor + balance rows (already enforced); no-empty-
write rejection of zero-validator and all-zero-balance responses
(must be enforced — currently not — a follow-up brings the
implementation into compliance); cursor-trusting recovery on
restart (already enforced). Considered alternatives include a
MissedEpochs queue (Option 4), explicitly rejected because the
consumer code in proposerduties has no population path and the
field is residue from an abandoned design. Open questions for
upstream review noted at the end.
- services/validators/standard/metadata.go,
services/finalizer/standard/metadata.go,
services/proposerduties/standard/metadata.go (modified) —
// Deprecated: comments on each MissedEpochs field declaration.
All three services share the same residue: declared, JSON-migrated
for backward compat, never populated by code in this repo. The
proposerduties variant additionally notes that service.go:166-204
(handleMissed) reads from a permanently-empty list and is dormant
code retained alongside the field. No behavioural change; the
comments serve as a footgun warning for future maintainers and as
a discoverable cross-reference to ADR 0002 Option 4.
Key decisions:
- Empirical-confirmation test landed BEFORE the fix. This makes the
bug diff-visible to reviewers, locks the post-fix contract as a
regression boundary, and gives the documented contract an
empirical backbone. Test names spell out the bug
("...AdvancesCursor") so grep + git blame on a future regression
point immediately at the defect class.
- ADR documents the contract that the follow-up fix brings the
implementation into compliance with — not the buggy current
state. ADR 0001's precedent uses "Amendment N" sections for
follow-up changes; this ADR may need an amendment when the
no-empty-write fix lands, depending on whether reviewers prefer
"ADR matches as-shipped behaviour" or "ADR documents intended
contract, fix tracks compliance." Defer to upstream maintainer
preference.
- MissedEpochs deprecated rather than removed. Removal would
require (a) decoding/upgrading existing t_metadata JSON rows with
"missed_epochs" present in chaindb instances that have been
running for years, (b) deleting the dormant handleMissed consumer
in proposerduties, (c) potentially breaking any out-of-tree fork
that activated the field. None justified by the data. The
deprecation comments make the field's status discoverable; if a
future maintainer has a reason to remove it, the cost is theirs
to bear with full context.
Files changed:
- services/validators/standard/handler_internal_test.go (new, 286
lines)
- docs/adr/0002-validator-balance-fetcher-recovery-model.md (new,
~145 lines)
- services/validators/standard/metadata.go (deprecation comment, 10
lines added)
- services/finalizer/standard/metadata.go (deprecation comment, 8
lines added)
- services/proposerduties/standard/metadata.go (deprecation comment,
9 lines added)
Notes for next iteration:
- The focused no-empty-write fetcher fix is the clear next pickup.
~10 lines of production code + the test inversion in the test file
added here. Effort estimate ~30-60 min.
- The ADR's Implementation-notes section flags a lag-gauge follow-up
question that's worth attention before the fix lands: the current
epoch-pipeline diff is between the fetcher cursor and the
summarizer cursor, and both stall together when the fetcher
refuses to advance, so the existing alert may need a refinement
for the post-fix world. Not a blocker for the fix; a question
about alert semantics.
- PR #120 still open and awaiting upstream review. Once it merges,
this branch's work is complete on the observability axis; the
fetcher-fix axis is independent and can land separately.
onEpochTransitionValidatorBalancesForEpoch advanced
md.LatestBalancesEpoch and committed the database transaction even when
the beacon returned an empty 200-OK validators map, or returned
validators whose Balance fields were all zero (which the "do not store
0 balances" filter at handler.go:217-219 amplified into a zero-row
insert). Either shape produced a t_epoch_summaries row with
f_active_balance=0 AND f_active_validators>0, the silent-corruption
signature documented as the no-empty-write contract violation in
docs/adr/0002-validator-balance-fetcher-recovery-model.md (claim 2).
This commit brings the implementation into compliance with that
contract. Two narrow guards in
onEpochTransitionValidatorBalancesForEpoch:
1. After validators := validatorsResponse.Data, reject len == 0
before BeginTx. No transaction opens on this path.
2. After the bulk-build of dbValidatorBalances, reject len == 0,
cancel() the open transaction so its writes do not commit, and
return. Catches the all-zero-balance shape that the filter at
handler.go:217-219 reduces to an empty slice.
Both rejection sites emit the same warn-log via the new package
constant noValidatorBalancesMsg ("Beacon returned no validator
balances; cannot persist epoch (will retry on next finality tick)").
Single source of truth so a future operator runbook or Prometheus
alert rule keyed off this substring matches both paths uniformly,
and a rename produces a compile error rather than silent test/prod
drift. The error returned propagates through
onEpochTransitionValidatorBalances to OnBeaconChainHeadUpdated where
the existing log.Warn().Err(err).Msg("Failed to update validators")
gives operator visibility. The cursor stays put; the next finality
tick re-runs the fetcher and either succeeds against a recovered
beacon or stalls visibly.
Test inversions in handler_internal_test.go:
- Cases 1 and 2 (renamed from "...AdvancesCursor" to
"...RejectedAndCursorUnchanged") now assert the post-fix
contract: err != nil, no SetValidatorBalances call, cursor
unchanged, no commit, transaction cancelled (case 2 only —
case 1 fires before BeginTx), and the warn-log emitted with
noValidatorBalancesMsg and the structured "epoch" field. Pre-
fix these tests asserted the buggy shape and served as
empirical bug confirmation; post-fix they are the regression
boundary.
- captureWarnLog helper redirects the package-level zerolog
logger to a bytes.Buffer at WarnLevel for the duration of a
test, restoring on cleanup. Mirrors the inline pattern in
services/summarizer/standard/validatorday_internal_test.go,
factored into a helper because two cases need it. Defensive
GlobalLevel save/restore guards against a future main_test.go
setting Disabled to silence noise.
- Cases 3 (mixed validators) and 4 (beacon error) keep their
existing assertions — positive and negative sanity tests,
unaffected by the fix. Case 3's loop modernized to
`for i := range numValidators` for in-file consistency.
Key decisions:
- Single noValidatorBalancesMsg package constant rather than two
hard-coded strings. Both production sites and both test
assertions reference it. Closes the silent-drift hazard that a
test-side warnLogContract constant would have left.
- The fix is intentionally narrow. No new error type, no
refactor of the surrounding s.balances asymmetry (BeginTx fires
outside the if s.balances block; setMetadata/CommitTx are not
inside it either — pre-existing structural quirk, structurally
unreachable today because the parent already gates on
s.balances). Flagged here for a future cleanup pass; not in
scope for this commit.
- Test names changed to describe the post-fix contract. The pre-
fix names spelled the bug ("...AdvancesCursor"); keeping them
would have been misleading once the assertion meaning inverted.
- Error message text "beacon returned no validator balances" is
short and stable; the operator-facing semantics live in the
warn-log line, not the error wrap.
Files changed:
- services/validators/standard/handler.go — noValidatorBalancesMsg
package constant; two guards in
onEpochTransitionValidatorBalancesForEpoch.
- services/validators/standard/handler_internal_test.go — case 1
and case 2 inversions and renames; captureWarnLog helper; case 3
loop modernization for in-file consistency.
Notes for next iteration:
- Lag-gauge semantics question raised by the cursor-coherence fix:
the per-pipeline lag gauge introduced in commit 1edcb45 measures
validators.standard.LatestBalancesEpoch -
summarizer.standard.LastEpoch. When the fetcher refuses to
advance, the summarizer also stalls (it depends on
t_validator_balances), so the diff stays at zero and no alert
fires. Whether to add a fetcher-vs-finality lag gauge as a
separate alert path, or accept "summarizer freezes alongside
fetcher" as the correct semantics for the existing gauge, is a
follow-up for the observability owner. Not a blocker; not in
scope here.
- Two pre-existing efficiency notes flagged during review and left
as-is per the narrow-scope constraint: the if s.balances block
inside onEpochTransitionValidatorBalancesForEpoch is dead
(parent gates on s.balances), and BeginTx/setMetadata/CommitTx
live outside that inner block. Structurally unreachable today;
worth a separate cleanup commit when someone takes the broader
refactor on.
- A pre-existing rangeint diagnostic on case 3's loop was fixed
here for in-file consistency; the broader rangeint sweep across
the chaindb package (validators.go:523/598, etc.) remains a
candidate for a dedicated cleanup pass.
…esForEpoch
The inner `if s.balances {` gate was structurally unreachable: the only
production caller, onEpochTransitionValidatorBalances at handler.go:165-
167, already short-circuits with `if !s.balances { return nil }` before
invoking this function. The dead wrapper was flagged in commit f6d1ef8's
"Notes for next iteration" as worth a separate cleanup commit. This is
that commit.
Pure dedent: contents of the dead block are dedented one level, the
wrapping `if s.balances {` / closing `}` lines are removed. No
statement is moved, added, deleted, or reordered. The function's
external behavior is unchanged for the only reachable call shape
(s.balances == true). All four internal tests in
handler_internal_test.go continue to pass without modification.
Key decisions:
- Scope kept narrow. The `BeginTx`/`setMetadata`/`CommitTx` envelope
stays exactly where it is, even though efficiency review flagged that
`BeginTx` could move below the slice-build to shrink the transaction
window. That is the broader cleanup pass f6d1ef8's notes referenced;
it deserves its own commit and review.
- No precondition doc-comment added on the function. The implicit
contract ("caller must have verified s.balances == true") is plainly
visible in the same file 19 lines above the function header; adding
a comment would be defensive design for a hypothetical future caller.
Files changed:
- services/validators/standard/handler.go — remove inner `if s.balances`
wrapper, dedent its body.
Notes for next iteration:
- Efficiency: `s.chainTime.FirstSlotOfEpoch(epoch)` is called three
times in onEpochTransitionValidatorBalancesForEpoch (lines 192, 193,
212). Trivial to cache in a local variable; pre-existing, flagged
during review of this commit, not in scope here.
- Efficiency: `BeginTx` is acquired before the 1.3M-validator slice
build, widening the transaction window beyond what is necessary.
This is the structural quirk f6d1ef8 flagged as a broader cleanup
pass; remains a follow-up.
- Code-quality: the inline comment "same warn-log contract as the
empty-map guard" (handler.go:235) uses "empty-map" while the rest of
the file and the ADR use "empty-response" / "empty 200-OK" — minor
terminology drift, candidate for a wider docs/comments tidy-up.
Three identical calls to s.chainTime.FirstSlotOfEpoch(epoch) collapse into a single firstSlot local bound at the top of the function: fmt.Sprintf for the beacon API state ID, the trace log, and the OTEL span attribute now all read the same cached value. Flagged as a follow-up in commit 1df1613's "Notes for next iteration": Efficiency: s.chainTime.FirstSlotOfEpoch(epoch) is called three times in onEpochTransitionValidatorBalancesForEpoch (lines 192, 193, 212). Trivial to cache in a local variable; pre-existing, flagged during review of this commit, not in scope here. This is that commit. The motivation is structural rather than performance: FirstSlotOfEpoch itself is a single multiplication (services/chaintime/standard/ service.go:201-203) and the parent runs at most once per ~6-minute finality tick, so the saved cycles are negligible. The value is single-source-of-truth for the slot identity used in three sinks: if a future change swaps one of the three sites for LastSlotOfEpoch, the two remaining sites would silently disagree under the old shape; under the cached shape such an edit produces a single visible difference at the binding site. Key decisions: - Binding placed after `log` (line 191) and before `stateID` (line 193), matching the file's existing top-of-function setup pattern (`log` -> derived scalars -> I/O). `firstSlot` mirrors the sibling local `firstEpoch` in onEpochTransitionValidatorBalances (line 171) for naming consistency. - Cast forms preserved verbatim at each site: `fmt.Sprintf("%d", firstSlot)`, `uint64(firstSlot)`, `int(firstSlot)`. `phase0.Slot` is a `uint64` alias so the casts are no-op widenings; preserving them keeps the diff a literal substitution and the OTEL/zerolog type contracts unchanged. - No tests added or modified. The four existing internal tests in handler_internal_test.go drive onEpochTransitionValidatorBalances- ForEpoch with a chaintimeStub whose FirstSlotOfEpoch returns `phase0.Slot(uint64(epoch) * 32)` deterministically; they assert the values produced, not the call count, and continue to pass without modification. A behavioral test would be defensive design for a hypothetical future caller. Files changed: - services/validators/standard/handler.go - introduce `firstSlot` local; replace three direct calls with the cached value. Notes for next iteration: - Two of the three follow-up items from commit 1df1613's notes remain open: (a) the BeginTx/setMetadata/CommitTx envelope is acquired before the 1.3M-validator slice build, widening the transaction window beyond what is necessary - the broader cleanup pass that f6d1ef8 also flagged; (b) the inline comment "same warn-log contract as the empty-map guard" (handler.go:235) uses "empty-map" while the rest of the file and ADR 0002 use "empty-response" / "empty 200-OK" - terminology drift, candidate for a wider docs/comments tidy-up. Both intentionally deferred to keep this commit narrow. - Pre-existing flake: services/scheduler/standard TestManyJobs triggers the deadlock detector (recursive locking on s.jobsMutex / job.stateLock) intermittently. Confirmed to occur on the clean branch tip without this change via git stash; not a regression of this refactor. Worth a separate bug-confirming test pass when someone takes it on.
The inline back-reference at services/validators/standard/handler.go:236 called the upstream guard the "empty-map guard", but the rest of the file (line 203's self-description: "an empty 200-OK validators response"), the test file (handler_internal_test.go uses both "empty 200-OK response" and "empty-response"), and ADR 0002 throughout (option-1 header, decision outcome, rejected-alternative discussion) all use "empty-response" as the canonical short name. Flagged twice in recent commit notes (5e16b88, 1df1613) as "minor terminology drift, candidate for a wider docs/comments tidy-up." This is the focused alignment. Single-word swap, comment-only. go build, go vet, gofmt, and the four existing tests in services/validators/standard pass unchanged. No behavior change, no test changes. Key decisions: - Minimal scope. A repo-wide grep for "empty-map" returned exactly one match (handler.go:236); expanding into a "wider docs/comments tidy-up" sweep would be speculative churn against the rest of the file's already-consistent vocabulary. The wider sweep stays a separate follow-up if anyone takes it on. - "empty-response guard" chosen over "empty 200-OK guard" because ADR 0002 uses the former as the canonical short name (option-1 header, decision-outcome paragraph, rejected-alternatives discussion). The downstream guard at line 237 is the same no-empty-write contract reified for the all-zero-balance shape, not a different category, so a single umbrella term applies to both back-references. - Comment structure preserved verbatim. Surrounding lines (233-235, 237-241) are byte-identical to pre-change; no rewrite of adjacent prose, no whitespace shifts. Files changed: - services/validators/standard/handler.go - single word in the inline comment block above the all-zero-balance guard. Notes for next iteration: - The services/scheduler/standard TestManyJobs deadlock-detector trip, flagged as "intermittent" in 5e16b88's notes, is now reproducing 5/5 on the parent commit (without this change). The detector reports recursive locking on s.jobsMutex inside ScheduleJob.func1. The symptom has shifted from intermittent to deterministic since the prior commit's note was written - worth a focused bug-confirming pass when someone takes it on, since reliable repro now makes the bug tractable. - Two structural follow-ups remain from earlier commit notes: the BeginTx/setMetadata/CommitTx envelope in onEpochTransitionValidatorBalancesForEpoch is still acquired before the 1.3M-validator slice build (widening the transaction window beyond what is necessary), and pre-existing chaindb diagnostics (blocks.go:895 nilness, validators.go:523/598 slicessort, validatorepochsummaries.go:221/228/234/251 QF1012 sites) await a dedicated cleanup pass. - The repo's .golangci.yml is incompatible with the installed golangci-lint v2 binary ("'output.formats' expected a map, got 'slice'"). Pre-existing config drift, unrelated to this change; candidate for a separate lint-config refresh commit. - Running "go fix ./..." against the current tree produces a large interface{} -> any rewrite across ~14 files (chaindb/ postgresql/*.go, summarizer/, scheduler/ tests, etc.). These rewrites were reverted from this commit to keep scope narrow. Worth a dedicated commit if the maintainers want the modernisation in.
The repo's .golangci.yml was a v1-format configuration; the installed golangci-lint v2.12.2 fails to load it with: Error: can't load config: 'output.formats' expected a map, got 'slice' Flagged in commit b5e3646's "Notes for next iteration" as pre-existing config drift, candidate for a separate lint-config refresh commit. This is that commit. The migration was performed via "golangci-lint migrate --skip-validation" (the bundled v1->v2 converter). The most visible shape changes are structural rather than behavioural: - Top-level "version: 2" added (now required). - "output.formats" is a map keyed by formatter name instead of a list of {format, path} entries. - "linters.enable-all: true" -> "linters.default: all". - Formatter linters (gofmt, gofumpt, goimports) split out of "linters" into a new top-level "formatters" section. - Per-linter settings moved from "linters-settings" to "linters.settings". - Exclusion globs (".*_ssz\\.go$") moved into both "linters.exclusions.paths" and "formatters.exclusions.paths" -- v2 evaluates exclusions per-section rather than globally. Two linters that the v1 disable list named are silently dropped by the migrator because they no longer exist as separately enableable units in v2: - "gci" -- in v2 it is a formatter, not a linter, and the formatters section here only enables gofmt/gofumpt/goimports. Effective status (gci off) unchanged. - "tenv" -- removed in golangci-lint v2 (folded into usetesting). Nothing to opt out of. Key decisions: - Used the bundled "golangci-lint migrate" rather than hand-editing. The tool warns that comments are not migrated, which the original config used heavily as documentation for default values; the migrated file drops those comments in favour of a compact, valid v2 shape. Restoring the comment block would mean re-deriving the v2 defaults by hand and re-annotating; out of scope for a config-refresh commit. - The migrate tool also drops the "issues.include" list of specific exclusion IDs (EXC0002, EXC0005, EXC0009, EXC0011, EXC0012, EXC0014) and replaces them with the broader v2 presets ("common-false-positives" and "std-error-handling"). This is a small behavioural change in what gets excluded, but the v2 presets are the recommended idiom; tuning back to the exact original ID set is left for a follow-up if specific noise appears. - "run.timeout: 10m" is intentionally not preserved; the migrator warns that v2 disables the timeout by default, and matching that default seems preferable to re-introducing a v1-era setting. - Did not add an explicit "gomodguard_v2" entry even though the v2 linter run prints a deprecation warning for "gomodguard". The warning is informational only -- the config still loads, the lint still runs -- and a deeper linter-set refresh (which linters to enable/disable in v2) is a wider piece of work than this commit's scope. Files changed: - .golangci.yml -- full v1 -> v2 migration via "golangci-lint migrate --skip-validation". 120 lines -> 75 lines. Notes for next iteration: - The migrated config emits a deprecation warning for "gomodguard" on every run. Worth a focused follow-up to either swap to "gomodguard_v2" with equivalent settings or to disable it explicitly if no module restrictions are wanted. - The exclusion-ID -> presets translation is approximate. If the v2 lint output now flags previously-suppressed cases (e.g., revive's package-comments, exported-type comments, comments-on- exported), a future commit should restore the original set explicitly via "linters.exclusions.rules" entries. - "linters-settings.lll" / "lll: 132" still present in the migrated config but lll is in the disable list, so the setting is dead. Cosmetic, not a config error; can be removed in a cleanup pass.
Flagged in commit b5e3646's "Notes for next iteration": Running "go fix ./..." against the current tree produces a large interface{} -> any rewrite across ~14 files (chaindb/postgresql/*.go, summarizer/, scheduler/ tests, etc.). These rewrites were reverted from this commit to keep scope narrow. Worth a dedicated commit if the maintainers want the modernisation in. This is that commit. Identical in scope to what "go fix ./..." emits against the tree at this point, with no hand-edits beyond what the fixer produced. All test/build/vet gates remain green; the pre- existing services/scheduler/standard TestManyJobs deadlock-detector trip (deterministic on this branch tip, unrelated to this commit per b5e3646's notes) remains the sole failing case. The sweep covers five categories of go-fix-driven modernisations: 1. interface{} -> any (Go 1.18+ alias) - services/chaindb/postgresql/blobsidecars.go - services/chaindb/postgresql/proposerduties.go - services/chaindb/postgresql/setblobsidecars.go - services/chaindb/postgresql/setconsolidationrequests.go - services/chaindb/postgresql/setdepositrequests.go - services/chaindb/postgresql/setwithdrawalrequests.go 2. sort.Slice(s, func(i, j) ...) -> slices.Sort(s) (Go 1.21+) - services/chaindb/postgresql/aggregatevalidatorbalances.go - services/chaindb/postgresql/validators.go (two sites) - services/summarizer/standard/epoch.go (two sites) Also addresses the slicessort diagnostic flagged in b5e3646's notes (validators.go:523/598). 3. strings.HasPrefix(s, p) + strings.TrimPrefix(s, p) -> strings.CutPrefix (Go 1.20+) - services/chaindb/postgresql/chainspec.go 4. ad-hoc min/max via if-block -> max()/min() builtin (Go 1.21+) - services/chaintime/standard/service.go FirstEpochOfSyncPeriod's altair-fork floor. - services/eth1deposits/getlogs/service.go blocksPerRequest end-block clamp. 5. C-style "for i := 0; i < n; i++" -> "for i := range n" or "for range n" (Go 1.22+) - services/scheduler/standard/service_test.go (two sites) - services/summarizer/standard/handler_internal_test.go Key decisions: - One bundled commit rather than five micro-commits. Each category is mechanical, the diff is small, and "go fix ./..." names the scope at a glance for any future bisection. Splitting by category would introduce churn without making any individual change easier to audit. - No surrounding-code cleanup or refactor folded in. Cases like validators.go's pre-fix sort.Slice closure could plausibly have been reordered alongside other nearby cleanups, but the rule for go-fix sweeps is "exactly what the fixer emits, nothing more" so the diff stays trivially auditable. - The commit covers the slicessort diagnostic flagged in b5e3646's notes (validators.go:523/598) as a side effect. Two of the three pre-existing chaindb diagnostics from those notes (blocks.go:895 nilness, validatorepochsummaries.go:221/228/234/ 251 QF1012) are NOT addressed here -- "go fix" does not target those classes -- and remain candidates for separate focused commits. Files changed (14 total, 25 insertions, 37 deletions): - services/chaindb/postgresql/aggregatevalidatorbalances.go - services/chaindb/postgresql/blobsidecars.go - services/chaindb/postgresql/chainspec.go - services/chaindb/postgresql/proposerduties.go - services/chaindb/postgresql/setblobsidecars.go - services/chaindb/postgresql/setconsolidationrequests.go - services/chaindb/postgresql/setdepositrequests.go - services/chaindb/postgresql/setwithdrawalrequests.go - services/chaindb/postgresql/validators.go - services/chaintime/standard/service.go - services/eth1deposits/getlogs/service.go - services/scheduler/standard/service_test.go - services/summarizer/standard/epoch.go - services/summarizer/standard/handler_internal_test.go Notes for next iteration: - blocks.go:895 nilness diagnostic, flagged in b5e3646's notes, is not addressed by this sweep. Worth a focused commit. - validatorepochsummaries.go:221/228/234/251 QF1012 diagnostics, flagged in b5e3646's notes, are not addressed by this sweep ("go fix" does not target staticcheck quickfixes). Worth a focused commit. - The BeginTx/setMetadata/CommitTx envelope refactor in onEpochTransitionValidatorBalancesForEpoch (flagged across f6d1ef8, 1df1613, 5e16b88, b5e3646) remains open.
go vet's nilness analyser flags blocks.go:895's `if err != nil` as unreachable. Walking the code confirms it: at line 861, `err` is freshly bound from `rows.Scan(...)` and immediately checked at line 876. The non-nil branch returns; the nil branch falls through to the value-extraction copies (no operations between line 879 and 895 reassign or shadow `err`). By the time control reaches the second check, `err` is provably nil. Flagged in commit b5e3646's "Notes for next iteration" alongside two other pre-existing chaindb diagnostics (validators.go:523/598 slicessort, validatorepochsummaries.go:221/228/234/251 QF1012) left for "a dedicated cleanup pass." This is the focused commit for the nilness one. Single-pattern removal of three lines: - if err != nil { - return nil, err - } with the surrounding `copy(block.ETH1DepositRoot[:], ...)` and `blocks = append(blocks, block)` lines preserved verbatim. No control-flow change for any reachable input -- the deletion only removes a check that statically cannot fire. Key decisions: - Targeted to LatestBlocks only. A repo-wide grep for `if err != nil` in blocks.go returns 20+ matches; auditing each for analogous dead shape is a separate cleanup pass and is not in scope here. - No replacement comment. The pattern is a half-deleted copy-paste remnant from sibling functions in this file (which return `(nil, err)` after a different sequence); explaining its absence would document a non-event. - The fix does not change function shape: LatestBlocks still walks rows in a for-loop, scans each, populates a *chaindb.Block, and appends. No tests in this package exercise LatestBlocks directly (chaindb/postgresql has no internal tests), and the package compiles + the existing `gosilent test` is unaffected. Files changed: - services/chaindb/postgresql/blocks.go -- remove dead `if err != nil` block at the end of the LatestBlocks row-walk loop. Notes for next iteration: - Two chaindb diagnostics from b5e3646's notes remain unaddressed: validatorepochsummaries.go:221/228/234/251 QF1012 (next commit in this sweep). validators.go:523/598 slicessort was already closed by 67a61f2's go-fix sweep. - The LSP reports four further QF1012 sites in this same file (blocks.go:154/161/167/184) that were not in b5e3646's flagged set. Out of scope for this commit; candidate for a wider QF1012 pass if maintainers want the broader cleanup. - The "various if err != nil" copy-paste pattern across blocks.go is worth a focused audit for analogous dead-branch cases. Not attempted here.
staticcheck QF1012 flags four sites in ValidatorEpochSummaries (lines 221, 228, 234, 251) where the strings.Builder's content is appended via: queryBuilder.WriteString(fmt.Sprintf(`...%s...`, args...)) The recommended idiom writes directly into the builder via fmt.Fprintf, avoiding the intermediate string allocation: fmt.Fprintf(&queryBuilder, `...%s...`, args...) Flagged in commit b5e3646's "Notes for next iteration" alongside two other pre-existing chaindb diagnostics. This is the focused commit for those four QF1012 sites. The behavioural surface is unchanged. fmt.Fprintf into a strings.Builder writes the same bytes as fmt.Sprintf followed by WriteString -- the format-string semantics, the verb interpretation, and the resulting byte sequence are identical. The only observable difference is the absence of the intermediate string heap allocation, which the staticcheck quickfix targets. Key decisions: - Targeted to the four sites flagged in b5e3646's notes. Other WriteString(fmt.Sprintf(...)) call sites elsewhere in the chaindb package were not in the flagged set; auditing them all is a separate cleanup pass. - All four sites use the multi-line raw-string template (`... %s f_... = $%d`) for SQL fragment construction. The diff is a pure call-shape rewrite -- no whitespace adjustment, no template edits, no parameter reordering. - No new import. fmt was already imported in the file (line 16); the fmt.Fprintf change is internal to the same package usage. Files changed: - services/chaindb/postgresql/validatorepochsummaries.go -- four queryBuilder.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&queryBuilder, ...) rewrites in ValidatorEpochSummaries (lines 221, 228, 234, 251 in the pre-change file). Notes for next iteration: - The LSP reports four further QF1012 sites in blocks.go (lines 154/161/167/184), surfaced after the dead-err-check removal in the previous commit (ad5a2e7). Same pattern, same one-line rewrite per site. Worth a focused commit if maintainers want the broader QF1012 sweep. - All three chaindb diagnostics from b5e3646's "Notes" are now closed: validators.go:523/598 slicessort (closed by 67a61f2's go-fix sweep), blocks.go:895 nilness (closed by ad5a2e7), and validatorepochsummaries.go:221/228/234/251 QF1012 (this commit). - The BeginTx/setMetadata/CommitTx envelope refactor in onEpochTransitionValidatorBalancesForEpoch (flagged across f6d1ef8, 1df1613, 5e16b88, b5e3646) remains the next pickup.
…orEpoch The validator-balance fetcher opened its database transaction before the 1.3M-validator slice build, holding the MVCC snapshot and transaction ID open for the duration of a CPU-bound loop that does no database work. Flagged as a pre-existing structural quirk across four recent commits (f6d1ef8, 1df1613, 5e16b88, b5e3646) with the same prescription each time: Efficiency: BeginTx is acquired before the 1.3M-validator slice build, widening the transaction window beyond what is necessary. This commit closes that follow-up. Refactor shape: BeginTx moves from above the slice-build loop to just below the all-zero guard, immediately before the first SetValidatorBalances call. The transaction window now covers exactly the bulk insert + setMetadata + commit envelope, not the preceding pure-CPU loop. A consequence visible in tests: the two empty-response guards are now symmetric. Both fire BEFORE BeginTx, both short-circuit with no transaction to cancel, and the inline comment block above guard 2 is updated to reflect that ("No transaction is open at this point, so the rejection short-circuits cleanly"). The bulk-insert retry path at the original BeginTx-after-rollback site is unchanged: when SetValidatorBalances fails, the retry still cancels the failed transaction and opens a fresh one for the per-row fallback inserts. Test inversion (GREEN after RED): TestOnEpochTransitionValidatorBalancesForEpoch_AllZeroBalancesRejectedAndCursorUnchanged The pre-refactor assertions encoded the old contract: beginTxCalls == 1, cancelCalls == 1 ("guard 2 fires after BeginTx; the open transaction must be cancelled") Post-refactor assertions encode the new contract: beginTxCalls == 0, cancelCalls == 0 ("guard 2 fires before BeginTx; no transaction may open on this path") The docstring's prose was also rewritten -- previously it called out "Unlike guard 1, this path has already opened a transaction by the time it fires"; that asymmetry between the two guards is now gone, and the docstring says so explicitly. The other three test cases (EmptyValidatorsRejectedAndCursorUnchanged, MixedBalancesPersistsNonZero, BeaconErrorLeavesCursorUnchanged) keep their existing assertions and pass unchanged. Key decisions: - Transaction window shrinks but does NOT change atomicity. The bulk-insert + setMetadata + CommitTx envelope is still an atomic unit: either both the validator-balance rows land AND the cursor advances, or neither. ADR 0002's "atomic co-commit of cursor + balance rows" property is preserved; this refactor only changes WHEN the transaction opens, not what it covers. - Slice-build loop intentionally left outside the transaction even though it touches no database state, because moving it inside would re-introduce the wide-window cost. The slice is a function-local []*chaindb.ValidatorBalance with no shared state, so concurrent fetchers cannot observe it; transaction isolation buys nothing. - "cancel" name preserved on the post-move BeginTx return. The retry path at the bulk-insert failure site reassigns it (`dbCtx, cancel, err = s.chainDB.BeginTx(ctx)`); keeping the same variable name avoids cascading downstream changes. - Empty-response guards are now symmetric. ADR 0002's no-empty- write contract treated them as the same category at the contract level; this refactor reifies that symmetry at the implementation level. Future readers can reason about either guard without reaching for "but this one runs after BeginTx." Files changed: - services/validators/standard/handler.go -- move BeginTx below the slice-build loop and below the all-zero guard; update the inline comment to match the post-refactor "no transaction open" semantics. - services/validators/standard/handler_internal_test.go -- invert the AllZeroBalancesRejectedAndCursorUnchanged assertions (beginTxCalls/cancelCalls go from 1/1 to 0/0); rewrite the docstring to reflect the post-refactor symmetry between the two guards. Notes for next iteration: - The dead `if s.balances` block (closed by 1df1613) is an unrelated structural cleanup that has already landed; this commit is the second of the two pre-existing structural quirks flagged in f6d1ef8's notes. - The lag-gauge semantics question (whether a fetcher-vs-finality gauge should join the existing summarizer-vs-fetcher gauge, given the new no-empty-write contract) remains open per f6d1ef8's notes -- a design call for the observability owner, not actionable AFK. - The deterministic services/scheduler/standard TestManyJobs deadlock-detector trip remains open; investigated as a go-deadlock false positive triggered by the 2048-goroutine contention, but not fixed in this commit.
staticcheck QF1012 flags four sites in Blocks() (lines 154, 161, 167, 184) where the strings.Builder's content is appended via: queryBuilder.WriteString(fmt.Sprintf(`...%s...`, args...)) The recommended idiom writes directly into the builder via fmt.Fprintf, avoiding the intermediate string allocation: fmt.Fprintf(&queryBuilder, `...%s...`, args...) Flagged in commit 0486b6a's "Notes for next iteration" -- these four sites were surfaced after the dead-err-check removal in ad5a2e7 and are the same pattern that 0486b6a applied to validatorepochsummaries. This is the focused commit for those four QF1012 sites. The behavioural surface is unchanged. fmt.Fprintf into a strings.Builder writes the same bytes as fmt.Sprintf followed by WriteString -- the format-string semantics, the verb interpretation, and the resulting byte sequence are identical. The only observable difference is the absence of the intermediate string heap allocation, which the staticcheck quickfix targets. Key decisions: - Targeted to the four sites flagged in 0486b6a's notes, all inside Blocks() (the BlockFilter query-builder switch). golangci-lint still reports staticcheck QF1012 in attestations.go:720 and beaconcommittees.go:91/98 -- those are the next candidates for the package-wide sweep that 0486b6a's notes anticipated, but auditing all ~41 WriteString(fmt.Sprintf) call sites across chaindb/postgresql is a separate cleanup pass. - All four sites use the multi-line raw-string template (`...%s f_... = $%d` and `...LIMIT $%d`) for SQL fragment construction. The diff is a pure call-shape rewrite -- no whitespace adjustment, no template edits, no parameter reordering. Identical mechanical shape to commit 0486b6a. - No new import. fmt was already imported in the file (line 20); the fmt.Fprintf change is internal to the same package usage. - The unrelated wsl_v5, noinlineerr, and revive issues that golangci-lint reports in this file (lines 54, 103, 108, 262) are NOT addressed here -- they predate the QF1012 cleanup pass and live on different lines. Out of scope. Files changed: - services/chaindb/postgresql/blocks.go -- four queryBuilder.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&queryBuilder, ...) rewrites in Blocks() (lines 154, 161, 167, 184 in the pre-change file). Notes for next iteration: - attestations.go:720 and beaconcommittees.go:91/98 (and potentially the unflagged sites at 105/122) report staticcheck QF1012 in current golangci-lint output. Same one-line rewrite per site; candidate for a wider focused commit if maintainers want the package-wide cleanup. - The deterministic services/scheduler/standard TestManyJobs deadlock-detector trip (flagged across f6d1ef8, b5e3646, 8be1f3c) remains open. Per standing direction not to silence go-deadlock false positives in tests or production, it awaits an upstream go-deadlock fix or new direction before the candidate fix paths (Opts callback reconfiguration, dependency drop, test-load pinning) can be applied. - The lag-gauge semantics question (whether a fetcher-vs-finality gauge should join the existing summarizer-vs-fetcher gauge, given the no-empty-write contract introduced in this branch's validator-balance fetcher work) remains open per f6d1ef8's notes -- a design call for the observability owner, not actionable AFK. - The .golangci.yml v2 migration's deprecation warning for "gomodguard" still fires on every lint run (per ddbb6c4's notes); cosmetic-only follow-up.
services/scheduler/standard's TestManyJobs (service_test.go:438)
schedules 2048 jobs whose timer goroutines wake at the same instant
and contend for s.jobsMutex. Under v0.3.5 of
github.com/sasha-s/go-deadlock the detector trips deterministically
at any n >= 16 concurrent jobs on darwin, killing the test process
via the default Opts.OnPotentialDeadlock = os.Exit(2).
Two distinct false-positive shapes reproduce on the v0.3.5 baseline:
Shape A -- "Recursive locking" with two identical stack frames
from the same line/function context (a real recursive
lock by a single goroutine cannot produce identical
stacks; this is two different goroutines whose IDs the
detector has confused).
Shape B -- "happened after" with both holders attributed to
"goroutine 0" (the detector's placeholder when its
goroutine-ID parser falls back), reporting concurrent
holders of distinct job.stateLock instances plus the
shared jobsMutex.
Walking the lock pattern in services/scheduler/standard/service.go
confirms no real circular order:
- ScheduleJob.func1 (timer branch): acquires jobsMutex alone.
Calls finaliseJob later; finaliseJob acquires stateLock alone.
- RunJob / CancelJob: acquire jobsMutex, release before touching
stateLock. No nested holding.
- runJob: acquires stateLock alone.
Order across functions is jobsMutex -> (release) -> stateLock in
every reachable code path; no path holds both simultaneously. Both
the "recursive" shape (identical stack frames) and the "goroutine 0"
placeholder point at go-deadlock's goroutine-ID-from-runtime.Stack
parser, which has known limits when many short-lived goroutines
contend at once.
The fix consumes upstream go-deadlock work rather than adding any
chaind-side workaround:
v0.3.7 (2026-03-07) fixes "concurrent lock tracking problems
during contention" -- the load-bearing change
that resolves the false positive.
v0.3.8 (2026-03-17) replaces the goroutine-per-lock detection
mechanism with time.AfterFunc + sync.Pool,
removing the parser code path that the
"goroutine 0" symptom pointed at, but
introduces -race unit-test failures.
v0.3.9 (2026-03-18) ships one day after v0.3.8 to "fix existing
-race unit test failures" on the same rework.
Stable endpoint of the v0.3.7+ fix range.
Probe outcome on the development laptop:
Baseline (v0.3.5): 0/5 PASS (5/5 deterministic FAIL with the
POTENTIAL DEADLOCK report)
Upgraded (v0.3.9): 5/5 PASS, then 10/10 PASS
(15/15 total, no POTENTIAL DEADLOCK reports,
original jobs := 2048 constant unchanged)
gosilent test ./... (full suite, v0.3.9):
87/87 PASS, no NEW lock-ordering issues
surfaced by the upgraded detector
Key decisions:
- Bump rather than mutex swap, deadlock.Opts override, or test-load
reduction. The standing project direction is not to silence
go-deadlock false positives in tests or production -- the bump is
the smallest action that does not violate that direction; the
upstream detector remains active and now correctly reports no
false positives on chaind's lock pattern.
- v0.3.9 rather than v0.3.7 (the minimum-sufficient fix). v0.3.7
alone would close the false positive, but v0.3.8's rework + its
v0.3.9 -race fix all ship within an 11-day window with a public
-race regression in the v0.3.8 middle. Landing on v0.3.9 picks
up the fix + rework + rework's follow-up in one bump rather than
freezing chaind at v0.3.7 against an actively-evolving upstream.
The API surface chaind uses (deadlock.Mutex / deadlock.RWMutex,
no Opts configuration, no callback registration) is stable across
the entire v0.3.x series, so the rework is invisible at chaind's
call sites.
- petermattis/goid (transitive dependency of go-deadlock) bumped
alongside, matching the goid pseudo-version pinned in go-deadlock
v0.3.9's own go.mod. Mechanically required by go mod tidy, not
over-specified.
- No source change in this commit. The two existing call sites in
services/scheduler/standard/service.go (line 32:
stateLock deadlock.Mutex; line 46: jobsMutex deadlock.RWMutex)
remain byte-identical. The behaviour change is entirely in the
upstream library.
Files changed:
- go.mod -- github.com/sasha-s/go-deadlock v0.3.5 -> v0.3.9;
github.com/petermattis/goid pseudo-version bumped to
the goid release pinned by go-deadlock v0.3.9.
- go.sum -- corresponding hash replacements for both modules
(4 stale lines removed, 4 new lines added; no other
module hashes touched). go mod verify confirms all
modules verified against the public Go checksum DB.
Notes for next iteration:
- services/scheduler/standard/service.go retains 21 pre-existing
golangci-lint findings (wsl_v5, noinlineerr, revive) that are
unrelated to this bump and were already documented as out of
scope across the recent QF1012 sweep commits. Candidate for a
focused style-pass commit if maintainers want the broader
cleanup.
- The .golangci.yml v2 migration's deprecation warning for
"gomodguard" still fires on every lint run; cosmetic-only
follow-up tracked separately as a lint-config-hygiene candidate.
- The remaining QF1012 sites in attestations.go and
beaconcommittees.go (5 sites total) remain candidates for the
next focused commit in the QF1012 sweep started in 0486b6a and
9c6bd55.
Commit ddbb6c4 (".golangci.yml v1->v2 migration") landed the v2 schema and flagged three pre-existing drift items as out of scope and candidates for a focused follow-up commit. Each of the three has since reproduced in every golangci-lint run as either a `level=warning` emission or a now-unsuppressed staticcheck finding: warning: The linter 'gomodguard' is deprecated (since v2.12.0) due to: new major version. Replaced by gomodguard_v2. warning: [runner/nolint_filter] Found unknown linters in //nolint directives: stylecheck ST1005: error strings should not be capitalized (3 visible sites that //nolint:stylecheck no longer suppresses in v2) This commit closes all three. 1. gomodguard added to the linters.disable list. v2.12.0 deprecated gomodguard in favour of gomodguard_v2. chaind has no linters.settings.gomodguard block and no linters.settings.gomodguard_v2 block -- there is no configured module-restriction policy, so the linter is enabled-by-default (via linters.default: all at line 11) but enforces nothing. Adding gomodguard to the disable list reflects the actual no-op state and silences the deprecation warning. Renaming to gomodguard_v2 would relocate the no-op without adding value. Inserted alphabetically between goconst and inamedparam to preserve the existing sort. 2. Four //nolint:stylecheck directives renamed to //nolint:staticcheck. v2 merged the v1 stylecheck linter into staticcheck (single linter under the staticcheck name in v2's namespace). The four //nolint:stylecheck directives target an unknown-in-v2 linter and currently suppress nothing -- the surrounding `// skipcq: SCC-ST1005` trailers also do not suppress them (skipcq is a SonarCloud/CodeQ tag, not a golangci-lint nolint directive). Renaming each to //nolint:staticcheck restores the original suppression intent under v2's namespace. Sites: services/synccommittees/standard/parameters.go:113, 117 services/beaconcommittees/standard/parameters.go:105, 109 The ST1005 finding being suppressed at all four sites is the capitalized error string `"Ethereum 2 client does not provide ..."`. These error strings are constant English literals returned by the parameter-parsing constructors when the eth2client argument fails the required interface assertion; they are intentionally human-readable (the SCC-ST1005 skipcq tag documents the same intent for a different toolchain). No ST1005 finding outside these four lines is masked by the rename. 3. Dead lll.line-length: 132 settings block removed from linters.settings. .golangci.yml had `lll.line-length: 132` settings, but `lll` is in the disable list (line 26 in the pre-change file). With the linter disabled the settings can never apply. The block is pure dead config; removing the two lines silences nothing visible (lll never ran) but eliminates a genuine config inconsistency that would mislead a future reader auditing the settings. No replacement -- the maximum-line-length policy was already not being enforced. Behavioural surface: Pre-change golangci-lint run ./...: - gomodguard deprecation warning: present - "Found unknown linters: stylecheck" warning: present - issue count: 67 (staticcheck: 8, of which 3 are ST1005 on the four-site source listed above) Post-change golangci-lint run ./...: - gomodguard deprecation warning: silenced - "Found unknown linters: stylecheck" warning: silenced - issue count: 64 (staticcheck: 5, the 3 ST1005 sites are now correctly suppressed by the renamed nolint directives; no new findings) go build ./..., go vet ./..., gosilent test ./...: all clean (87 tests pass, same as the pre-change baseline carried since 6a148f6's full-suite run). Key decisions: - gomodguard disable rather than rename to gomodguard_v2. Because no module-restriction policy is configured anywhere in the repo (verified: no linters.settings.gomodguard block, no linters.settings.gomodguard_v2 block), a rename would relocate a no-op to a new namespace without changing observable behavior. Disable reflects the real intent. - All four //nolint:stylecheck sites in scope, even though golangci-lint v2 only flags three of the four ST1005 warnings in baseline output (the synccommittees parameters.go:117 directive's target line at 118 does not currently surface in staticcheck output; whether by truncation or by some other filter is unclear, but the same `// skipcq: SCC-ST1005` trailer pattern at the other three sites is also flagged, so the absence is not a structural distinction). Renaming all four keeps the file self-consistent and matches the spec. - All three drift items bundled into a single commit. They share the same root cause (the v1->v2 migration in ddbb6c4) and the same review thread. Splitting them would not aid reviewability -- each is small (1-3 lines) -- and would create three near- identical commit messages. - No drive-by edits. The other settings entries under linters.settings (gosec, nlreturn, staticcheck, tagliatelle) are intentionally left as-is in this commit, including a structurally identical orphan that surfaced during review (see Notes below). Strict scope discipline matches the spec list of three drift items; the orphan is tracked for the next focused commit. Files changed: - .golangci.yml -- gomodguard added to linters.disable (insert at alphabetical position); lll.line-length: 132 removed from linters.settings (2 lines). Net: +1 / -2. - services/beaconcommittees/standard/parameters.go -- two //nolint:stylecheck -> //nolint:staticcheck rewrites at lines 105 and 109 (the BeaconCommitteesProvider and EventsProvider type-assertion paths). No other content change; the trailing `// skipcq: SCC-ST1005` comments on the errors.New(...) lines preserved verbatim. - services/synccommittees/standard/parameters.go -- two //nolint:stylecheck -> //nolint:staticcheck rewrites at lines 113 and 117 (the SyncCommitteesProvider and EventsProvider type-assertion paths). No other content change; the trailing `// skipcq: SCC-ST1005` comments on the errors.New(...) lines preserved verbatim. Notes for next iteration: - linters.settings.nlreturn.block-size: 3 (lines 41-42 in the post-change file) is structurally identical to the lll orphan removed by this commit: nlreturn is in the disable list (line 31), so the block-size setting can never apply. Pre-existing, not introduced here, but a natural pickup for the next focused config-hygiene commit -- mechanically the same 2-line removal. - The QF1012 sweep across attestations.go (1 site) and beaconcommittees.go (4 sites) in services/chaindb/postgresql, flagged across 0486b6a and 9c6bd55's "Notes for next iteration", remains the next candidate refactor commit. - The wider QF1012 sweep across the remaining ~36 sites in 12 other chaindb/postgresql files is gated on enabling stricter staticcheck QF rules in .golangci.yml (chaindb/postgresql files outside attestations.go/beaconcommittees.go/blocks.go/ validatorepochsummaries.go are not currently flagged by the default rule set).
staticcheck QF1012 flags five sites across attestations.go and beaconcommittees.go where the strings.Builder's content is appended via: queryBuilder.WriteString(fmt.Sprintf(`...%s...`, args...)) The recommended idiom writes directly into the builder via fmt.Fprintf, avoiding the intermediate string allocation: fmt.Fprintf(&queryBuilder, `...%s...`, args...) Flagged in commit 9c6bd55's "Notes for next iteration" -- those five sites were the next pickup in the QF1012 sweep started in 0486b6a (validatorepochsummaries) and continued in 9c6bd55 (blocks). This is the focused commit for those sites. The behavioural surface is unchanged. fmt.Fprintf into a strings.Builder writes the same bytes as fmt.Sprintf followed by WriteString -- the format-string semantics, the verb interpretation, and the resulting byte sequence are identical. strings.Builder.Write is documented to never return an error, so the discarded (n, err) return from fmt.Fprintf is also safe and matches how the prior two QF1012 commits (0486b6a, 9c6bd55) handled the same shape. The only observable difference is the absence of the intermediate string heap allocation per call site, which is exactly what staticcheck QF1012 targets. Sites: attestations.go -- 1 site: Limit-clause append in Attestations() (line 720 in the pre-change file). beaconcommittees.go -- 4 sites: From-bound (line 91), To-bound (line 98), CommitteeIndices ANY-clause (line 105), and Limit-clause (line 122) appends in BeaconCommittees(). Trust-boundary check confirms the rewrite introduces no new SQL injection surface. The two interpolated arguments at every site are either the closed-enum string `wherestr` (assigned only to the literal "WHERE" or " AND" -- no path from user input) or the non-injectable integer `len(queryVals)` (an int-typed slice length formatted with %d). User-supplied filter values continue to flow through the queryVals bind-parameter slice and are passed to tx.Query as positional parameters, never interpolated into the query string. Same pattern as before the rewrite. Key decisions: - Targeted to the five sites flagged in 9c6bd55's notes. Sites 105 and 122 in beaconcommittees.go are byte-for-byte identical patterns to 91 and 98 (the same `WriteString(fmt.Sprintf(...))` shape, the same SQL-fragment template family); golangci-lint's staticcheck integration only surfaces three of the four beaconcommittees sites in default output, but landing all four in the same commit keeps the file self-consistent rather than mixed. File-level consistency outweighs strict "only-flagged-sites" scope. - All five sites use the multi-line raw-string template (`... %s f_... = $%d`, `...LIMIT $%d`) for SQL-fragment construction. The diff is a pure call-shape rewrite -- no whitespace adjustment, no template edits, no parameter reordering. Identical mechanical shape to commits 0486b6a and 9c6bd55. - No new import. fmt was already imported in both files (attestations.go line 19; beaconcommittees.go line 18); the fmt.Fprintf change is internal to the same package usage. - Pointer receiver on the strings.Builder argument (&queryBuilder). strings.Builder's Write method has a pointer receiver, so passing &queryBuilder satisfies io.Writer with no interface boxing of the receiver value. Passing queryBuilder by value would not compile (Builder's no-copy invariant forbids copying after first use, enforced by the noCopy field embedded in the type). - The unrelated wsl_v5, noinlineerr, and revive issues that golangci-lint reports in attestations.go are NOT addressed here -- they predate the QF1012 cleanup pass and live on different lines. Out of scope. The four ST1005 nolint sites cleaned up in 32e48c1 are also unrelated to the QF1012 sweep. Files changed: - services/chaindb/postgresql/attestations.go -- one queryBuilder.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&queryBuilder, ...) rewrite at line 720 in the pre-change file (Attestations() Limit clause). - services/chaindb/postgresql/beaconcommittees.go -- four queryBuilder.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&queryBuilder, ...) rewrites at lines 91, 98, 105, and 122 in the pre-change file (BeaconCommittees() From, To, CommitteeIndices, and Limit clauses). Behavioural surface: Pre-change golangci-lint run ./services/chaindb/postgresql/: - QF1012 hits in attestations.go: 1 (line 720) - QF1012 hits in beaconcommittees.go: 3 (lines 91, 98, 122 in default staticcheck output; line 105 same shape, surfaced via grill audit) Post-change golangci-lint run ./services/chaindb/postgresql/: - QF1012 hits in attestations.go: 0 - QF1012 hits in beaconcommittees.go: 0 - QF1012 hits in blobsidecars.go: 3 (lines 70, 75, 80; unchanged from baseline -- explicitly out of scope, see notes below) go build ./..., go vet ./..., gofmt -s -l on the two touched files: all clean. gosilent test ./...: 87/87 PASS, same baseline carried since 6a148f6's full-suite run. Notes for next iteration: - blobsidecars.go:70/75/80 are flagged QF1012 in current golangci-lint output. Same one-line rewrite per site; candidate for the next focused commit if maintainers want the blobsidecars cleanup. - The wider package-wide sweep across the remaining ~33 unflagged sites in 11 other chaindb/postgresql files (e.g. validators.go, proposerduties.go, syncaggregates.go, etc.) is gated on enabling stricter staticcheck QF rules in .golangci.yml. Those files are not currently surfaced by the default rule set. - The .golangci.yml hygiene cleanup landed in 32e48c1 closed the three v1->v2 migration drift items; the structurally-identical nlreturn.block-size: 3 orphan flagged in 32e48c1's notes is the next mechanical config-hygiene pickup. - The deterministic services/scheduler/standard TestManyJobs go-deadlock false positive was closed by 6a148f6's bump of go-deadlock from v0.3.5 to v0.3.9; no further action needed there.
Several comments accumulated during the lag-observability work grew into
multi-paragraph essays that re-told the PR description: bootstrap-vs-
production-state guard explanations, full ADR-0002 retellings on test
docstrings, the cursor-diff formula derivation in three places, and the
exhaustive Deprecated: blocks on the three MissedEpochs fields each
listed the other two services and instructed readers to "file an issue
to remove the field." Per the project's "no comments unless the why is
non-obvious" convention, trim each to the load-bearing why and drop
references that rot (commit hashes, issue numbers, file:line pointers,
"added for X" framing).
What stays:
* The LEFT JOIN duality at the two summarizeEpoch guard sites — the
bootstrap path and the post-accumulation path are non-obvious from
reading the SQL alone.
* The alert-rule message-text contract notes at each warn-log site —
a hidden constraint that an external alert rule pattern-matches on
these strings.
* The clampLag uint64 underflow rationale — the ~1.8e19 phantom value
is the surprising behavior the float64 conversion guards against.
* The defer ... LIFO ordering note in OnFinalityUpdated — would
surprise a reader who doesn't track Go's defer semantics.
What goes:
* Multi-paragraph "added in commit X / issue Y" framings.
* Cross-file inventories on the three Deprecated: MissedEpochs blocks.
* Test docstrings that re-state what the test name already says.
* Sub-test branch comments where the case name describes the scenario.
* Speculative future-proofing rationales (upstreamMetadata "exists so
future fields can be added").
Net: -170 comment lines across 12 files (-255 / +85). Build, gosilent
test, go vet, and gofmt all clean against the trimmed state.
Files:
services/finalizer/standard/metadata.go
services/proposerduties/standard/metadata.go
services/summarizer/standard/epoch.go
services/summarizer/standard/handler.go
services/summarizer/standard/handler_internal_test.go
services/summarizer/standard/metadata.go
services/summarizer/standard/metrics.go
services/summarizer/standard/validatorday.go
services/summarizer/standard/validatorday_internal_test.go
services/validators/standard/handler.go
services/validators/standard/handler_internal_test.go
services/validators/standard/metadata.go
The v1->v2 schema migration in commits ddbb6c4 / 32e48c1 made .golangci.yml v2-only (`version: "2"`, top-level `formatters`, `linters.default`, `linters.exclusions.presets`), but the workflow still pinned `golangci/golangci-lint-action@v6` which is locked to v1.x binaries. CI's lint job consequently failed at config-verify time before any linter ran, with the binary rejecting four v2-schema fields: jsonschema: "output.formats" does not validate ... got object, want array jsonschema: "linters.exclusions" ... additional properties 'presets' not allowed jsonschema: "linters" ... additional properties 'default', 'settings' not allowed jsonschema: "" ... additional properties 'formatters', 'version' not allowed Bumping to `@v8` switches the action to a v2-default binary (`latest` now resolves to the most recent v2.x release). The action's input schema for `version`, `args`, `only-new-issues`, and `skip-cache` is unchanged from v6. Once v2 actually runs, two newly-default-enabled linters fire on PR-touched lines: * `wsl_v5` — the next major version of the whitespace linter. `wsl` is already in the disable list (the codebase doesn't enforce wsl-style spacing rules); `wsl_v5` is a separate linter under v2's namespace and needs its own entry. * `noinlineerr` — flags `if err := f(); err != nil` against plain assignment. The codebase uses inline error handling pervasively; this is idiomatic Go style chaind has always leaned on, so adding the linter to the disable list reflects actual policy. Both additions match the existing pattern set by 32e48c1 (disable `gomodguard` rather than configure it; the codebase has no module-restriction policy to encode). Inserted alphabetically. Verification: golangci-lint config verify → no output (clean) golangci-lint run --new-from-rev=master ./... → 0 issues go build ./... → clean gosilent test ./services/... → ok all packages Files: .github/workflows/golangci-lint.yml .golangci.yml
Replace two inline errors.New("beacon returned no validator balances")
allocations with a package-level sentinel. Callers can now identify the
rejection via errors.Is rather than substring matching, and the two sites
are guaranteed to return the same value. Pairs with the existing
noValidatorBalancesMsg constant for the operator-facing log line.
The summarizer- and validators-side internal tests added on this branch swap the package-level log variable (and, for the summarizer handler test, the summarizerLagEpochs gauge) and restore the originals on teardown. Add a top-of-file note in each file warning future contributors not to add t.Parallel() to any test here, since the global mutation would race across goroutines. No behavioural change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the
chaind_summarizer_lag_epochs{pipeline}gauge so silent stalls in the epoch / block / validator pipelines become alertable within minutes rather than discoverable only on dashboard review (motivated by the 2026-04-07 Hoodi multi-week silent-stall incident). The metric shape and rationale are recorded in a new ADR (docs/adr/0001-summarizer-observability-shape.md), the first ADR in this repo — please confirm the MADR template choice in review. Commit 6 extends the zero-balance assertion from the epoch-summarizer (added in commit 3) to the day-summarizer invalidatorday.go, closing a residual landmine from the sameValidatorBalancesByEpochLEFT JOIN shape; see Notes for reviewers for the factoring rationale.The branch additionally lands the upstream fix for the silent-cursor-advance defect that produced the incident. A new ADR (
docs/adr/0002-validator-balance-fetcher-recovery-model.md) records the validator-balance fetcher's persistence-and-recovery contract; commit 8 lands a four-case bug-confirming test asserting the buggy shape; commit 9 inverts cases 1 and 2 to the post-fix contract by adding two no-empty-write guards inservices/validators/standard/handler.go. Together with the summarizer-side guards (commits 3 and 6), the silent-corruption pathway is closed end-to-end: empty / all-zero beacon responses are rejected at the fetcher boundary, and any zero-balance shape that slips through is rejected again at both summarizer layers as defence in depth.End-to-end verification was performed on a seeded chaind instance against Hoodi: DELETE all
t_validator_balancesrows for one finalized epoch, observe the gauge grow per finality tick, both alert-contractWarnlogs fire verbatim, the cursor stay pinned, no corrupt row written tot_epoch_summaries; restore the balances and confirm the cursor catches up.Test plan
/metricsexposeschaind_summarizer_lag_epochs{pipeline=epoch|block|validator}after the firstOnFinalityUpdatedtick on a running instance, with the# HELPline matchingdocs/prometheus.mdt_validator_balancesrows for one epoch are deleted, both alert-contractWarnlogs fire verbatim each tick:No validator balances available; cannot summarize epoch (will retry on next finality tick)andNot enough data to update summary; will retry on next finality ticksummarizer.standard.latest_epochdoes not advance past the stuck epoch while balances are missing; no row written tot_epoch_summariesfor that epochs.blockSummaries == false) produce no series for that labelBalanceis uniformly zero,services/validators/standard/handler.go:onEpochTransitionValidatorBalancesForEpochrefuses to advancemd.LatestBalancesEpoch, cancels any open transaction (case 2 only — case 1 fires beforeBeginTx), and emitsBeacon returned no validator balances; cannot persist epoch (will retry on next finality tick)at WarnLevel. Verified via the four-case table-driven test inservices/validators/standard/handler_internal_test.go— cases 1 and 2 invert from the bug-confirmation assertions in commit 8 to the post-fix contract assertions in commit 9; cases 3 (mixed validators) and 4 (beacon error) are unchanged positive/negative sanity tests. Commit 10's dead-block dedent is exercised by the same four cases without modification.Notes for reviewers
docs/adr/0002-validator-balance-fetcher-recovery-model.md) follows ADR 0001's style.len(balances) == 0guard atepoch.go:78is retained alongside the newActiveValidators > 0 && ActiveBalance == 0guard — the two cover the bootstrap path (validators table not yet populated) and the production-state path (LEFT JOIN returns zero-balance rows for empty balance set) respectively. Both share one warn-log message so the alert annotation contract is single-sourced.validatorBalancesPresentcollapsing both bootstrap (len == 0) and production-state (all-zero rows) shapes, slightly different factoring fromepoch.gobecause the day layer has no separate active-balance sum to compute. Warn-log text is reused verbatim so the same alert rule pattern-matches both sites; theTime("start_time"|"end_time")zerolog fields disambiguate the day layer for operators. Test (validatorday_internal_test.go) is a structural-impossibility witness: drivesaddValidatorBalanceSummariesdirectly with 1000 validators returningBalance=0at each call site, asserts(false, nil), warn fires verbatim, and theSetValidatorDaySummariesrecorder stays empty. Runs unconditionally undergosilent test ./..., no env gating.6cebf28) anchors the^TestService$regex in the test-skip pattern from commit 1.-test.skipis unanchored, so the previous"TestService"pattern would also skip any future test whose name contains that substring (e.g. a hypotheticalTestServiceConfiguration). One-line correctness fix.docs/prometheus.mdwith all sevenchaind_summarizer_*metrics, including six that pre-date this branch. If you'd prefer that backfill be split into its own PR, happy to drop it from this one — it's standalone in commit 4.7b5b083) is a doc-only drive-by: thePruneValidatorBalancesandPruneValidatorEpochSummariesinterface and impl comments claimed "up to (but not including) the given epoch" but the SQL isf_epoch <= $1(inclusive). Doc was updated to match the SQL (the inverse — changing the SQL — would silently retain one extra epoch per prune cycle, a behavior change with retention/disk implications, and no caller depends on the exclusive reading). Happy to drop this commit from the PR if you'd prefer it land separately — same offer as commit 4'sdocs/prometheus.mdbackfill.7e9f3b6) closes the root-cause investigation slice with three in-tree artifacts: ADR 0002 (docs/adr/0002-validator-balance-fetcher-recovery-model.md) recording the fetcher's persistence-and-recovery contract; a four-case bug-confirming table-driven test inservices/validators/standard/handler_internal_test.go(the first test file in this package, modelled on the summarizer's stub patterns); and// Deprecated:comments on the vestigialMissedEpochsfield acrossservices/validators/standard/metadata.go,services/finalizer/standard/metadata.go, andservices/proposerduties/standard/metadata.go— declared, JSON-migrated for backward compat, never populated by code in this repo. Theproposerdutiesdeprecation comment additionally flagsservice.go:166-204(handleMissed) as dormant code retained alongside the field. No behavioural change in this commit; the test confirms the bug exists and serves as an empirical anchor for ADR 0002's no-empty-write claim. ADR 0002 may need an "Amendment 1" section (same precedent as ADR 0001) depending on whether reviewers prefer "ADR matches as-shipped behaviour" or "ADR documents intended contract, fix tracks compliance"; happy to follow upstream preference.f6d1ef8) implements the no-empty-write fetcher fix that ADR 0002 records. Two narrow guards inservices/validators/standard/handler.go:onEpochTransitionValidatorBalancesForEpoch: (1) rejectlen(validators) == 0beforeBeginTx— no transaction opens on this path; (2) rejectlen(dbValidatorBalances) == 0after thevalidator.Balance == 0filter loop, callingcancel()on the open transaction so its writes do not commit. Both sites use a singlenoValidatorBalancesMsgpackage constant so the alert-contract is compile-time enforced — a rename produces a build error rather than silent test/prod drift. The error returned propagates throughonEpochTransitionValidatorBalancestoOnBeaconChainHeadUpdatedwhere the existinglog.Warn().Err(err).Msg("Failed to update validators")gives operator visibility. Cases 1 and 2 of the bug-confirming test (commit 8) are inverted in this commit to assert the post-fix contract; cases 3 and 4 are unchanged. Two structural notes flagged during this commit's review were addressed in follow-up commits (1df1613,5e16b88,8be1f3c,b5e3646) — see the next bullet.1df1613) removes the dead innerif s.balancesblock inservices/validators/standard/handler.go:onEpochTransitionValidatorBalancesForEpoch. Pure dedent: contents of the inner block are dedented one level; no statement is moved, added, deleted, or reordered. The block was structurally unreachable because the only production caller,onEpochTransitionValidatorBalances, already short-circuits withif !s.balances { return nil }. The four-case table-driven test from commits 8/9 continues to pass without modification. Three pre-existing minor findings noted during the commit-9 / commit-10 reviews were subsequently landed as their own narrow commits to keep blame attribution clean:5e16b88cachess.chainTime.FirstSlotOfEpoch(epoch)into afirstSlotlocal (was called three times for the same epoch);8be1f3cmovesBeginTxbelow thedbValidatorBalancesslice build so the transaction window no longer wraps the 1.3M-validator filter loop; andb5e3646aligns the inline-comment vocabulary on the second guard with ADR 0002's "empty-response" terminology used elsewhere in the file.1edcb45) measuresvalidators.standard.LatestBalancesEpoch - summarizer.standard.LastEpoch. With the fetcher fix in commit 9, when the beacon stays empty the fetcher refuses to advance, the summarizer also stalls (it depends ont_validator_balances), and the diff stays at zero — so the existing alert does not fire on a chronically-stuck fetcher. Whether to add a separate fetcher-vs-finality lag gauge as an alert path, or treat "summarizer freezes alongside fetcher" as the correct semantics for the existing gauge, is open for the maintainer's preference. Not a blocker for this PR; the no-empty-write guards make the stall visible at the source via the warn-log alert-contract regardless.errNoValidatorBalancesas a package-level sentinel inservices/validators/standard/handler.goso both rejection sites return the same value (callers canerrors.Israther than string-match), and add a top-of-file note to the three new internal-package test files documenting that they mutate package-level state (log,summarizerLagEpochs) and must not be run witht.Parallel().