Skip to content

Shrinked QEB#241

Open
gshigin wants to merge 150 commits into
ppfrom
shrinked_encoding_bimap
Open

Shrinked QEB#241
gshigin wants to merge 150 commits into
ppfrom
shrinked_encoding_bimap

Conversation

@gshigin
Copy link
Copy Markdown
Collaborator

@gshigin gshigin commented Feb 19, 2026

No description provided.

@gshigin gshigin requested a review from cherep58 February 19, 2026 15:38
@gshigin gshigin self-assigned this Feb 19, 2026
Copy link
Copy Markdown
Collaborator

@vporoshok vporoshok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review in the context of .wiki/adr/2026-02-24-shrink-on-rotation.md

TL;DR

The PR matches the ADR overall: Normal/Fixed/Shrunk states in QueryableEncodingBimap, ShrinkState, PostShrinkSnapshotResolver, SymbolSource / IndexWriteContext, ShrinkAwareSnapshotLSS, and the shrink_shard_copier feature flag wired through FreezeAndCopyAddedSeriesFinalizeCopyAndShrink. Test coverage is solid and the style remarks from @cherep58 / @u-veles-a are largely addressed.

Items that anchor to specific lines are moved to inline comments. The non-anchored items are listed below.


2. Stale reverse_index_ / trie_index_ in Fixed / Shrunk → phantom ids in selector results

prune_hidden_series_before_fixed_state cleans up ls_id_hash_set_ and ls_id_set_ only; Querier::query walks reverse_index() / trie_index(). After resolving the empty-LabelSet discussion (see the inline comment on is_hidden_in_fixed_state), this is "correct" in Fixed (labels are real). In Shrunk it relies on the Go layer to drop series with an empty LabelSet.

Decision: accepted as a minor risk and recorded in the ADR. For /api/v1/series the response may include ids of series whose labels are already hidden; for all other endpoints the rows are filtered out by DataStorage and not user-visible.

Action item: add an e2e test that queries the LSS in Shrunk with a matcher hitting both mapped and unmapped ids, demonstrating that an empty LabelSet is filtered correctly on the Go side.

10. is_normal() / is_fixed() contract — externally, FixedNormal

After the empty-LabelSet decision (#1), no QEB method branches on is_fixed() (only state-transition asserts); is_normal() is not used in logic at all. Fixed is effectively an internal-only marker meaning "pending_shrink_boundary is set, awaiting finalize_copy_and_shrink".

Method Normal Fixed Shrunk
find / find_or_emplace yes yes (hash-set already pruned, search path identical) yes
resolve_impl(id) storage_composite(id) storage_composite(id) (after #1) resolve_shrunk_series
symbol_source_for_series(id) kCurrent kCurrent kCurrent for id ≥ shift, kSnapshot otherwise
for_each_snapshot_symbol_id no-op no-op walks resolver

Action item — please pick a direction and record it in the ADR:

  • (A) Keep the 3-state enum, add a docstring above enum State / is_fixed(): "Fixed is externally indistinguishable from Normal; it is an internal marker that records pending_shrink_boundary between set_pending_shrink_boundary and finalize_copy_and_shrink."
  • (B) Replace enum State with bool is_shrunk_ + std::optional<uint32_t> pending_shrink_boundary; "Fixed = boundary set && !shrunk" becomes a structural invariant rather than an enum value.
  • (C) Leave as is without comments (nit — three states were chosen intentionally).

11. Grouping of logic inside ShrinkState

@cherep58's note "operations with shrink_state may be moved to a method of ShrinkState" is largely addressed, but mark_active still lives in QEB though it is conceptually about series activity. Optional: move it to ShrinkState::mark_active(added_series_, ls_id), or leave it with a one-line comment explaining the choice.

13. Test coverage of QueryableEncodingBimap

Missing cases:

  • repeated set_pending_shrink_boundary (Fixed → Fixed) — a legal path once the assert is relaxed (see inline comment on enter_fixed_state);
  • finalize_copy_and_shrink without a prior set_pending_shrink_boundary (expected behaviour, likely an assert);
  • resolve_shrunk_series(id < shift) when post_shrink_snapshot_resolver == nullptr (must yield the empty LabelSet).

14. Test coverage of RotateLSSSuite

  • query against the old LSS after FinalizeCopyAndShrink, when some pre-boundary series are inactive;
  • a lifetime test where mappedSnapshot is explicitly released while the snapshot is still in use, without artificial runtime.KeepAlive (supports the inline comment on TypedPostShrinkSnapshotResolver).

15. runtime.KeepAlive in Go wrappers

FinalizeCopyAndShrink keeps the snapshot alive via runtime.KeepAlive(resolveSnapshot). Please verify that newToOldMapping is similarly retained for the duration of the C++ call. Also, after FinalizeCopyAndShrink destination.dstSrcLsIdsMapping is no longer used and can be set to nil (currently it stays reachable for no reason).


Pre-merge checklist for enabling the feature by default

  1. Apply the cleanup from the inline comment on ShrinkState::is_hidden_in_fixed_state (delete the dead helper) and pin the "source LSS rejects find_or_emplace in Fixed/Shrunk" invariant; ADR already updated.
  2. Add the Go-layer empty-LabelSet filtering test from #2.
  3. Pin Snapshot ownership in the resolver (see inline comment on TypedPostShrinkSnapshotResolver, assigned to @gshigin).
  4. Pick a direction for #10 and record it in the ADR.
  5. Tighten ExportSymbolId hashing (see inline comment) and decouple mark_active from the WAL decoder (see inline comment).
  6. Add the freeze-phase histogram and benchmark (see inline comment on prune_hidden_series_before_fixed_state).
  7. Add tests for Fixed → Fixed retry (#13) and lifetime scenarios (#14).

Copy link
Copy Markdown
Collaborator

@vporoshok vporoshok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline notes anchored to specific lines, complementing the consolidated review above.

Comment thread pp/series_index/queryable_encoding_bimap.h Outdated
Comment thread pp/series_index/queryable_encoding_bimap.h
Comment thread pp/series_index/queryable_encoding_bimap.h
Comment thread pp/series_index/queryable_encoding_bimap.h
Comment thread pp/series_index/post_shrink_snapshot_resolver.h Outdated
Comment thread pp/series_index/prometheus/tsdb/index/index_write_context.h Outdated
Comment thread pp/wal/decoder.h Outdated
Comment thread pp/go/storage/head/shard/lss.go
Comment thread pp/go/storage/head/shard/lss.go Outdated
Copy link
Copy Markdown
Collaborator

@vporoshok vporoshok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second review round — approve with nits

Thanks for working through the previous round so thoroughly. All nine inline threads from the previous review have been resolved by code changes, the freeze benchmark and the WAL mark_active recovery test are very welcome additions, and the resolver lifetime fix is a clean architectural solution.

Resolved by code

  • #1 is_hidden_in_fixed_state removed (9f5b9d3).
  • #3 TypedPostShrinkSnapshotResolver now owns Snapshot snapshot_{} by value; combined with requires kIsReadOnly copy-assignment for GenericDecodingTable / LabelSet::storage_type / SnapshotLSS / ShrinkAwareSnapshotLSS and SharedSpanWithChangesDetection storage, the copy is a cheap refcount bump and the resolver safely outlives the source snapshot (7a33f13).
  • #4 Fixed → Fixed retry supported; asserts relaxed to state != kShrunk; FixedStateCanBeEnteredTwiceForRetry test added (a2fe5a3, 77080b9).
  • #6 Unlocked-write invariant documented on LSS fields, CopyAddedSeriesTo, FreezeAndCopyAddedSeries, and Shard.DstSrcLsIdsMapping.
  • #8 ExportSymbolId is now bit-packed (8 bytes, no #pragma pack), the hasher has no bit overlap (phmap_mix(packed_name_id) → phmap_mix(hash + value_id)), and the symbol table stores ExportSymbolId rather than string_view. Bonus: SymbolIdsCollector extracted in 9de5856.
  • #9 mark_active lifted out of decoder.h in all overloads; recovery test AddedSeriesMatchesSeriesBeforeFirstShrink added (4a3ad8e, 7e6c34a).
  • #12 assert added to for_each_snapshot_symbol_id.
  • #17 stale godoc removed; (*shard.LSS).SetPendingShrinkBoundary collapsed into FreezeAndCopyAddedSeries.

Carried over / open

  • #7 (observability) — see inline on rotator.go. The freeze benchmark covers offline tuning; we still need a production histogram for the freeze phase.
  • #10 (Normal/Fixed external contract) — addressed on the wiki side. I've added an explicit «External state contract» section to the ADR clarifying that Normal and Fixed are observationally identical and that !is_shrunk() is the right check for «can I use this LSS as usual?». No code change needed.

Minor inline nits

The remaining inline comments are documentation / DRY suggestions on the new code (TypedPostShrinkSnapshotResolver, storage_type::operator=, freeze heuristic constants, ExportSymbolId::is_empty(), SymbolIdsCollector rebuild loop, benchmark aliases). None of them are blockers.

Approving — please address the rotator histogram (#7) and the nits at your convenience.

Comment thread pp/series_index/post_shrink_snapshot_resolver.h Outdated
Comment thread pp/primitives/snug_composites_filaments.h
Comment thread pp/series_index/queryable_encoding_bimap.h Outdated
Comment thread pp/series_index/prometheus/tsdb/index/index_write_context.h
Comment thread pp/series_index/prometheus/tsdb/index/index_write_context.h Outdated
Comment thread pp/series_index/prometheus/tsdb/index/index_write_context.h Outdated
@@ -136,7 +140,12 @@ func (s *Rotator[TTask, TShard, TGoShard, THead]) rotate(
s.headAddedSeriesCopier(oldHead, newHead)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carrying over the second half of point 7 from the previous round.

The freeze benchmark + rebuild-vs-erase heuristic are great for offline tuning, but we still have no production visibility into freeze-phase latency. Could you instrument this call as a Prometheus histogram — e.g. prompp_rotator_freeze_duration_seconds — observed around s.headAddedSeriesCopier(oldHead, newHead)? Without it operators won't see freeze stalls in time and the heuristic constants can drift from reality.

gshigin and others added 5 commits June 2, 2026 08:12
Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
… into shrinked_encoding_bimap

Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants