Shrinked QEB#241
Conversation
There was a problem hiding this comment.
Review in the context of .wiki/adr/2026-02-24-shrink-on-rotation.md
TL;DR
The PR matches the ADR overall: Normal/Fixed/Shrunk states in QueryableEncodingBimap, ShrinkState, PostShrinkSnapshotResolver, SymbolSource / IndexWriteContext, ShrinkAwareSnapshotLSS, and the shrink_shard_copier feature flag wired through FreezeAndCopyAddedSeries → FinalizeCopyAndShrink. Test coverage is solid and the style remarks from @cherep58 / @u-veles-a are largely addressed.
Items that anchor to specific lines are moved to inline comments. The non-anchored items are listed below.
2. Stale reverse_index_ / trie_index_ in Fixed / Shrunk → phantom ids in selector results
prune_hidden_series_before_fixed_state cleans up ls_id_hash_set_ and ls_id_set_ only; Querier::query walks reverse_index() / trie_index(). After resolving the empty-LabelSet discussion (see the inline comment on is_hidden_in_fixed_state), this is "correct" in Fixed (labels are real). In Shrunk it relies on the Go layer to drop series with an empty LabelSet.
Decision: accepted as a minor risk and recorded in the ADR. For /api/v1/series the response may include ids of series whose labels are already hidden; for all other endpoints the rows are filtered out by DataStorage and not user-visible.
Action item: add an e2e test that queries the LSS in Shrunk with a matcher hitting both mapped and unmapped ids, demonstrating that an empty LabelSet is filtered correctly on the Go side.
10. is_normal() / is_fixed() contract — externally, Fixed ≡ Normal
After the empty-LabelSet decision (#1), no QEB method branches on is_fixed() (only state-transition asserts); is_normal() is not used in logic at all. Fixed is effectively an internal-only marker meaning "pending_shrink_boundary is set, awaiting finalize_copy_and_shrink".
| Method | Normal | Fixed | Shrunk |
|---|---|---|---|
find / find_or_emplace |
yes | yes (hash-set already pruned, search path identical) | yes |
resolve_impl(id) |
storage_composite(id) |
storage_composite(id) (after #1) |
resolve_shrunk_series |
symbol_source_for_series(id) |
kCurrent |
kCurrent |
kCurrent for id ≥ shift, kSnapshot otherwise |
for_each_snapshot_symbol_id |
no-op | no-op | walks resolver |
Action item — please pick a direction and record it in the ADR:
- (A) Keep the 3-state enum, add a docstring above
enum State/is_fixed(): "Fixed is externally indistinguishable from Normal; it is an internal marker that recordspending_shrink_boundarybetweenset_pending_shrink_boundaryandfinalize_copy_and_shrink." - (B) Replace
enum Statewithbool is_shrunk_+std::optional<uint32_t> pending_shrink_boundary; "Fixed = boundary set && !shrunk" becomes a structural invariant rather than an enum value. - (C) Leave as is without comments (nit — three states were chosen intentionally).
11. Grouping of logic inside ShrinkState
@cherep58's note "operations with shrink_state may be moved to a method of ShrinkState" is largely addressed, but mark_active still lives in QEB though it is conceptually about series activity. Optional: move it to ShrinkState::mark_active(added_series_, ls_id), or leave it with a one-line comment explaining the choice.
13. Test coverage of QueryableEncodingBimap
Missing cases:
- repeated
set_pending_shrink_boundary(Fixed → Fixed) — a legal path once the assert is relaxed (see inline comment onenter_fixed_state); finalize_copy_and_shrinkwithout a priorset_pending_shrink_boundary(expected behaviour, likely an assert);resolve_shrunk_series(id < shift)whenpost_shrink_snapshot_resolver == nullptr(must yield the emptyLabelSet).
14. Test coverage of RotateLSSSuite
- query against the old LSS after
FinalizeCopyAndShrink, when some pre-boundary series are inactive; - a lifetime test where
mappedSnapshotis explicitly released while the snapshot is still in use, without artificialruntime.KeepAlive(supports the inline comment onTypedPostShrinkSnapshotResolver).
15. runtime.KeepAlive in Go wrappers
FinalizeCopyAndShrink keeps the snapshot alive via runtime.KeepAlive(resolveSnapshot). Please verify that newToOldMapping is similarly retained for the duration of the C++ call. Also, after FinalizeCopyAndShrink destination.dstSrcLsIdsMapping is no longer used and can be set to nil (currently it stays reachable for no reason).
Pre-merge checklist for enabling the feature by default
- Apply the cleanup from the inline comment on
ShrinkState::is_hidden_in_fixed_state(delete the dead helper) and pin the "source LSS rejectsfind_or_emplaceinFixed/Shrunk" invariant; ADR already updated. - Add the Go-layer empty-
LabelSetfiltering test from #2. - Pin
Snapshotownership in the resolver (see inline comment onTypedPostShrinkSnapshotResolver, assigned to @gshigin). - Pick a direction for #10 and record it in the ADR.
- Tighten
ExportSymbolIdhashing (see inline comment) and decouplemark_activefrom the WAL decoder (see inline comment). - Add the freeze-phase histogram and benchmark (see inline comment on
prune_hidden_series_before_fixed_state). - Add tests for
Fixed → Fixedretry (#13) and lifetime scenarios (#14).
vporoshok
left a comment
There was a problem hiding this comment.
Inline notes anchored to specific lines, complementing the consolidated review above.
Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
vporoshok
left a comment
There was a problem hiding this comment.
Second review round — approve with nits
Thanks for working through the previous round so thoroughly. All nine inline threads from the previous review have been resolved by code changes, the freeze benchmark and the WAL mark_active recovery test are very welcome additions, and the resolver lifetime fix is a clean architectural solution.
Resolved by code
- ✅ #1
is_hidden_in_fixed_stateremoved (9f5b9d3). - ✅ #3
TypedPostShrinkSnapshotResolvernow ownsSnapshot snapshot_{}by value; combined withrequires kIsReadOnlycopy-assignment forGenericDecodingTable/LabelSet::storage_type/SnapshotLSS/ShrinkAwareSnapshotLSSandSharedSpanWithChangesDetectionstorage, the copy is a cheap refcount bump and the resolver safely outlives the source snapshot (7a33f13). - ✅ #4 Fixed → Fixed retry supported; asserts relaxed to
state != kShrunk;FixedStateCanBeEnteredTwiceForRetrytest added (a2fe5a3, 77080b9). - ✅ #6 Unlocked-write invariant documented on
LSSfields,CopyAddedSeriesTo,FreezeAndCopyAddedSeries, andShard.DstSrcLsIdsMapping. - ✅ #8
ExportSymbolIdis now bit-packed (8 bytes, no#pragma pack), the hasher has no bit overlap (phmap_mix(packed_name_id) → phmap_mix(hash + value_id)), and the symbol table storesExportSymbolIdrather thanstring_view. Bonus:SymbolIdsCollectorextracted in 9de5856. - ✅ #9
mark_activelifted out ofdecoder.hin all overloads; recovery testAddedSeriesMatchesSeriesBeforeFirstShrinkadded (4a3ad8e, 7e6c34a). - ✅ #12 assert added to
for_each_snapshot_symbol_id. - ✅ #17 stale godoc removed;
(*shard.LSS).SetPendingShrinkBoundarycollapsed intoFreezeAndCopyAddedSeries.
Carried over / open
- #7 (observability) — see inline on
rotator.go. The freeze benchmark covers offline tuning; we still need a production histogram for the freeze phase. - #10 (Normal/Fixed external contract) — addressed on the wiki side. I've added an explicit «External state contract» section to the ADR clarifying that
NormalandFixedare observationally identical and that!is_shrunk()is the right check for «can I use this LSS as usual?». No code change needed.
Minor inline nits
The remaining inline comments are documentation / DRY suggestions on the new code (TypedPostShrinkSnapshotResolver, storage_type::operator=, freeze heuristic constants, ExportSymbolId::is_empty(), SymbolIdsCollector rebuild loop, benchmark aliases). None of them are blockers.
Approving — please address the rotator histogram (#7) and the nits at your convenience.
| @@ -136,7 +140,12 @@ func (s *Rotator[TTask, TShard, TGoShard, THead]) rotate( | |||
| s.headAddedSeriesCopier(oldHead, newHead) | |||
There was a problem hiding this comment.
Carrying over the second half of point 7 from the previous round.
The freeze benchmark + rebuild-vs-erase heuristic are great for offline tuning, but we still have no production visibility into freeze-phase latency. Could you instrument this call as a Prometheus histogram — e.g. prompp_rotator_freeze_duration_seconds — observed around s.headAddedSeriesCopier(oldHead, newHead)? Without it operators won't see freeze stalls in time and the heuristic constants can drift from reality.
Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
… into shrinked_encoding_bimap Signed-off-by: Alexandr Yudin <57181751+u-veles-a@users.noreply.github.com>
No description provided.