Refactor ColumnStatistics: clean up interface, add isAscii#18170
Refactor ColumnStatistics: clean up interface, add isAscii#18170Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Conversation
…improve test coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR refactors Pinot’s pre-index ColumnStatistics API and related collectors/implementations to standardize derived stats (type, element lengths, ASCII flag, row lengths, sortedness) across default columns, mutable segments, and compacted (valid-doc-only) realtime segments, while expanding unit test coverage for additional logical and MV types.
Changes:
- Refactors
ColumnStatisticsto beFieldSpec-driven (adds default helpers likegetValueType(),isSingleValue(),isFixedLength(),isAscii(),getMaxRowLengthInBytes()and standardizesgetLengthOfLongestElement()). - Adds ASCII/length tracking to mutable forward indexes and propagates through mutable / compacted column statistics implementations.
- Expands and reorders assertions in stats-related tests (SV/MV, logical types,
isAscii, shortest/longest element lengths, empty-column behavior).
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/Utf8Utils.java | Replaces prior misplaced content with UTF-8 helper utilities and adds ASCII detection for UTF-8 bytes. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableForwardIndex.java | Adds isAscii() contract for mutable forward indexes. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/IndexCreationContext.java | Switches to getLengthOfLongestElement() when deriving creation context. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/ColumnStatistics.java | Refactors stats API with FieldSpec access, stored-type helpers, and new default behaviors. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/DefaultColumnStatisticsTest.java | New test coverage for default-column stats across SV/MV and logical types. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/mutable/CLPMutableForwardIndexTest.java | Updates CLP mutable forward index constructor usage. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/ColumnMetadataTest.java | Adjusts test to mock ColumnStatistics instead of constructing legacy DefaultColumnStatistics. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollectorTest.java | Expands SV/MV tests, adds isAscii tests, and updates renamed APIs. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollectorTest.java | Updates length API and bytes collector rename assertions. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/EmptyColumnStatisticsTest.java | New tests validating empty-column stats behavior across types. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/AbstractColumnStatisticsCollectorTest.java | Broadens coverage (logical types, isAscii, length tracking) and aligns assertion ordering with new API. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/DictionariesTest.java | Updates bytes stats collector rename usage. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableNoDictColumnStatisticsTest.java | New comprehensive tests for mutable no-dict stats (SV/MV, lengths, ASCII, sortedness). |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatisticsTest.java | Major expansion to cover more stored/logical types and ASCII/length semantics. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedNoDictColumnStatisticsTest.java | Expands compacted no-dict stats tests, including logical types and ASCII/length behavior. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentEntriesAboveThresholdTest.java | Updates fake forward index to implement new isAscii() method. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java | Updates bytes collector rename and length API usage in forward index handling. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/DefaultColumnStatistics.java | Refactors default-column stats to derive everything from FieldSpec (including lengths and ASCII). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java | Simplifies default-value index creation by using new DefaultColumnStatistics. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/StringColumnPreIndexStatsCollector.java | Adds ASCII tracking and standardizes length API; minor dedup-aware length updates. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/StatsCollectorUtil.java | Updates bytes collector rename in factory. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/SegmentPreIndexStatsCollectorImpl.java | Refactors internal map to store ColumnStatistics, adds empty-segment handling via EmptyColumnStatistics. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollector.java | Adds fixed-width handling, ASCII tracking, and standardized length/row-length semantics. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java | Updates bytes collector rename and length API, normalizes empty-length behavior. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/EmptyColumnStatistics.java | Introduces a dedicated stats implementation for empty columns. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java | Reworks columnar stats collection into a SegmentPreIndexStatsContainer with empty-segment handling. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BytesColumnPreIndexStatsCollector.java | Renames and updates bytes stats collector; adjusts length semantics for empty. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BigDecimalColumnPreIndexStatsCollector.java | Updates length API and simplifies max-row-length handling. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/AbstractColumnStatisticsCollector.java | Centralizes FieldSpec/stored-type exposure and standardizes sortedness/partition plumbing. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/ColumnarSegmentCreationDataSource.java | Updates to return SegmentPreIndexStatsContainer and constructs columnar stats container directly. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/VarByteSVMutableForwardIndex.java | Tracks shortest/longest element length and ASCII for string raw forward indexes. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/SameValueMutableForwardIndex.java | Adds UTF-8 length + ASCII reporting for same-value wrapper forward index. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/FixedByteSVMutableForwardIndex.java | Implements isAscii() (always false for fixed-width). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/FixedByteMVMutableForwardIndex.java | Implements isAscii() (always false for fixed-width). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndexV2.java | Adds ASCII + length tracking for CLP v2 and updates stored-type handling. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndex.java | Updates ctor signature, uses UTF-8 byte length for element-length stats, adds ASCII tracking. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java | Ensures same-value dictionary returns consistent bytes/value size and supports ASCII-sensitive paths. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/RealtimeSegmentStatsContainer.java | Adds empty-valid-doc handling via EmptyColumnStatistics and refactors map initialization. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableNoDictColumnStatistics.java | Implements new ColumnStatistics requirements (getFieldSpec, isAscii, length API). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java | Adds FieldSpec-based typing and ASCII/length scanning over dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedNoDictColumnStatistics.java | Computes all stats in one pass over valid docs (including ASCII and row length). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedColumnStatistics.java | Computes all stats in one pass over valid docs, including ASCII and row lengths. |
...g/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollector.java
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18170 +/- ##
============================================
+ Coverage 63.13% 63.17% +0.03%
- Complexity 1610 1616 +6
============================================
Files 3213 3215 +2
Lines 195730 195811 +81
Branches 30240 30275 +35
============================================
+ Hits 123583 123703 +120
+ Misses 62281 62216 -65
- Partials 9866 9892 +26
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
ColumnStatisticsinterface: add Javadoc, default methods (getValueType,isSingleValue,isFixedLength,isAscii,getMaxRowLengthInBytes)BytesColumnPredIndexStatsCollector→BytesColumnPreIndexStatsCollector, extractEmptyColumnStatisticsfor empty columns, simplifyDefaultColumnStatisticsto derive all stats fromFieldSpecMutableForwardIndexto track element lengths and ASCII flag, propagated throughMutableColumnStatisticsandMutableNoDictColumnStatisticsCompactedColumnStatisticsandCompactedNoDictColumnStatisticsto compute all stats (min/max, cardinality, element lengths, ASCII, sorted, row lengths) in a single pass over valid documentsColumnStatisticsAPI,getLengthOfShortestElementcoverageTest plan
pinot-segment-local,pinot-segment-spi)EmptyColumnStatisticsTest,DefaultColumnStatisticsTest,MutableNoDictColumnStatisticsTestMutableColumnStatisticsTest,CompactedColumnStatisticsTest,CompactedNoDictColumnStatisticsTest,AbstractColumnStatisticsCollectorTest,NoDictColumnStatisticsCollectorTest🤖 Generated with Claude Code