Skip to content

Refactor ColumnStatistics: clean up interface, add isAscii#18170

Open
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:column_stats
Open

Refactor ColumnStatistics: clean up interface, add isAscii#18170
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:column_stats

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang commented Apr 11, 2026

Summary

  • Refactor ColumnStatistics interface: add Javadoc, default methods (getValueType, isSingleValue, isFixedLength, isAscii, getMaxRowLengthInBytes)
  • Clean up stats collectors: rename BytesColumnPredIndexStatsCollectorBytesColumnPreIndexStatsCollector, extract EmptyColumnStatistics for empty columns, simplify DefaultColumnStatistics to derive all stats from FieldSpec
  • Enhance MutableForwardIndex to track element lengths and ASCII flag, propagated through MutableColumnStatistics and MutableNoDictColumnStatistics
  • Improve CompactedColumnStatistics and CompactedNoDictColumnStatistics to compute all stats (min/max, cardinality, element lengths, ASCII, sorted, row lengths) in a single pass over valid documents
  • Significantly expand test coverage: add logical type tests (BOOLEAN, TIMESTAMP, JSON), MV tests for all types, assertion ordering aligned with ColumnStatistics API, getLengthOfShortestElement coverage

Test plan

  • All existing unit tests pass (pinot-segment-local, pinot-segment-spi)
  • New tests: EmptyColumnStatisticsTest, DefaultColumnStatisticsTest, MutableNoDictColumnStatisticsTest
  • Enhanced tests: MutableColumnStatisticsTest, CompactedColumnStatisticsTest, CompactedNoDictColumnStatisticsTest, AbstractColumnStatisticsCollectorTest, NoDictColumnStatisticsCollectorTest

🤖 Generated with Claude Code

…improve test coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Jackie-Jiang Jackie-Jiang added ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior backward-incompat Introduces a backward-incompatible API or behavior change labels Apr 11, 2026
@Jackie-Jiang Jackie-Jiang changed the title Refactor ColumnStatistics: clean up interfaces, implementations, and improve test coverage Refactor ColumnStatistics: clean up interface, add isAscii Apr 11, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Pinot’s pre-index ColumnStatistics API and related collectors/implementations to standardize derived stats (type, element lengths, ASCII flag, row lengths, sortedness) across default columns, mutable segments, and compacted (valid-doc-only) realtime segments, while expanding unit test coverage for additional logical and MV types.

Changes:

  • Refactors ColumnStatistics to be FieldSpec-driven (adds default helpers like getValueType(), isSingleValue(), isFixedLength(), isAscii(), getMaxRowLengthInBytes() and standardizes getLengthOfLongestElement()).
  • Adds ASCII/length tracking to mutable forward indexes and propagates through mutable / compacted column statistics implementations.
  • Expands and reorders assertions in stats-related tests (SV/MV, logical types, isAscii, shortest/longest element lengths, empty-column behavior).

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pinot-spi/src/main/java/org/apache/pinot/spi/utils/Utf8Utils.java Replaces prior misplaced content with UTF-8 helper utilities and adds ASCII detection for UTF-8 bytes.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableForwardIndex.java Adds isAscii() contract for mutable forward indexes.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/IndexCreationContext.java Switches to getLengthOfLongestElement() when deriving creation context.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/ColumnStatistics.java Refactors stats API with FieldSpec access, stored-type helpers, and new default behaviors.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/DefaultColumnStatisticsTest.java New test coverage for default-column stats across SV/MV and logical types.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/mutable/CLPMutableForwardIndexTest.java Updates CLP mutable forward index constructor usage.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/ColumnMetadataTest.java Adjusts test to mock ColumnStatistics instead of constructing legacy DefaultColumnStatistics.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollectorTest.java Expands SV/MV tests, adds isAscii tests, and updates renamed APIs.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollectorTest.java Updates length API and bytes collector rename assertions.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/EmptyColumnStatisticsTest.java New tests validating empty-column stats behavior across types.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/impl/stats/AbstractColumnStatisticsCollectorTest.java Broadens coverage (logical types, isAscii, length tracking) and aligns assertion ordering with new API.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/creator/DictionariesTest.java Updates bytes stats collector rename usage.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableNoDictColumnStatisticsTest.java New comprehensive tests for mutable no-dict stats (SV/MV, lengths, ASCII, sortedness).
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatisticsTest.java Major expansion to cover more stored/logical types and ASCII/length semantics.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedNoDictColumnStatisticsTest.java Expands compacted no-dict stats tests, including logical types and ASCII/length behavior.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentEntriesAboveThresholdTest.java Updates fake forward index to implement new isAscii() method.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java Updates bytes collector rename and length API usage in forward index handling.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/DefaultColumnStatistics.java Refactors default-column stats to derive everything from FieldSpec (including lengths and ASCII).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java Simplifies default-value index creation by using new DefaultColumnStatistics.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/StringColumnPreIndexStatsCollector.java Adds ASCII tracking and standardizes length API; minor dedup-aware length updates.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/StatsCollectorUtil.java Updates bytes collector rename in factory.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/SegmentPreIndexStatsCollectorImpl.java Refactors internal map to store ColumnStatistics, adds empty-segment handling via EmptyColumnStatistics.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollector.java Adds fixed-width handling, ASCII tracking, and standardized length/row-length semantics.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/MapColumnPreIndexStatsCollector.java Updates bytes collector rename and length API, normalizes empty-length behavior.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/EmptyColumnStatistics.java Introduces a dedicated stats implementation for empty columns.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/ColumnarSegmentPreIndexStatsContainer.java Reworks columnar stats collection into a SegmentPreIndexStatsContainer with empty-segment handling.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BytesColumnPreIndexStatsCollector.java Renames and updates bytes stats collector; adjusts length semantics for empty.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BigDecimalColumnPreIndexStatsCollector.java Updates length API and simplifies max-row-length handling.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/AbstractColumnStatisticsCollector.java Centralizes FieldSpec/stored-type exposure and standardizes sortedness/partition plumbing.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/ColumnarSegmentCreationDataSource.java Updates to return SegmentPreIndexStatsContainer and constructs columnar stats container directly.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/VarByteSVMutableForwardIndex.java Tracks shortest/longest element length and ASCII for string raw forward indexes.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/SameValueMutableForwardIndex.java Adds UTF-8 length + ASCII reporting for same-value wrapper forward index.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/FixedByteSVMutableForwardIndex.java Implements isAscii() (always false for fixed-width).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/FixedByteMVMutableForwardIndex.java Implements isAscii() (always false for fixed-width).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndexV2.java Adds ASCII + length tracking for CLP v2 and updates stored-type handling.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndex.java Updates ctor signature, uses UTF-8 byte length for element-length stats, adds ASCII tracking.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java Ensures same-value dictionary returns consistent bytes/value size and supports ASCII-sensitive paths.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/RealtimeSegmentStatsContainer.java Adds empty-valid-doc handling via EmptyColumnStatistics and refactors map initialization.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableNoDictColumnStatistics.java Implements new ColumnStatistics requirements (getFieldSpec, isAscii, length API).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java Adds FieldSpec-based typing and ASCII/length scanning over dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedNoDictColumnStatistics.java Computes all stats in one pass over valid docs (including ASCII and row length).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/CompactedColumnStatistics.java Computes all stats in one pass over valid docs, including ASCII and row lengths.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 11, 2026

Codecov Report

❌ Patch coverage is 80.33981% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.17%. Comparing base (34d3fd6) to head (bdcdf9a).

Files with missing lines Patch % Lines
...l/stats/ColumnarSegmentPreIndexStatsContainer.java 68.57% 6 Missing and 5 partials ⚠️
.../impl/stats/SegmentPreIndexStatsCollectorImpl.java 63.33% 10 Missing and 1 partial ⚠️
...he/pinot/segment/spi/creator/ColumnStatistics.java 0.00% 10 Missing ⚠️
...me/impl/dictionary/SameValueMutableDictionary.java 53.33% 7 Missing ⚠️
.../realtime/impl/forward/CLPMutableForwardIndex.java 66.66% 2 Missing and 2 partials ⚠️
...ealtime/impl/forward/CLPMutableForwardIndexV2.java 76.47% 3 Missing and 1 partial ⚠️
...or/impl/stats/NoDictColumnStatisticsCollector.java 91.48% 1 Missing and 3 partials ⚠️
.../loader/defaultcolumn/DefaultColumnStatistics.java 91.48% 2 Missing and 2 partials ⚠️
...ime/converter/stats/CompactedColumnStatistics.java 88.46% 0 Missing and 3 partials ⚠️
...ltime/converter/stats/MutableColumnStatistics.java 91.17% 0 Missing and 3 partials ⚠️
... and 14 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18170      +/-   ##
============================================
+ Coverage     63.13%   63.17%   +0.03%     
- Complexity     1610     1616       +6     
============================================
  Files          3213     3215       +2     
  Lines        195730   195811      +81     
  Branches      30240    30275      +35     
============================================
+ Hits         123583   123703     +120     
+ Misses        62281    62216      -65     
- Partials       9866     9892      +26     
Flag Coverage Δ
custom-integration1 100.00% <ø> (?)
integration 100.00% <ø> (+100.00%) ⬆️
integration1 100.00% <ø> (?)
integration2 0.00% <ø> (ø)
java-11 63.12% <80.33%> (+0.01%) ⬆️
java-21 63.15% <80.33%> (+0.05%) ⬆️
temurin 63.17% <80.33%> (+0.03%) ⬆️
unittests 63.17% <80.33%> (+0.03%) ⬆️
unittests1 55.34% <23.78%> (-0.03%) ⬇️
unittests2 34.82% <80.33%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backward-incompat Introduces a backward-incompatible API or behavior change ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants