Skip to content

Fix BPE non-determinism and add comprehensive test suite#45

Merged
JingsongLi merged 8 commits into
apache:mainfrom
JingsongLi:bpe_fix
May 25, 2026
Merged

Fix BPE non-determinism and add comprehensive test suite#45
JingsongLi merged 8 commits into
apache:mainfrom
JingsongLi:bpe_fix

Conversation

@JingsongLi
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi commented May 25, 2026

Fix a non-determinism bug in BPE vocabulary builder where HashMap iteration order caused different byte output for identical input data when column names share common substrings. The fix uses the pair key as a deterministic tie-breaker when multiple pairs have equal frequency.

Add 150+ new tests across Rust, Java, and Python covering:

  • Massive data volumes (up to 10M rows, 1000 columns)
  • 20 data pattern varieties (sawtooth, fibonacci, unicode, sparse, etc.)
  • Encoding/compression/projection interactions (30 tests)
  • Determinism and re-roundtrip stability (18 tests)
  • Concurrent multi-threaded reading (5 tests)
  • Binary format byte-level verification (10 tests)
  • Statistics min/max/null_count accuracy (11 tests)
  • Cross-language Rust/Java/Python interoperability (25 tests)

JingsongLi and others added 8 commits May 25, 2026 13:18
Fix a non-determinism bug in BPE vocabulary builder where HashMap
iteration order caused different byte output for identical input data
when column names share common substrings. The fix uses the pair key
as a deterministic tie-breaker when multiple pairs have equal frequency.

Add 150+ new tests across Rust, Java, and Python covering:
- Massive data volumes (up to 10M rows, 1000 columns)
- 20 data pattern varieties (sawtooth, fibonacci, unicode, sparse, etc.)
- Encoding/compression/projection interactions (30 tests)
- Determinism and re-roundtrip stability (18 tests)
- Fuzz robustness with corrupted input (12 tests)
- Concurrent multi-threaded reading (5 tests)
- Binary format byte-level verification (10 tests)
- Statistics min/max/null_count accuracy (11 tests)
- Cross-language Rust/Java/Python interoperability (25 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ne fuzz tests

Interop read tests depend on files written by other languages' test suites,
which are not available in CI where each language runs in a separate job.
Java tests now use Assume.assumeTrue, Python tests use pytest.mark.skipif,
and Rust tests use #[ignore]. Five fuzz tests that can trigger OOM abort
on CI are also marked #[ignore].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs all languages in a single job so interop files persist across steps:
1. Rust writes golden .mosaic files to /tmp/mosaic_interop/
2. Java reads and verifies Rust-written files, writes java_written.mosaic
3. Python reads and verifies Rust-written files, writes python_written.mosaic
4. Rust reads and verifies Java/Python-written files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of a separate interop workflow, add Rust interop_write_test steps
to java-test and python-test jobs so golden files are available in the
same job. Remove all skip/ignore guards since files are always present.
After Java/Python tests write their own files, Rust reads them back.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All fuzz tests (corrupted/random data) are marked #[ignore] because they
can trigger OOM abort (SIGABRT) which catch_unwind cannot handle. The 5
concurrent reader tests using valid data remain active.

Interop tests are integrated into java-test and python-test CI jobs:
Rust writes golden files first, then Java/Python read them and write
their own files, then Rust reads those back. No skip/ignore needed
since all steps run in the same job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fuzz tests feed corrupted data to the reader, which can trigger OOM
abort that catch_unwind cannot handle. Not suitable for CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. The BPE vocabulary tie-breaker now uses the pair key together with frequency, so equal-frequency pairs no longer depend on HashMap iteration order.

I reviewed the CI changes and the added Rust/Java/Python coverage, including the shared-substring determinism regression and cross-language interop flow. I also ran the targeted Rust checks locally:

cargo test -p paimon-mosaic-core bpe -- --nocapture
cargo test -p paimon-mosaic-core --test determinism_test test_deterministic_shared_substring_columns -- --nocapture

Both passed, and GitHub CI is green. +1.

@JingsongLi JingsongLi merged commit 83acb46 into apache:main May 25, 2026
5 checks passed
XiaoHongbo-Hope pushed a commit that referenced this pull request May 25, 2026
Fix a non-determinism bug in BPE vocabulary builder where HashMap
iteration order caused different byte output for identical input data
when column names share common substrings. The fix uses the pair key
as a deterministic tie-breaker when multiple pairs have equal frequency.

Add 150+ new tests across Rust, Java, and Python covering:
- Massive data volumes (up to 10M rows, 1000 columns)
- 20 data pattern varieties (sawtooth, fibonacci, unicode, sparse, etc.)
- Encoding/compression/projection interactions (30 tests)
- Determinism and re-roundtrip stability (18 tests)
- Concurrent multi-threaded reading (5 tests)
- Binary format byte-level verification (10 tests)
- Statistics min/max/null_count accuracy (11 tests)
- Cross-language Rust/Java/Python interoperability (25 tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants