Skip to content

Conversation

@freakyzoidberg
Copy link
Member

@freakyzoidberg freakyzoidberg commented Jul 24, 2025

Fix build following conflict introduced by concurrent change in #667 and #666

This pull request introduces uniformisation with the rest of the library to the CountMinSketch implementation, including improved serialization, memory efficiency, and validation checks. It also updates test cases to reflect these changes. The most important changes are grouped below by theme:

Core Improvements to CountMinSketch:

  • Added a thread-local MemorySegment for efficient long-to-byte conversions, reducing allocations in hot paths (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].
  • Introduced robust validation for numHashes and numBuckets parameters, including checks for overflow and better error messages (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL60-R135).
  • Replaced ByteBuffer usage with MemorySegment for serialization/deserialization, improving performance and simplifying code (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL387-R492).
  • Added a new toByteArray method for serialization and refactored the deserialization logic for better maintainability (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].

Code Simplifications and Optimizations:

  • Refactored methods like getHashes, update, and getEstimate to use MemorySegment for byte array conversions and improved readability with consistent use of final variables (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2] [3] [4].
  • Simplified loop logic in methods like getEstimate to avoid redundant processing (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL242-R291).

Test Case Updates:

  • Updated test cases to use the new toByteArray method for serialization and adjusted assertions for compatibility with the refactored deserialization logic (src/test/java/org/apache/datasketches/count/CountMinSketchTest.java, [1] [2].
  • Modified cross-language test in CpcSketchCrossLanguageTest to use MemorySegment for consistency with the updated serialization approach (src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java, src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.javaL92-R92).

@freakyzoidberg freakyzoidberg changed the title Fix build following cms/cpc recent PR Fix build + minor improvement for validation and allocation following cms/cpc recent PR Jul 24, 2025
@freakyzoidberg freakyzoidberg changed the title Fix build + minor improvement for validation and allocation following cms/cpc recent PR Fix build + minor improvement for validation and allocation following in CMS and CPCxLang Test Jul 24, 2025
@freakyzoidberg freakyzoidberg requested a review from leerho July 24, 2025 10:10
@jmalkin
Copy link
Contributor

jmalkin commented Jul 24, 2025

Not thrilled with the use of bytebuffer and its endian quirks. I'd look for a way to do it with memory/FFM

import org.apache.datasketches.tuple.Util;

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
Copy link
Contributor

@leerho leerho Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not recommend the use of ByteBuffer or any of its subclasses. Since the main branch is targeted for Java 25, I recommend switching to use FFM MemorySegments. Here are my reasons:

  • ByteBuffer off-heap allocations cannot be interactively (dynamically) closed, they are managed by the GC. Thus out-of-memory scenarios can happen in large systems with millions of sketches because the GC can't keep up with new off-heap allocations.
  • The default endianness of BB is Big-Endian! Even if you intentionally set it to Little-Endian, it will silently switch back to BE after duplicate(), clear(), or slice(), operations. To make matters worse, this silent behavior of BB is not documented in the official Javadocs!
  • Managing endianness with BB requires a great deal of care and it is very easy to forget to constantly switch the endianness back to LE. Thus, I consider its use here in this class as dangerous since there is no explicit settings of endianness, (whether to LE or BE!).
  • Endianness is explicitly set with FFM (and with the older datasketches.memory classes) with no silent switching of endianness, thus much safer.
  • Our library is completely LE with one exception in t-Digest for binary compatibility with externally produced t-Digest sketches. Since the introduction of CountMin is new for the Java library it should use LE ordering, since, I assume, there is no history of CountMin sketches being stored.
  • Our Theta, Tuple and CPC sketches must be LE due to their compression algorithms, which rely on LE byte ordering.

@freakyzoidberg
Copy link
Member Author

Very fair - Updating CMS to remove ByteBuffer family altogether follwing the same pattern as the FFM current work effort -

Thanks - sorry should have caught that earlier..

@leerho
Copy link
Contributor

leerho commented Jul 24, 2025

@freakyzoidberg, Thank YOU for your interest and participation!!

@freakyzoidberg
Copy link
Member Author

freakyzoidberg commented Jul 24, 2025

There it should be more aligned - now I dont think those are forcing the endian to LE - rather being nativeOrder/platform dependant.
see https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/foreign/ValueLayout.html#JAVA_LONG_UNALIGNED

If this is really an issue - that should be addresse widely through all the FFM assumed to be LE

That PR actually fix the build of the main branch

cc @leerho

final int NULL_32 = 0;
wseg.set(JAVA_INT_UNALIGNED_BIG_ENDIAN, offset, NULL_32);
wseg.set(JAVA_INT_UNALIGNED, offset, NULL_32);
offset += 4;
Copy link
Contributor

@leerho leerho Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and following) is a use-case where the new datasketches.common.positional.PositionalSegment could be used. You might want to look at it. I used the term "positional" instead of "buffer", because "buffer" is a widely used generic term and doesn't convey what is actually going on. The PS would eliminate all of the offset tracking that you had to do by hand (with magic numbers).

I had already migrated theta and tuple, but when I got to bloomfilter, I decided there was a real need for it -- so I paused and created it. It is already used in bloomfilter, tdigest, kll, and req. I intend to go back and use it in the rest of the library where it makes sense. And here it makes perfect sense, but this suggestion is optional.

Nonetheless, if you decide not to use PS, I would recommend not using magic numbers for the offset adjustments and use the build-in static constants such as Integer.BYTES, Long.BYTES, etc. The code will become more obvious as to what you are doing and more robust. :-)

Again, thank you for your work on this!

Copy link
Contributor

@leerho leerho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments,

  1. Potential use of PositionalSegment
  2. Replace magic numbers with built-in constants
  3. Why are a bunch of variables created at the end of deserialization if they are not used? There is an opportunity to use these variables to check that they are the correct values -- thus providing confidence that the MemorySegment or byte array image is correct and not garbage.
  4. Also look for:
  • Missing finals
  • Unnecessary casts
  • When wrapping a string to the next line, the "+" belongs on the next line.
  • Etc.

@freakyzoidberg freakyzoidberg changed the title Fix build + minor improvement for validation and allocation following in CMS and CPCxLang Test CMS and CPCxLangTest move to FFM, Build fix Jul 25, 2025
@freakyzoidberg freakyzoidberg requested review from Copilot and leerho July 26, 2025 13:22
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes build issues caused by concurrent changes and introduces FFM (Foreign Function & Memory API) migration for the CountMinSketch implementation. The changes primarily focus on replacing ByteBuffer usage with MemorySegment for improved performance and memory efficiency.

  • Replaced Memory.wrap() with MemorySegment.ofArray() in CPC cross-language tests
  • Migrated CountMinSketch from ByteBuffer to MemorySegment for serialization/deserialization
  • Added comprehensive validation and improved error messages for sketch parameters

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java Updates CPC test to use MemorySegment instead of Memory for FFM compatibility
src/test/java/org/apache/datasketches/count/CountMinSketchTest.java Simplifies test serialization by using new toByteArray() method
src/main/java/org/apache/datasketches/count/CountMinSketch.java Major refactoring to use MemorySegment, adds validation, and introduces thread-local optimization
Comments suppressed due to low confidence (4)

src/main/java/org/apache/datasketches/count/CountMinSketch.java:26

  • The import org.apache.datasketches.common.positional.PositionalSegment may not exist in the current version of the datasketches library. Please verify this class is available in your version or consider using standard MemorySegment operations directly.
import org.apache.datasketches.common.positional.PositionalSegment;

src/main/java/org/apache/datasketches/count/CountMinSketch.java:104

  • The memory calculation in the error message is incorrect. It uses integer division which will often result in 0 GB for realistic sketch sizes. Consider using floating-point division: String.format("%.2f", totalSize * Long.BYTES / (1024.0 * 1024.0 * 1024.0))
          + " = " + totalSize + " elements (~" + String.format("%d", totalSize * Long.BYTES / (1024 * 1024 * 1024)) + " GB). "

src/main/java/org/apache/datasketches/count/CountMinSketch.java:406

  • The PositionalSegment.wrap() method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.
    final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(bytes));

src/main/java/org/apache/datasketches/count/CountMinSketch.java:448

  • The PositionalSegment.wrap() method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.
    final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(b));

@leerho
Copy link
Contributor

leerho commented Jul 28, 2025

Sorry I haven’t gotten to this, I’ve been busy with some “Honey, do … !” Assignments. Will try to review this tomorrow 😁

Copy link
Contributor

@leerho leerho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

@leerho leerho merged commit 9d1f24f into main Jul 29, 2025
@leerho leerho deleted the fix-build-cpc-cms branch July 29, 2025 04:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants