CMS and CPCxLangTest move to FFM, Build fix #676

freakyzoidberg · 2025-07-24T10:05:48Z

Fix build following conflict introduced by concurrent change in #667 and #666

This pull request introduces uniformisation with the rest of the library to the CountMinSketch implementation, including improved serialization, memory efficiency, and validation checks. It also updates test cases to reflect these changes. The most important changes are grouped below by theme:

Core Improvements to `CountMinSketch`:

Added a thread-local MemorySegment for efficient long-to-byte conversions, reducing allocations in hot paths (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].
Introduced robust validation for numHashes and numBuckets parameters, including checks for overflow and better error messages (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL60-R135).
Replaced ByteBuffer usage with MemorySegment for serialization/deserialization, improving performance and simplifying code (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL387-R492).
Added a new toByteArray method for serialization and refactored the deserialization logic for better maintainability (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].

Code Simplifications and Optimizations:

Refactored methods like getHashes, update, and getEstimate to use MemorySegment for byte array conversions and improved readability with consistent use of final variables (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2] [3] [4].
Simplified loop logic in methods like getEstimate to avoid redundant processing (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL242-R291).

Test Case Updates:

Updated test cases to use the new toByteArray method for serialization and adjusted assertions for compatibility with the refactored deserialization logic (src/test/java/org/apache/datasketches/count/CountMinSketchTest.java, [1] [2].
Modified cross-language test in CpcSketchCrossLanguageTest to use MemorySegment for consistency with the updated serialization approach (src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java, src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.javaL92-R92).

remove memory import

jmalkin · 2025-07-24T16:38:29Z

Not thrilled with the use of bytebuffer and its endian quirks. I'd look for a way to do it with memory/FFM

leerho · 2025-07-24T18:31:37Z

src/main/java/org/apache/datasketches/count/CountMinSketch.java

-import org.apache.datasketches.tuple.Util;

 import java.io.ByteArrayOutputStream;
 import java.nio.ByteBuffer;


I do not recommend the use of ByteBuffer or any of its subclasses. Since the main branch is targeted for Java 25, I recommend switching to use FFM MemorySegments. Here are my reasons:

ByteBuffer off-heap allocations cannot be interactively (dynamically) closed, they are managed by the GC. Thus out-of-memory scenarios can happen in large systems with millions of sketches because the GC can't keep up with new off-heap allocations.

The default endianness of BB is Big-Endian! Even if you intentionally set it to Little-Endian, it will silently switch back to BE after duplicate(), clear(), or slice(), operations. To make matters worse, this silent behavior of BB is not documented in the official Javadocs!

Managing endianness with BB requires a great deal of care and it is very easy to forget to constantly switch the endianness back to LE. Thus, I consider its use here in this class as dangerous since there is no explicit settings of endianness, (whether to LE or BE!).

Endianness is explicitly set with FFM (and with the older datasketches.memory classes) with no silent switching of endianness, thus much safer.

Our library is completely LE with one exception in t-Digest for binary compatibility with externally produced t-Digest sketches. Since the introduction of CountMin is new for the Java library it should use LE ordering, since, I assume, there is no history of CountMin sketches being stored.

Our Theta, Tuple and CPC sketches must be LE due to their compression algorithms, which rely on LE byte ordering.

src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java

freakyzoidberg · 2025-07-24T19:01:24Z

Very fair - Updating CMS to remove ByteBuffer family altogether follwing the same pattern as the FFM current work effort -

Thanks - sorry should have caught that earlier..

leerho · 2025-07-24T19:13:18Z

@freakyzoidberg, Thank YOU for your interest and participation!!

freakyzoidberg · 2025-07-24T19:20:13Z

There it should be more aligned - now I dont think those are forcing the endian to LE - rather being nativeOrder/platform dependant.
see https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/foreign/ValueLayout.html#JAVA_LONG_UNALIGNED

If this is really an issue - that should be addresse widely through all the FFM assumed to be LE

That PR actually fix the build of the main branch

cc @leerho

leerho · 2025-07-24T22:22:09Z

src/main/java/org/apache/datasketches/count/CountMinSketch.java

    final int NULL_32 = 0;
-    wseg.set(JAVA_INT_UNALIGNED_BIG_ENDIAN, offset, NULL_32);
+    wseg.set(JAVA_INT_UNALIGNED, offset, NULL_32);
    offset += 4;


This (and following) is a use-case where the new datasketches.common.positional.PositionalSegment could be used. You might want to look at it. I used the term "positional" instead of "buffer", because "buffer" is a widely used generic term and doesn't convey what is actually going on. The PS would eliminate all of the offset tracking that you had to do by hand (with magic numbers).

I had already migrated theta and tuple, but when I got to bloomfilter, I decided there was a real need for it -- so I paused and created it. It is already used in bloomfilter, tdigest, kll, and req. I intend to go back and use it in the rest of the library where it makes sense. And here it makes perfect sense, but this suggestion is optional.

Nonetheless, if you decide not to use PS, I would recommend not using magic numbers for the offset adjustments and use the build-in static constants such as Integer.BYTES, Long.BYTES, etc. The code will become more obvious as to what you are doing and more robust. :-)

Again, thank you for your work on this!

leerho

A few comments,

Potential use of PositionalSegment
Replace magic numbers with built-in constants
Why are a bunch of variables created at the end of deserialization if they are not used? There is an opportunity to use these variables to check that they are the correct values -- thus providing confidence that the MemorySegment or byte array image is correct and not garbage.
Also look for:

Missing finals
Unnecessary casts
When wrapping a string to the next line, the "+" belongs on the next line.
Etc.

src/main/java/org/apache/datasketches/count/CountMinSketch.java

Copilot

Pull Request Overview

This PR fixes build issues caused by concurrent changes and introduces FFM (Foreign Function & Memory API) migration for the CountMinSketch implementation. The changes primarily focus on replacing ByteBuffer usage with MemorySegment for improved performance and memory efficiency.

Replaced Memory.wrap() with MemorySegment.ofArray() in CPC cross-language tests
Migrated CountMinSketch from ByteBuffer to MemorySegment for serialization/deserialization
Added comprehensive validation and improved error messages for sketch parameters

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java	Updates CPC test to use MemorySegment instead of Memory for FFM compatibility
src/test/java/org/apache/datasketches/count/CountMinSketchTest.java	Simplifies test serialization by using new toByteArray() method
src/main/java/org/apache/datasketches/count/CountMinSketch.java	Major refactoring to use MemorySegment, adds validation, and introduces thread-local optimization

Comments suppressed due to low confidence (4)

src/main/java/org/apache/datasketches/count/CountMinSketch.java:26

The import org.apache.datasketches.common.positional.PositionalSegment may not exist in the current version of the datasketches library. Please verify this class is available in your version or consider using standard MemorySegment operations directly.

import org.apache.datasketches.common.positional.PositionalSegment;

src/main/java/org/apache/datasketches/count/CountMinSketch.java:104

The memory calculation in the error message is incorrect. It uses integer division which will often result in 0 GB for realistic sketch sizes. Consider using floating-point division: String.format("%.2f", totalSize * Long.BYTES / (1024.0 * 1024.0 * 1024.0))

          + " = " + totalSize + " elements (~" + String.format("%d", totalSize * Long.BYTES / (1024 * 1024 * 1024)) + " GB). "

src/main/java/org/apache/datasketches/count/CountMinSketch.java:406

The PositionalSegment.wrap() method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.

    final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(bytes));

src/main/java/org/apache/datasketches/count/CountMinSketch.java:448

The PositionalSegment.wrap() method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.

    final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(b));

leerho · 2025-07-28T02:09:01Z

Sorry I haven’t gotten to this, I’ve been busy with some “Honey, do … !” Assignments. Will try to review this tomorrow 😁

leerho

Excellent work!

Fix build following cms/cpc recent PR

c091214

freakyzoidberg changed the title ~~Fix build following cms/cpc recent PR~~ Fix build + minor improvement for validation and allocation following cms/cpc recent PR Jul 24, 2025

freakyzoidberg changed the title ~~Fix build + minor improvement for validation and allocation following cms/cpc recent PR~~ Fix build + minor improvement for validation and allocation following in CMS and CPCxLang Test Jul 24, 2025

freakyzoidberg requested a review from leerho July 24, 2025 10:10

Update CpcSketchCrossLanguageTest.java

c6c90c3

remove memory import