-
Notifications
You must be signed in to change notification settings - Fork 216
CMS and CPCxLangTest move to FFM, Build fix #676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
remove memory import
|
Not thrilled with the use of bytebuffer and its endian quirks. I'd look for a way to do it with memory/FFM |
| import org.apache.datasketches.tuple.Util; | ||
|
|
||
| import java.io.ByteArrayOutputStream; | ||
| import java.nio.ByteBuffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not recommend the use of ByteBuffer or any of its subclasses. Since the main branch is targeted for Java 25, I recommend switching to use FFM MemorySegments. Here are my reasons:
- ByteBuffer off-heap allocations cannot be interactively (dynamically) closed, they are managed by the GC. Thus out-of-memory scenarios can happen in large systems with millions of sketches because the GC can't keep up with new off-heap allocations.
- The default endianness of BB is Big-Endian! Even if you intentionally set it to Little-Endian, it will silently switch back to BE after duplicate(), clear(), or slice(), operations. To make matters worse, this silent behavior of BB is not documented in the official Javadocs!
- Managing endianness with BB requires a great deal of care and it is very easy to forget to constantly switch the endianness back to LE. Thus, I consider its use here in this class as dangerous since there is no explicit settings of endianness, (whether to LE or BE!).
- Endianness is explicitly set with FFM (and with the older datasketches.memory classes) with no silent switching of endianness, thus much safer.
- Our library is completely LE with one exception in t-Digest for binary compatibility with externally produced t-Digest sketches. Since the introduction of CountMin is new for the Java library it should use LE ordering, since, I assume, there is no history of CountMin sketches being stored.
- Our Theta, Tuple and CPC sketches must be LE due to their compression algorithms, which rely on LE byte ordering.
src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java
Show resolved
Hide resolved
|
Very fair - Updating CMS to remove ByteBuffer family altogether follwing the same pattern as the FFM current work effort - Thanks - sorry should have caught that earlier.. |
|
@freakyzoidberg, Thank YOU for your interest and participation!! |
|
There it should be more aligned - now I dont think those are forcing the endian to LE - rather being nativeOrder/platform dependant. If this is really an issue - that should be addresse widely through all the FFM assumed to be LE That PR actually fix the build of the main branch cc @leerho |
| final int NULL_32 = 0; | ||
| wseg.set(JAVA_INT_UNALIGNED_BIG_ENDIAN, offset, NULL_32); | ||
| wseg.set(JAVA_INT_UNALIGNED, offset, NULL_32); | ||
| offset += 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and following) is a use-case where the new datasketches.common.positional.PositionalSegment could be used. You might want to look at it. I used the term "positional" instead of "buffer", because "buffer" is a widely used generic term and doesn't convey what is actually going on. The PS would eliminate all of the offset tracking that you had to do by hand (with magic numbers).
I had already migrated theta and tuple, but when I got to bloomfilter, I decided there was a real need for it -- so I paused and created it. It is already used in bloomfilter, tdigest, kll, and req. I intend to go back and use it in the rest of the library where it makes sense. And here it makes perfect sense, but this suggestion is optional.
Nonetheless, if you decide not to use PS, I would recommend not using magic numbers for the offset adjustments and use the build-in static constants such as Integer.BYTES, Long.BYTES, etc. The code will become more obvious as to what you are doing and more robust. :-)
Again, thank you for your work on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments,
- Potential use of PositionalSegment
- Replace magic numbers with built-in constants
- Why are a bunch of variables created at the end of deserialization if they are not used? There is an opportunity to use these variables to check that they are the correct values -- thus providing confidence that the MemorySegment or byte array image is correct and not garbage.
- Also look for:
- Missing finals
- Unnecessary casts
- When wrapping a string to the next line, the "+" belongs on the next line.
- Etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes build issues caused by concurrent changes and introduces FFM (Foreign Function & Memory API) migration for the CountMinSketch implementation. The changes primarily focus on replacing ByteBuffer usage with MemorySegment for improved performance and memory efficiency.
- Replaced
Memory.wrap()withMemorySegment.ofArray()in CPC cross-language tests - Migrated CountMinSketch from ByteBuffer to MemorySegment for serialization/deserialization
- Added comprehensive validation and improved error messages for sketch parameters
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java | Updates CPC test to use MemorySegment instead of Memory for FFM compatibility |
| src/test/java/org/apache/datasketches/count/CountMinSketchTest.java | Simplifies test serialization by using new toByteArray() method |
| src/main/java/org/apache/datasketches/count/CountMinSketch.java | Major refactoring to use MemorySegment, adds validation, and introduces thread-local optimization |
Comments suppressed due to low confidence (4)
src/main/java/org/apache/datasketches/count/CountMinSketch.java:26
- The import
org.apache.datasketches.common.positional.PositionalSegmentmay not exist in the current version of the datasketches library. Please verify this class is available in your version or consider using standard MemorySegment operations directly.
import org.apache.datasketches.common.positional.PositionalSegment;
src/main/java/org/apache/datasketches/count/CountMinSketch.java:104
- The memory calculation in the error message is incorrect. It uses integer division which will often result in 0 GB for realistic sketch sizes. Consider using floating-point division:
String.format("%.2f", totalSize * Long.BYTES / (1024.0 * 1024.0 * 1024.0))
+ " = " + totalSize + " elements (~" + String.format("%d", totalSize * Long.BYTES / (1024 * 1024 * 1024)) + " GB). "
src/main/java/org/apache/datasketches/count/CountMinSketch.java:406
- The
PositionalSegment.wrap()method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.
final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(bytes));
src/main/java/org/apache/datasketches/count/CountMinSketch.java:448
- The
PositionalSegment.wrap()method may not exist in the current datasketches version. If PositionalSegment is not available, consider using MemorySegment directly with appropriate offset tracking.
final PositionalSegment posSeg = PositionalSegment.wrap(MemorySegment.ofArray(b));
|
Sorry I haven’t gotten to this, I’ve been busy with some “Honey, do … !” Assignments. Will try to review this tomorrow 😁 |
leerho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work!
Fix build following conflict introduced by concurrent change in #667 and #666
This pull request introduces uniformisation with the rest of the library to the
CountMinSketchimplementation, including improved serialization, memory efficiency, and validation checks. It also updates test cases to reflect these changes. The most important changes are grouped below by theme:Core Improvements to
CountMinSketch:MemorySegmentfor efficient long-to-byte conversions, reducing allocations in hot paths (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].numHashesandnumBucketsparameters, including checks for overflow and better error messages (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL60-R135).ByteBufferusage withMemorySegmentfor serialization/deserialization, improving performance and simplifying code (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL387-R492).toByteArraymethod for serialization and refactored the deserialization logic for better maintainability (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2].Code Simplifications and Optimizations:
getHashes,update, andgetEstimateto useMemorySegmentfor byte array conversions and improved readability with consistent use offinalvariables (src/main/java/org/apache/datasketches/count/CountMinSketch.java, [1] [2] [3] [4].getEstimateto avoid redundant processing (src/main/java/org/apache/datasketches/count/CountMinSketch.java, src/main/java/org/apache/datasketches/count/CountMinSketch.javaL242-R291).Test Case Updates:
toByteArraymethod for serialization and adjusted assertions for compatibility with the refactored deserialization logic (src/test/java/org/apache/datasketches/count/CountMinSketchTest.java, [1] [2].CpcSketchCrossLanguageTestto useMemorySegmentfor consistency with the updated serialization approach (src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.java, src/test/java/org/apache/datasketches/cpc/CpcSketchCrossLanguageTest.javaL92-R92).