feat(theta): introduce intersection theta set operation#100
feat(theta): introduce intersection theta set operation#100ZENOTME wants to merge 5 commits intoapache:mainfrom
Conversation
bb8b9fd to
9f17971
Compare
…hTable used for set operation
9f17971 to
054107c
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the ThetaHashTable and introduces the ThetaSketchView trait to prepare for implementing set operations (union, intersection, difference) on theta sketches. The changes improve the API design by separating hash computation from insertion logic and clarifying the semantics of emptiness.
Changes:
- Introduced
ThetaSketchViewtrait providing a unified read-only interface for both mutableThetaSketchand immutableCompactThetaSketch - Refactored
ThetaHashTableto separate hash computation (hash) from insertion with theta screening (try_insert_hash) - Changed
is_emptysemantics to track logical emptiness (whether updates were attempted) rather than physical emptiness (whether entries exist) - Renamed
num_entriestonum_retainedfor clarity - Added
new_with_stateconstructor andfrom_partsfactory method for creating tables/sketches with explicit state
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| datasketches/src/theta/sketch.rs | Introduced ThetaSketchView trait, implemented it for ThetaSketch and CompactThetaSketch, added seed_hash() method, refactored update() to use try_insert(), added from_parts() constructor |
| datasketches/src/theta/mod.rs | Exported ThetaSketchView trait |
| datasketches/src/theta/hash_table.rs | Split hash_and_screen into hash() and try_insert_hash(), added is_empty field for logical emptiness, renamed num_entries to num_retained, added new_with_state() constructor, made REBUILD_THRESHOLD pub(crate), updated all tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Force screening and verify insertion fails | ||
| table.theta = 0; | ||
| assert!(!table.try_insert("screened")); | ||
| assert_eq!(table.num_retained(), 1); |
There was a problem hiding this comment.
After setting theta to 0, the table should still be considered non-empty because "screened" was attempted to be inserted (even though it was screened out). However, this test doesn't verify that is_empty remains false after the failed insertion. Consider adding an assertion like assert!(!table.is_empty()); after line 424 to verify the semantic distinction between logical emptiness and having zero retained entries.
| assert_eq!(table.num_retained(), 1); | |
| assert_eq!(table.num_retained(), 1); | |
| assert!(!table.is_empty()); |
| ordered, | ||
| empty, | ||
| } | ||
| CompactThetaSketch::from_parts(entries, theta, self.table.seed_hash(), ordered, empty) |
There was a problem hiding this comment.
The refactoring to use from_parts is good, but note that the empty parameter passed to from_parts is computed from entries.is_empty() (line 239), which checks if there are retained hashes. However, with the new is_empty semantics in ThetaHashTable that tracks logical emptiness (whether any updates were attempted), there's now a potential inconsistency: if a ThetaSketch has been updated but all values were screened out, self.table.is_empty() would return false, but the CompactThetaSketch created here would have empty = true. This may be intentional for CompactThetaSketch, but it creates a semantic difference between sketch.is_empty() and sketch.compact(false).is_empty().
| /// | ||
| /// Returns the hash value if it passes the theta threshold, otherwise 0. | ||
| pub fn hash_and_screen<T: Hash>(&mut self, value: T) -> u64 { | ||
| /// Hash a value with the table seed and return the hash. |
There was a problem hiding this comment.
The documentation for the hash method should clarify that it only computes the hash value without applying theta screening. This is an important distinction from the previous hash_and_screen method. Consider adding a note like: "Note: This method only computes the hash and does not apply theta screening. Use try_insert or try_insert_hash for insertion with screening."
| /// Hash a value with the table seed and return the hash. | |
| /// Hash a value with the table seed and return the hash. | |
| /// | |
| /// Note: This method only computes the hash and does not apply theta screening. | |
| /// For insertion with theta screening, use [`try_insert`] or [`try_insert_hash`]. |
| // Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g. | ||
| // all updates screened by theta). |
There was a problem hiding this comment.
The comment for the is_empty field could be more precise. Consider rephrasing to: "Logical emptiness of the source set. False if any update has been attempted (even if screened by theta), true if no updates have been attempted. This can be false even when num_retained is 0." This makes it clearer that it tracks update attempts, not just screening.
| // Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g. | |
| // all updates screened by theta). | |
| // Logical emptiness of the source set. False if any update has been attempted (even if | |
| // screened by theta), true if no updates have been attempted. This can be false even when | |
| // `num_retained` is 0. |
tisonkun
left a comment
There was a problem hiding this comment.
I suggest you directly implement all the requirements (set ops).
These abstractions look intermediate - I can't judge whether they are correct isolated.
| /// # Panics | ||
| /// | ||
| /// Panics if `lg_cur_size > lg_nom_size + 1`. (`lg_nom_size + 1 == lg_max_size`) | ||
| pub fn from_raw_parts( |
There was a problem hiding this comment.
This function is mainly used in intersection, and the current implementation is the same as C++. Use the builder pattern, maybe clearer. E.g.
//
let hash_table =
ThetaHashTable::builder(lg_nom_size, resize_factor, hash_seed)
.with_sampling_probability(...) // infer the theta from sampling_probability
.with_log_cur_size(...); // set log_cur_size manually, ensure log_cur_size is valid. If not set, the log cur size will be inferred from log_nom_size.
cc @tisonkun
Have push commit for intersection op |
This PR: