Skip to content

feat(theta): introduce intersection theta set operation#100

Open
ZENOTME wants to merge 5 commits intoapache:mainfrom
ZENOTME:refine_theta_sketch
Open

feat(theta): introduce intersection theta set operation#100
ZENOTME wants to merge 5 commits intoapache:mainfrom
ZENOTME:refine_theta_sketch

Conversation

@ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Feb 19, 2026

This PR:

  • introduce ThetaSketchView trait and enhance ThetaHashTable used for set operation
  • introduce intersection theta set operation

@ZENOTME ZENOTME force-pushed the refine_theta_sketch branch 3 times, most recently from bb8b9fd to 9f17971 Compare February 19, 2026 12:53
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 19, 2026

cc @tisonkun @PsiACE @notfilippo

@ZENOTME ZENOTME force-pushed the refine_theta_sketch branch from 9f17971 to 054107c Compare February 19, 2026 12:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the ThetaHashTable and introduces the ThetaSketchView trait to prepare for implementing set operations (union, intersection, difference) on theta sketches. The changes improve the API design by separating hash computation from insertion logic and clarifying the semantics of emptiness.

Changes:

  • Introduced ThetaSketchView trait providing a unified read-only interface for both mutable ThetaSketch and immutable CompactThetaSketch
  • Refactored ThetaHashTable to separate hash computation (hash) from insertion with theta screening (try_insert_hash)
  • Changed is_empty semantics to track logical emptiness (whether updates were attempted) rather than physical emptiness (whether entries exist)
  • Renamed num_entries to num_retained for clarity
  • Added new_with_state constructor and from_parts factory method for creating tables/sketches with explicit state

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
datasketches/src/theta/sketch.rs Introduced ThetaSketchView trait, implemented it for ThetaSketch and CompactThetaSketch, added seed_hash() method, refactored update() to use try_insert(), added from_parts() constructor
datasketches/src/theta/mod.rs Exported ThetaSketchView trait
datasketches/src/theta/hash_table.rs Split hash_and_screen into hash() and try_insert_hash(), added is_empty field for logical emptiness, renamed num_entries to num_retained, added new_with_state() constructor, made REBUILD_THRESHOLD pub(crate), updated all tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Force screening and verify insertion fails
table.theta = 0;
assert!(!table.try_insert("screened"));
assert_eq!(table.num_retained(), 1);
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After setting theta to 0, the table should still be considered non-empty because "screened" was attempted to be inserted (even though it was screened out). However, this test doesn't verify that is_empty remains false after the failed insertion. Consider adding an assertion like assert!(!table.is_empty()); after line 424 to verify the semantic distinction between logical emptiness and having zero retained entries.

Suggested change
assert_eq!(table.num_retained(), 1);
assert_eq!(table.num_retained(), 1);
assert!(!table.is_empty());

Copilot uses AI. Check for mistakes.
ordered,
empty,
}
CompactThetaSketch::from_parts(entries, theta, self.table.seed_hash(), ordered, empty)
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactoring to use from_parts is good, but note that the empty parameter passed to from_parts is computed from entries.is_empty() (line 239), which checks if there are retained hashes. However, with the new is_empty semantics in ThetaHashTable that tracks logical emptiness (whether any updates were attempted), there's now a potential inconsistency: if a ThetaSketch has been updated but all values were screened out, self.table.is_empty() would return false, but the CompactThetaSketch created here would have empty = true. This may be intentional for CompactThetaSketch, but it creates a semantic difference between sketch.is_empty() and sketch.compact(false).is_empty().

Copilot uses AI. Check for mistakes.
///
/// Returns the hash value if it passes the theta threshold, otherwise 0.
pub fn hash_and_screen<T: Hash>(&mut self, value: T) -> u64 {
/// Hash a value with the table seed and return the hash.
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for the hash method should clarify that it only computes the hash value without applying theta screening. This is an important distinction from the previous hash_and_screen method. Consider adding a note like: "Note: This method only computes the hash and does not apply theta screening. Use try_insert or try_insert_hash for insertion with screening."

Suggested change
/// Hash a value with the table seed and return the hash.
/// Hash a value with the table seed and return the hash.
///
/// Note: This method only computes the hash and does not apply theta screening.
/// For insertion with theta screening, use [`try_insert`] or [`try_insert_hash`].

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +65
// Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g.
// all updates screened by theta).
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for the is_empty field could be more precise. Consider rephrasing to: "Logical emptiness of the source set. False if any update has been attempted (even if screened by theta), true if no updates have been attempted. This can be false even when num_retained is 0." This makes it clearer that it tracks update attempts, not just screening.

Suggested change
// Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g.
// all updates screened by theta).
// Logical emptiness of the source set. False if any update has been attempted (even if
// screened by theta), true if no updates have been attempted. This can be false even when
// `num_retained` is 0.

Copilot uses AI. Check for mistakes.
Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you directly implement all the requirements (set ops).

These abstractions look intermediate - I can't judge whether they are correct isolated.

@ZENOTME ZENOTME changed the title refactor(theta): introduce ThetaSketchView trait and enhance ThetaHashTable used for set operation feat(theta): introduce intersection theta set operation Feb 22, 2026
/// # Panics
///
/// Panics if `lg_cur_size > lg_nom_size + 1`. (`lg_nom_size + 1 == lg_max_size`)
pub fn from_raw_parts(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is mainly used in intersection, and the current implementation is the same as C++. Use the builder pattern, maybe clearer. E.g.

// 
let hash_table = 
      ThetaHashTable::builder(lg_nom_size, resize_factor, hash_seed)
            .with_sampling_probability(...) // infer the theta from sampling_probability
            .with_log_cur_size(...); // set log_cur_size manually, ensure log_cur_size is valid. If not set, the log cur size will be inferred from log_nom_size.

cc @tisonkun

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 22, 2026

I suggest you directly implement all the requirements (set ops).

These abstractions look intermediate - I can't judge whether they are correct isolated.

Have push commit for intersection op

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants