feat(theta): introduce intersection theta set operation by ZENOTME · Pull Request #100 · apache/datasketches-rust

ZENOTME · 2026-02-19T12:47:32Z

This PR:

introduces the ThetaSketchView trait and enhances ThetaHashTable for use in set operations
introduces the intersection set operation for theta sketches

ZENOTME · 2026-02-19T12:54:01Z

cc @tisonkun @PsiACE @notfilippo

Copilot

Pull request overview

This PR refactors the ThetaHashTable and introduces the ThetaSketchView trait to prepare for implementing set operations (union, intersection, difference) on theta sketches. The changes improve the API design by separating hash computation from insertion logic and clarifying the semantics of emptiness.

Changes:

Introduced ThetaSketchView trait providing a unified read-only interface for both mutable ThetaSketch and immutable CompactThetaSketch
Refactored ThetaHashTable to separate hash computation (hash) from insertion with theta screening (try_insert_hash)
Changed is_empty semantics to track logical emptiness (whether updates were attempted) rather than physical emptiness (whether entries exist)
Renamed num_entries to num_retained for clarity
Added new_with_state constructor and from_parts factory method for creating tables/sketches with explicit state

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
datasketches/src/theta/sketch.rs	Introduced ThetaSketchView trait, implemented it for ThetaSketch and CompactThetaSketch, added seed_hash() method, refactored update() to use try_insert(), added from_parts() constructor
datasketches/src/theta/mod.rs	Exported ThetaSketchView trait
datasketches/src/theta/hash_table.rs	Split hash_and_screen into hash() and try_insert_hash(), added is_empty field for logical emptiness, renamed num_entries to num_retained, added new_with_state() constructor, made REBUILD_THRESHOLD pub(crate), updated all tests
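
To make the role of a shared read-only view more concrete, here is a minimal sketch of how an intersection could be written once against such a trait and accept either sketch kind. Everything here is hypothetical: `SketchView`, `MutableToy`, `CompactToy`, and `intersect` are illustrative stand-ins, not the trait or types added by this PR, and the method set of the real `ThetaSketchView` may differ. The sketch assumes the usual theta intersection rule: the result keeps the smaller theta and only the hashes present in both inputs below that theta.

```rust
use std::collections::HashSet;

/// Hypothetical read-only view; the trait in the PR may expose different methods.
pub trait SketchView {
    fn theta(&self) -> u64;
    fn is_empty(&self) -> bool;
    fn hashes(&self) -> Vec<u64>;
}

/// Toy mutable sketch backed by a hash set.
pub struct MutableToy {
    pub theta: u64,
    pub retained: HashSet<u64>,
    pub empty: bool,
}

/// Toy compact sketch backed by a sorted vector.
pub struct CompactToy {
    pub theta: u64,
    pub entries: Vec<u64>,
    pub empty: bool,
}

impl SketchView for MutableToy {
    fn theta(&self) -> u64 {
        self.theta
    }
    fn is_empty(&self) -> bool {
        self.empty
    }
    fn hashes(&self) -> Vec<u64> {
        self.retained.iter().cloned().collect()
    }
}

impl SketchView for CompactToy {
    fn theta(&self) -> u64 {
        self.theta
    }
    fn is_empty(&self) -> bool {
        self.empty
    }
    fn hashes(&self) -> Vec<u64> {
        self.entries.clone()
    }
}

/// Intersection written against the view, so mutable and compact inputs can be mixed.
pub fn intersect(a: &dyn SketchView, b: &dyn SketchView) -> CompactToy {
    // The result can only be as precise as the more heavily sampled input.
    let theta = a.theta().min(b.theta());
    let in_b: HashSet<u64> = b.hashes().into_iter().collect();
    let mut entries: Vec<u64> = a
        .hashes()
        .into_iter()
        .filter(|h| *h < theta && in_b.contains(h))
        .collect();
    entries.sort_unstable();
    // Intersecting with a logically empty set yields a logically empty set.
    CompactToy { theta: theta, entries: entries, empty: a.is_empty() || b.is_empty() }
}
```

Passing `&dyn SketchView` keeps the example short; a generic parameter or an iterator-returning method would avoid the intermediate `Vec` allocations.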

Copilot · 2026-02-20T12:40:31Z

datasketches/src/theta/hash_table.rs

+        // Force screening and verify insertion fails
+        table.theta = 0;
+        assert!(!table.try_insert("screened"));
+        assert_eq!(table.num_retained(), 1);


After setting theta to 0, the table should still be considered non-empty because an insertion of "screened" was attempted (even though it was screened out). However, this test doesn't verify that is_empty remains false after the failed insertion. Consider adding an assertion like assert!(!table.is_empty()); after line 424 to verify the semantic distinction between logical emptiness and having zero retained entries.

Suggested change

-        assert_eq!(table.num_retained(), 1);
+        assert_eq!(table.num_retained(), 1);
+        assert!(!table.is_empty());

Copilot · 2026-02-20T12:40:32Z

datasketches/src/theta/sketch.rs

-            ordered,
-            empty,
-        }
+        CompactThetaSketch::from_parts(entries, theta, self.table.seed_hash(), ordered, empty)


The refactoring to use from_parts is good, but note that the empty parameter passed to from_parts is computed from entries.is_empty() (line 239), which checks if there are retained hashes. However, with the new is_empty semantics in ThetaHashTable that tracks logical emptiness (whether any updates were attempted), there's now a potential inconsistency: if a ThetaSketch has been updated but all values were screened out, self.table.is_empty() would return false, but the CompactThetaSketch created here would have empty = true. This may be intentional for CompactThetaSketch, but it creates a semantic difference between sketch.is_empty() and sketch.compact(false).is_empty().
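
The mismatch described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `DemoSketch`, `DemoCompact`, `compact_from_entries`, and `compact_from_flag` are hypothetical names, not the crate's API. The first compaction derives `empty` from the retained entries, as the comment describes; the second carries the logical flag over instead.

```rust
/// Hypothetical toy sketch; not the crate's ThetaSketch.
pub struct DemoSketch {
    theta: u64,
    entries: Vec<u64>,
    /// Logical emptiness: true until the first update attempt.
    empty: bool,
}

/// Hypothetical toy compact form; not the crate's CompactThetaSketch.
pub struct DemoCompact {
    pub entries: Vec<u64>,
    pub theta: u64,
    pub empty: bool,
}

impl DemoSketch {
    pub fn new(theta: u64) -> Self {
        DemoSketch { theta: theta, entries: Vec::new(), empty: true }
    }

    /// Any update attempt clears the logical-empty flag, even if the hash is screened out.
    pub fn update_hash(&mut self, hash: u64) {
        self.empty = false;
        if hash < self.theta && !self.entries.contains(&hash) {
            self.entries.push(hash);
        }
    }

    pub fn is_empty(&self) -> bool {
        self.empty
    }

    /// Derives `empty` from the retained entries (the behavior the comment questions).
    pub fn compact_from_entries(&self) -> DemoCompact {
        DemoCompact {
            entries: self.entries.clone(),
            theta: self.theta,
            empty: self.entries.is_empty(),
        }
    }

    /// Alternative: propagate the logical flag so both forms agree.
    pub fn compact_from_flag(&self) -> DemoCompact {
        DemoCompact {
            entries: self.entries.clone(),
            theta: self.theta,
            empty: self.empty,
        }
    }
}
```

With theta set to 0 and a single screened update, `is_empty()` is false while `compact_from_entries().empty` is true; `compact_from_flag()` keeps the two in agreement.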

Copilot · 2026-02-20T12:40:32Z

datasketches/src/theta/hash_table.rs

-    ///
-    /// Returns the hash value if it passes the theta threshold, otherwise 0.
-    pub fn hash_and_screen<T: Hash>(&mut self, value: T) -> u64 {
+    /// Hash a value with the table seed and return the hash.


The documentation for the hash method should clarify that it only computes the hash value without applying theta screening. This is an important distinction from the previous hash_and_screen method. Consider adding a note like: "Note: This method only computes the hash and does not apply theta screening. Use try_insert or try_insert_hash for insertion with screening."

Suggested change

-    /// Hash a value with the table seed and return the hash.
+    /// Hash a value with the table seed and return the hash.
+    ///
+    /// Note: This method only computes the hash and does not apply theta screening.
+    /// For insertion with theta screening, use [`try_insert`] or [`try_insert_hash`].
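
For readers following the split between hashing and screened insertion, a minimal self-contained sketch may help. It is not the crate's `ThetaHashTable`: `ToyThetaTable` is a hypothetical stand-in, it uses the standard library's `DefaultHasher` rather than whatever hash function the crate uses, and it assumes that any insertion attempt clears the logical-empty flag.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical toy table; illustrates the hash / try_insert_hash split only.
pub struct ToyThetaTable {
    seed: u64,
    /// Hashes at or above `theta` are screened out.
    theta: u64,
    retained: HashSet<u64>,
    /// Logical emptiness: true until the first update attempt.
    empty: bool,
}

impl ToyThetaTable {
    pub fn new(seed: u64, theta: u64) -> Self {
        ToyThetaTable { seed: seed, theta: theta, retained: HashSet::new(), empty: true }
    }

    /// Hashing only: no screening and no mutation of the table.
    pub fn hash<T: Hash>(&self, value: T) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.seed.hash(&mut hasher);
        value.hash(&mut hasher);
        hasher.finish()
    }

    /// Insertion with theta screening; returns true only if the hash is newly retained.
    pub fn try_insert_hash(&mut self, hash: u64) -> bool {
        // Assumption: an attempt counts as an update even when it is screened out.
        self.empty = false;
        if hash >= self.theta {
            return false;
        }
        self.retained.insert(hash)
    }

    /// Convenience wrapper: hash first, then screen and insert.
    pub fn try_insert<T: Hash>(&mut self, value: T) -> bool {
        let h = self.hash(value);
        self.try_insert_hash(h)
    }

    pub fn is_empty(&self) -> bool {
        self.empty
    }

    pub fn num_retained(&self) -> usize {
        self.retained.len()
    }
}
```

Keeping `hash` free of side effects lets a set operation hash or re-screen values against a different theta without touching the table's state.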

Copilot · 2026-02-20T12:40:32Z

datasketches/src/theta/hash_table.rs

+    // Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g.
+    // all updates screened by theta).


The comment for the is_empty field could be more precise. Consider rephrasing to: "Logical emptiness of the source set. False if any update has been attempted (even if screened by theta), true if no updates have been attempted. This can be false even when num_retained is 0." This makes it clearer that it tracks update attempts, not just screening.

Suggested change

-    // Logical emptiness of the source set. This can be false even when `num_retained` is 0 (e.g.
-    // all updates screened by theta).
+    // Logical emptiness of the source set. False if any update has been attempted (even if
+    // screened by theta), true if no updates have been attempted. This can be false even when
+    // `num_retained` is 0.

tisonkun

I suggest you directly implement all the requirements (set ops).

These abstractions look intermediate - I can't judge whether they are correct in isolation.

ZENOTME · 2026-02-22T03:20:42Z

datasketches/src/theta/hash_table.rs

+    /// # Panics
+    ///
+    /// Panics if `lg_cur_size > lg_nom_size + 1`. (`lg_nom_size + 1 == lg_max_size`)
+    pub fn from_raw_parts(


This function is mainly used by intersection, and the current implementation mirrors the C++ one. A builder pattern might be clearer, e.g.:

// let hash_table = ThetaHashTable::builder(lg_nom_size, resize_factor, hash_seed)
//     .with_sampling_probability(...) // infer theta from sampling_probability
//     .with_log_cur_size(...); // set log_cur_size manually (must be valid); if not set, it is inferred from log_nom_size

cc @tisonkun
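
The builder idea floated above could look roughly like the following. This is a hedged sketch, not a proposal for the actual API: `TableBuilder`, `TableConfig`, and `build` are hypothetical; `MAX_THETA` is assumed to be `i64::MAX`; the default of `lg_nom_size + 1` for an unset current size is a placeholder for whatever inference the real code would use; and validation returns a `Result` here instead of panicking.

```rust
/// Largest theta value, assuming the i64::MAX convention.
pub const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical validated configuration produced by the builder.
#[derive(Debug)]
pub struct TableConfig {
    pub lg_nom_size: u8,
    pub lg_cur_size: u8,
    pub resize_factor: u8,
    pub hash_seed: u64,
    pub theta: u64,
}

/// Hypothetical builder mirroring the method names in the comment above.
pub struct TableBuilder {
    lg_nom_size: u8,
    resize_factor: u8,
    hash_seed: u64,
    sampling_probability: f64,
    lg_cur_size: Option<u8>,
}

impl TableBuilder {
    pub fn new(lg_nom_size: u8, resize_factor: u8, hash_seed: u64) -> Self {
        TableBuilder {
            lg_nom_size: lg_nom_size,
            resize_factor: resize_factor,
            hash_seed: hash_seed,
            sampling_probability: 1.0,
            lg_cur_size: None,
        }
    }

    /// Theta is inferred from this probability in `build`.
    pub fn with_sampling_probability(mut self, p: f64) -> Self {
        self.sampling_probability = p;
        self
    }

    /// Overrides the current table size; validated in `build`.
    pub fn with_log_cur_size(mut self, lg: u8) -> Self {
        self.lg_cur_size = Some(lg);
        self
    }

    pub fn build(self) -> Result<TableConfig, String> {
        let p = self.sampling_probability;
        if !(p > 0.0 && p <= 1.0) {
            return Err(format!("sampling probability {} is not in (0, 1]", p));
        }
        let lg_max_size = self.lg_nom_size.saturating_add(1);
        // Placeholder default: start at the maximum size when none is given.
        let lg_cur_size = self.lg_cur_size.unwrap_or(lg_max_size);
        if lg_cur_size > lg_max_size {
            return Err(format!("lg_cur_size {} exceeds lg_max_size {}", lg_cur_size, lg_max_size));
        }
        // Special-case p == 1.0 so float rounding cannot push theta past MAX_THETA.
        let theta = if p >= 1.0 { MAX_THETA } else { (MAX_THETA as f64 * p) as u64 };
        Ok(TableConfig {
            lg_nom_size: self.lg_nom_size,
            lg_cur_size: lg_cur_size,
            resize_factor: self.resize_factor,
            hash_seed: self.hash_seed,
            theta: theta,
        })
    }
}
```

One upside of this shape is that the `lg_cur_size > lg_nom_size + 1` check lives in a single place and callers that do not care about the current size never have to pass it.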

ZENOTME · 2026-02-22T03:21:11Z

> I suggest you directly implement all the requirements (set ops).
>
> These abstractions look intermediate - I can't judge whether they are correct in isolation.

I have pushed a commit for the intersection op.

ZENOTME force-pushed the refine_theta_sketch branch 3 times, most recently from bb8b9fd to 9f17971, on February 19, 2026 12:53

refactor(theta): introduce ThetaSketchView trait and enhance ThetaHashTable used for set operation

054107c

ZENOTME force-pushed the refine_theta_sketch branch from 9f17971 to 054107c on February 19, 2026 12:54

tisonkun reviewed Feb 20, 2026

datasketches/src/theta/hash_table.rs (resolved)

ZENOTME and others added 2 commits February 20, 2026 20:25

refine code: remove unused code

38440dc

Merge branch 'main' into refine_theta_sketch

933dcf7

tisonkun requested review from Copilot and tisonkun February 20, 2026 12:35

Copilot started reviewing on behalf of tisonkun February 20, 2026 12:35

Copilot AI reviewed Feb 20, 2026

tisonkun reviewed Feb 21, 2026

feat(theta): support theta intersection set operation

d7c81fd

ZENOTME changed the title ~~refactor(theta): introduce ThetaSketchView trait and enhance ThetaHashTable used for set operation~~ feat(theta): introduce intersection theta set operation Feb 22, 2026

ZENOTME commented Feb 22, 2026

fix lint

c3098c6

ZENOTME wants to merge 5 commits into apache:main from ZENOTME:refine_theta_sketch

3 participants
