Skip to content

Conversation

@PsiACE
Copy link
Member

@PsiACE PsiACE commented Jan 6, 2026

Density sketch implementation for density estimation from streaming data.

Ref https://github.com/apache/datasketches-cpp/tree/master/density

@tisonkun tisonkun changed the title feat: imple densitysketch feat: impl densitysketch Jan 7, 2026
Signed-off-by: Chojan Shang <psiace@apache.org>

/// Density sketch for streaming density estimation.
pub struct DensitySketch<T: DensityValue> {
kernel: Box<dyn DensityKernel<T>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the any reasons not to keep the kernel as a generic argument of type K: DensityKernel<T>?

Copy link
Member

@tisonkun tisonkun Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be pub struct DensitySketch<T: DensityValue, K: DensityKernel<T> = GaussianKernel> { .. } but that would have some other trade-offs.

}
}

fn check_k(k: u16) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this directly where it's called. I think having asserts inline really helps with the readability when people are exploring code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 ditto other checks below that are only used once.

}

thread_local! {
static RNG_STATE: Cell<u64> = Cell::new(seed_rng());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love not to have thread local variables declared statically in the library. Is this really something we cannot store in another struct and pass it to the sketch when needed, delegating lifecycle decisions to the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems C++ uses thread locals while Java relies on the java.util.Random.

Unfortunately, Rust doesn't have a std random support so we have to find a reasonable way here. At least the random common should live under datasketches::random_common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jmalkin @leerho

Both for random usage and more background of density sketch, which seems a new sketch I'm not yet familiar with.

@tisonkun
Copy link
Member

Seems the paper reference is this one: https://arxiv.org/abs/2102.12301?

use crate::error::ErrorKind;

/// Floating point types supported by the density sketch.
pub trait DensityValue: Copy + PartialOrd + 'static {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be sealed and thus has a size method to return the mem size.

Comment on lines +198 to +213
/// Deserializes a sketch from a reader using the Gaussian kernel.
pub fn deserialize_from_reader(reader: &mut dyn Read) -> Result<Self, Error> {
Self::deserialize_from_reader_with_kernel(reader, Box::new(GaussianKernel))
}

/// Deserializes a sketch from a reader using the provided kernel.
pub fn deserialize_from_reader_with_kernel(
reader: &mut dyn Read,
kernel: Box<dyn DensityKernel<T>>,
) -> Result<Self, Error> {
let mut buf = Vec::new();
reader
.read_to_end(&mut buf)
.map_err(|err| Error::deserial(format!("error reading stream: {err}")))?;
Self::deserialize_with_kernel(&buf, kernel)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you consider this is good to have?

Generally, supporting std::io::Read is for streaming reading. If, as here, we just read_to_end and deserialize with the full bytes, I'd prefer to leave this work to the user and keep our logics less.

Comment on lines +63 to +66
pub trait DensityKernel<T: DensityValue> {
/// Returns the kernel evaluation for the two points.
fn evaluate(&self, left: &[T], right: &[T]) -> T;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub trait DensityKernel<T: DensityValue> {
/// Returns the kernel evaluation for the two points.
fn evaluate(&self, left: &[T], right: &[T]) -> T;
}
pub trait DensityKernel {
/// Returns the kernel evaluation for the two points.
fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T;
}

The trait bound doesn't seem to be related to DensityValue but it should accept any DensityValue.

Copy link
Member

@tisonkun tisonkun Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. SphericalKernel works only for f32.

Nope. SphericalKernel is only in testing scope and I don't think it's reasonable. Why do we need this "kernel" to be configurable if our out-of-the-box "kernel" has only GaussianKernel?

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several comments above.

Also I need some time to understand what densitysketch is and how it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants