-
Notifications
You must be signed in to change notification settings - Fork 11
feat: impl densitysketch #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Chojan Shang <psiace@apache.org>
|
|
||
| /// Density sketch for streaming density estimation. | ||
| pub struct DensitySketch<T: DensityValue> { | ||
| kernel: Box<dyn DensityKernel<T>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the any reasons not to keep the kernel as a generic argument of type K: DensityKernel<T>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be pub struct DensitySketch<T: DensityValue, K: DensityKernel<T> = GaussianKernel> { .. } but that would have some other trade-offs.
| } | ||
| } | ||
|
|
||
| fn check_k(k: u16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this directly where it's called. I think having asserts inline really helps with the readability when people are exploring code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 ditto other checks below that are only used once.
| } | ||
|
|
||
| thread_local! { | ||
| static RNG_STATE: Cell<u64> = Cell::new(seed_rng()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love not to have thread local variables declared statically in the library. Is this really something we cannot store in another struct and pass it to the sketch when needed, delegating lifecycle decisions to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems C++ uses thread locals while Java relies on the java.util.Random.
Unfortunately, Rust doesn't have a std random support so we have to find a reasonable way here. At least the random common should live under datasketches::random_common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Seems the paper reference is this one: https://arxiv.org/abs/2102.12301? |
| use crate::error::ErrorKind; | ||
|
|
||
| /// Floating point types supported by the density sketch. | ||
| pub trait DensityValue: Copy + PartialOrd + 'static { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be sealed and thus has a size method to return the mem size.
| /// Deserializes a sketch from a reader using the Gaussian kernel. | ||
| pub fn deserialize_from_reader(reader: &mut dyn Read) -> Result<Self, Error> { | ||
| Self::deserialize_from_reader_with_kernel(reader, Box::new(GaussianKernel)) | ||
| } | ||
|
|
||
| /// Deserializes a sketch from a reader using the provided kernel. | ||
| pub fn deserialize_from_reader_with_kernel( | ||
| reader: &mut dyn Read, | ||
| kernel: Box<dyn DensityKernel<T>>, | ||
| ) -> Result<Self, Error> { | ||
| let mut buf = Vec::new(); | ||
| reader | ||
| .read_to_end(&mut buf) | ||
| .map_err(|err| Error::deserial(format!("error reading stream: {err}")))?; | ||
| Self::deserialize_with_kernel(&buf, kernel) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you consider this is good to have?
Generally, supporting std::io::Read is for streaming reading. If, as here, we just read_to_end and deserialize with the full bytes, I'd prefer to leave this work to the user and keep our logics less.
| pub trait DensityKernel<T: DensityValue> { | ||
| /// Returns the kernel evaluation for the two points. | ||
| fn evaluate(&self, left: &[T], right: &[T]) -> T; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub trait DensityKernel<T: DensityValue> { | |
| /// Returns the kernel evaluation for the two points. | |
| fn evaluate(&self, left: &[T], right: &[T]) -> T; | |
| } | |
| pub trait DensityKernel { | |
| /// Returns the kernel evaluation for the two points. | |
| fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T; | |
| } |
The trait bound doesn't seem to be related to DensityValue but it should accept any DensityValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. SphericalKernel works only for f32.
Nope. SphericalKernel is only in testing scope and I don't think it's reasonable. Why do we need this "kernel" to be configurable if our out-of-the-box "kernel" has only GaussianKernel?
tisonkun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several comments above.
Also I need some time to understand what densitysketch is and how it works.
Density sketch implementation for density estimation from streaming data.
Ref https://github.com/apache/datasketches-cpp/tree/master/density