Skip to content

Conversation

@PsiACE
Copy link
Member

@PsiACE PsiACE commented Jan 14, 2026

Added halve and decay to CountMinSketch, matching cmsketch behavior for non‑negative counters.

This scales counters (and total_weight) to support exponential decay while keeping serialization unchanged. I’m not fully confident about the math details, especially around our error bounds, but the implementation itself should be straightforward. 😃

Related #70

@jmalkin
Copy link

jmalkin commented Jan 14, 2026

If the impact on error bounds is not known we need to have a large warning about this and to set a flag that refuses to return error bounds after these operations.

Lying by giving an answer when we do not have math to support it is very bad.

@PsiACE PsiACE marked this pull request as draft January 14, 2026 16:21
@c-dickens
Copy link

Think this goes through by the linear property of count min but am on phone so haven't looked at the code yet.

@tisonkun
Copy link
Member

Think this goes through by the linear property of count min but am on phone so haven't looked at the code yet.

Agree that this should be provable. Perhaps https://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf can be helpful.

@c-dickens
Copy link

To answer the question directly: this example of scaling up or down by a nonnegative value is acceptable. However, there is a knock-on effect on the error bound which needs testing. Basically, for an underlying frequency value f_i of item i the error bound is true(f_i) <= est(f_i) <= true(f_i) + error where error depends on the sketch size and the total mass in the stream. So if you are scale the inputs to the sketch (that make up each "true" f_i), then you also need to scale the total_weight - this is implemented in the PR but the altered error bounds are not tested.

In future, we do need a little care because there is some subtlety around the behaviour depending on the input stream. If only positive integers are used for addition and multiplication and the underlying stream only has positive frequencies, then things are ok. The nuance comes in around negative values in the underlying frequency vector, in which case we might prefer a CountSketch -- something useful that we haven't got round to yet!

There are some good notes here and here that make the point clear. The "standard" count min point query estimation is only valid on the assumption that all entries in the underlying frequency vector are nonnegative. Happy to give more details if need be.

In reference to @tisonkun's comment, there are a bunch of windowing/merging/aggregate approaches that might be interested. If so please add to the discussion - I am very keen to learn if you have specific applications in mind, or whether they (and other sketches eg. CountSketch) would be generally useful.

@PsiACE PsiACE marked this pull request as ready for review January 15, 2026 16:36
@PsiACE
Copy link
Member Author

PsiACE commented Jan 15, 2026

Marking this as ready for review. I added a simple test to make sure the scaling behavior basically works (single key, no collisions). We can use that as a starting point

@leerho
Copy link
Member

leerho commented Jan 16, 2026

@c-dickens,
You might check with Justin (@jthaler) as he knows Rust very well and has the math chops to verify our assumptions. Perhaps he or you could then put your approval on this PR.

Comment on lines +210 to +213
/// let mut sketch = CountMinSketch::new(4, 128);
/// sketch.update_with_weight("apple", 3);
/// sketch.decay(0.5);
/// assert!(sketch.estimate("apple") >= 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a more practical example? Ditto halve

@MrCroxx how can halve and decay practically be helpful?

Co-authored-by: tison <wander4096@gmail.com>
@c-dickens
Copy link

@leerho - I have enquired with Justin to check my understanding. For others, my feeling is that halving (or scaling by some factor in (0, 1] ) is an acceptable operation. However, the sketch array has signed values, so if sketches were subtracted from one another and a negative value became present, then this would violate the error bounds. The standard (and well-known) error bounds for count min frequency estimation have a key assumption that all entries in the underlying frequency vector are non-negative so a negative value in sketch(A) - sketch(B) would contradict this assumption and thus the error bounds would not be valid.

Although the scaling itself is not a problem, it would be good to learn more about the intended application to decide the best path forward.

@tisonkun
Copy link
Member

tisonkun commented Jan 21, 2026

the sketch array has signed values

@PsiACE @MrCroxx I'm considering to generify CountMinSketch to accept T: IntNumber and only implement halve/decay for unsigned versions.

This should be the same as cmsketch, which implements only for unsigned values.

cc @c-dickens does it make sense to have this generic design?

ref - https://github.com/apache/datasketches-cpp/milestone/1

@MrCroxx
Copy link

MrCroxx commented Jan 21, 2026

the sketch array has signed values

@PsiACE @MrCroxx I'm considering to generify CountMinSketch to accept T: IntNumber and only implement halve/decay for unsigned versions.

This should be the same as cmsketch, which implements only for unsigned values.

cc @c-dickens does it make sense to have this generic design?

ref - apache/datasketches-cpp/milestone/1

Sounds good to my use case.

@c-dickens
Copy link

c-dickens commented Jan 21, 2026

I would support an unsigned version. Think we need this in cpp too.
This is in cpp but maybe more control is needed over what a user can do with it, as your comment suggests @tisonkun

@PsiACE
Copy link
Member Author

PsiACE commented Jan 21, 2026

I'm considering to generify CountMinSketch to accept T: IntNumber and only implement halve/decay for unsigned versions.

This should be the same as cmsketch, which implements only for unsigned values.

I agree. I'll get back to the code later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants