
Track unique shortids seen #1459

Draft
zealsham wants to merge 1 commit into payjoin:master from zealsham:unique-shortid-metrics

Conversation

Collaborator

@zealsham zealsham commented Mar 31, 2026

This PR addresses #1284. It counts cardinality for shortids over hourly, daily, and weekly time periods.

This is still a work in progress.
The current bottleneck is that the gauges are fixed snapshots, which don't let you query the number of unique IDs seen over an arbitrary time window X. The internal data to answer those queries exists (168 hourly sketches and 90 daily sketches are stored in memory), but it's not exposed yet.
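For readers unfamiliar with the retention scheme described above (168 hourly sketches, 90 daily sketches kept in memory), here is a minimal sketch of how such a bounded in-memory window could be maintained. All names are hypothetical and `Vec<u64>` stands in for a real HLL sketch; this is not the PR's actual code.

```rust
// Hypothetical sketch of bounded in-memory retention: keep the last
// 168 hourly buckets and 90 daily buckets, pruning older entries as
// new observations arrive. Vec<u64> is a placeholder for a sketch.
use std::collections::BTreeMap;

const HOURLY_RETENTION: u64 = 168; // one week of hourly sketches
const DAILY_RETENTION: u64 = 90;   // ~three months of daily sketches

fn prune(buckets: &mut BTreeMap<u64, Vec<u64>>, now_bucket: u64, retention: u64) {
    // BTreeMap keys are bucket indices; drop everything older than the window.
    let cutoff = now_bucket.saturating_sub(retention);
    let kept = buckets.split_off(&cutoff); // keeps keys >= cutoff
    *buckets = kept;
}

fn main() {
    let mut hourly: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for h in 0..200u64 {
        hourly.entry(h).or_default().push(42);
        prune(&mut hourly, h, HOURLY_RETENTION);
    }
    // The retention window plus the current bucket survive.
    assert_eq!(hourly.len() as u64, HOURLY_RETENTION + 1);
    println!("ok");
}
```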

Pull Request Checklist

Please confirm the following before requesting review:

This PR addresses payjoin#1284. It counts cardinality for shortids over
hourly, daily, and weekly time periods.
@zealsham zealsham marked this pull request as draft March 31, 2026 19:28
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 23815515937

Details

  • 130 of 130 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.378%

Totals Coverage Status

  • Change from base Build 23766203897: 0.2%
  • Covered Lines: 10862
  • Relevant Lines: 12873

💛 - Coveralls

let hour = secs / 3600;
let day = secs / 86400;

self.hourly.entry(hour).or_insert_with(new_sketch).insert(&id.0);
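For context on the snippet above: the bucket keys are plain integer division of unix seconds, so every event in the same hour (or day, or week) maps to the same bucket index. A self-contained sketch of that arithmetic, with a hypothetical `bucket_keys` helper not present in the PR:

```rust
// Sketch of the time-bucketing arithmetic above (helper name is
// hypothetical): unix seconds divided by a fixed window width give a
// stable bucket index for that window.
fn bucket_keys(secs: u64) -> (u64, u64, u64) {
    let hour = secs / 3600;    // hourly bucket index
    let day = secs / 86_400;   // daily bucket index
    let week = secs / 604_800; // weekly bucket index
    (hour, day, week)
}

fn main() {
    // Two timestamps 10 minutes apart fall into the same hourly bucket...
    let (h1, _, _) = bucket_keys(1_700_000_000);
    let (h2, _, _) = bucket_keys(1_700_000_600);
    assert_eq!(h1, h2);
    // ...while timestamps an hour apart do not.
    let (h3, _, _) = bucket_keys(1_700_003_600);
    assert_eq!(h3, h1 + 1);
    println!("ok");
}
```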
Contributor


tl;dr: hashing &id before insertion here would eliminate a theoretical leak.

Since we don't know what hashing HyperLogLogPlus does internally, I think there's a chance that, with access to the metrics and a list of candidate short IDs, an adversary may be able to confirm whether a subset of the short IDs was included, if they are lucky (i.e. those are the extremal values selected).

To prevent this kind of leak, the short IDs could be hashed (with a keyed hash, where the key is secret) before insertion into the HLL sketch. The resulting keys would be a 1:1 mapping (well, almost; some hash functions are bijective for u64, but I don't think that's really valuable, since the chance of collisions on non-duplicate entries is basically 0 for the cardinality we expect).

So basically, by hashing before insertion with some keyed hash (or even a keyed cryptographic hash), we can ensure that even the complete HLL sketch, if published, will not leak information about the specific short IDs used during the relevant time period.

The overhead of even a cryptographic hash here should be negligible, so perhaps the safest approach is to compute a salted SHA256 of the short ID or something like that. But I think a non-cryptographic keyed hash is sufficient, since collision resistance against adversarial inputs subsumes not being able to reverse the hash, at least if the salt isn't leaked.

If HyperLogLogPlus's hashing is sufficient, perhaps we can control it and ensure that it is adequately keyed; then pinning it as a dependency (to ensure there's no regression) is arguably sufficient. But I think it's easier to reason that the cost of hashing one more time is basically 0 and worth the peace of mind here.
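One way to realize the keyed-hash-before-insertion idea above, using only std: `RandomState` is std's randomly keyed SipHash builder, so the key is generated at process start and never exposed. The `KeyedId` type and its method names are hypothetical, purely for illustration; the real code would feed the result to `HyperLogLogPlus::insert` instead of the raw ID.

```rust
// Sketch of keying short IDs before sketch insertion (types and names
// hypothetical). RandomState seeds SipHash with a secret per-process
// key, so published sketches can't be correlated with candidate IDs
// by anyone who doesn't hold the key.
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

struct KeyedId {
    state: RandomState, // secret per-process key
}

impl KeyedId {
    fn new() -> Self {
        Self { state: RandomState::new() }
    }

    /// Map a short ID to a keyed 64-bit value before sketch insertion.
    fn key<T: Hash>(&self, id: &T) -> u64 {
        let mut h = self.state.build_hasher();
        id.hash(&mut h);
        h.finish()
    }
}

fn main() {
    let keyer = KeyedId::new();
    let a = keyer.key(&[0u8; 8]);
    let b = keyer.key(&[0u8; 8]);
    let c = keyer.key(&[1u8; 8]);
    assert_eq!(a, b); // deterministic under one key
    assert_ne!(a, c); // distinct IDs map to distinct values (w.h.p.)
    println!("ok");
}
```

With this in place, the insertion site would become something like `sketch.insert(&keyer.key(&id.0))` rather than `sketch.insert(&id.0)`.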

@nothingmuch
Contributor

This PR addresses #1284. It counts cardinality for shortids over hourly, daily, and weekly time periods.

Concept ACK. Code looks good from a cursory read; I will review more thoroughly when it's out of draft.

This is still a work in progress. The current bottleneck is that the gauges are fixed snapshots, which don't let you query the number of unique IDs seen over an arbitrary time window X.

In principle, HLL sketches can be merged, which means we can take high-frequency sketches and aggregate them together in response to certain queries.
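The mergeability mentioned above can be illustrated with a toy register array (this is not the HyperLogLogPlus crate, and it omits the estimator entirely): each register keeps the maximum rank seen for its bucket, so merging two sketches is an element-wise max, and the merged result is identical to a sketch built from the union of both input streams. This is what would let hourly sketches be rolled up into daily or weekly ones on demand.

```rust
// Toy illustration of HLL mergeability (not the real crate; the
// cardinality estimator is omitted). Registers hold the max rank per
// bucket, so merge = element-wise max.
const M: usize = 64; // 2^6 registers (precision 6)

#[derive(Clone)]
struct ToyHll {
    registers: [u8; M],
}

impl ToyHll {
    fn new() -> Self {
        Self { registers: [0; M] }
    }

    fn insert(&mut self, hash: u64) {
        let idx = (hash >> 58) as usize; // top 6 bits pick a register
        let rank = (hash << 6).leading_zeros() as u8 + 1; // rank of the rest
        self.registers[idx] = self.registers[idx].max(rank);
    }

    /// Merge = element-wise max; identical to sketching the union stream.
    fn merge(&self, other: &Self) -> Self {
        let mut out = Self::new();
        for i in 0..M {
            out.registers[i] = self.registers[i].max(other.registers[i]);
        }
        out
    }
}

fn main() {
    let mut a = ToyHll::new();
    let mut b = ToyHll::new();
    let mut union = ToyHll::new();
    // Spread 1000 pseudo-hashes across two sketches and one union sketch.
    for h in (0u64..1000).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)) {
        if h % 2 == 0 { a.insert(h); } else { b.insert(h); }
        union.insert(h);
    }
    // Merging the halves reproduces the union sketch exactly.
    assert_eq!(a.merge(&b).registers, union.registers);
    println!("ok");
}
```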

If cardinality estimation is added to Prometheus directly, then that would be possible in theory (prometheus/prometheus#12233) without any additional considerations. But until then, I think it's perfectly fine to just output u64s at enough temporal resolutions to be useful and not allow estimating count-distinct over longer durations. My only suggestion here is to add a monthly aggregation interval as well.

Note that my comment about hashing short IDs before cardinality estimation only really makes a difference if native cardinality estimation is supported in Prometheus directly. So I would also be satisfied with just a comment saying that if cardinality estimation is later supported, IDs should be hashed before being given to Prometheus. For now, if we only publish the cardinality estimates resulting from the sketches, and the sketches themselves reside only in the mailroom's RAM, then there's no need for that additional hashing.
