Conversation
This PR addresses payjoin#1284. It counts the cardinality of short IDs over hourly, daily, and weekly time periods.
```rust
let hour = secs / 3600;
let day = secs / 86400;

self.hourly.entry(hour).or_insert_with(new_sketch).insert(&id.0);
```
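The bucketing arithmetic above can be sanity-checked with a small worked example (the timestamp is an arbitrary illustration, not a value from the PR):

```rust
fn main() {
    // Arbitrary example Unix timestamp (seconds since the epoch).
    let secs: u64 = 1_700_000_000;

    // Integer division assigns each timestamp to a fixed-width bucket.
    let hour = secs / 3600;        // hourly bucket index
    let day = secs / 86400;        // daily bucket index
    let week = secs / (7 * 86400); // weekly bucket index

    // A day bucket is exactly 24 hour buckets wide.
    assert_eq!(day, hour / 24);
    println!("hour={hour} day={day} week={week}");
}
```

All IDs seen in the same hour land in the same `hourly` entry, so each bucket holds one sketch covering exactly that interval.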
tl;dr - hashing `&id` with a keyed hash before insertion here would eliminate a theoretical leak.

Since we don't know what hashing HyperLogLogPlus does internally, I think there's a chance that an adversary with access to the metrics and a list of candidate short IDs could confirm that a subset of those short IDs was included, if they are lucky (i.e. those happen to be the extremal values the sketch retains).

To prevent this kind of leak, the short IDs could be hashed (with a keyed hash, where the key is secret) before insertion into the HLL sketch. The resulting keys would be an (almost) 1:1 mapping; some hash functions are bijective on u64, but I don't think that's really valuable here, since the chance of collisions between non-duplicate entries is essentially zero for the cardinality we expect.

So basically, by hashing before insertion with some keyed hash (or even a keyed cryptographic hash), we can ensure that even the complete HLL sketch, if published, will not leak information about the specific short IDs that were used during the relevant time period.

The overhead of even a cryptographic hash here should be negligible, so perhaps the safest approach is to compute a salted SHA-256 of the short ID or something like that, but I think a non-cryptographic keyed hash is sufficient: collision resistance against adversarial inputs subsumes not being able to reverse the hash, at least if the salt isn't leaked.

If HyperLogLogPlus's internal hashing is sufficient, perhaps we can control it, ensure that it is adequately keyed, and pin the dependency (to ensure there's no regression); that is arguably sufficient, but I think it's easier to reason that the cost of hashing one more time is basically zero and worth the peace of mind here.
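A minimal sketch of the idea, using std's `RandomState` (SipHash with a per-process random key) as the keyed hash; the helper name and `short_id` values are illustrative, not from the PR:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hash a short ID with a secret, process-local key before it ever
/// reaches the HLL sketch, so the sketch's extremal values cannot be
/// matched against a list of candidate short IDs.
fn keyed_hash(key: &RandomState, short_id: u64) -> u64 {
    let mut h = key.build_hasher();
    h.write_u64(short_id);
    h.finish()
}

fn main() {
    // One secret key for the whole process; never published.
    let key = RandomState::new();

    // The mapping is deterministic under a fixed key...
    assert_eq!(keyed_hash(&key, 42), keyed_hash(&key, 42));
    // ...and distinct inputs collide only with negligible probability,
    // so cardinality estimates are unaffected.
    assert_ne!(keyed_hash(&key, 1), keyed_hash(&key, 2));
}
```

The sketch would then store `keyed_hash(&key, id.0)` instead of `id.0`; as long as the key stays secret, a published sketch reveals nothing about any particular short ID.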
Concept ACK. Code looks good from a cursory read, I will review more thoroughly when it's out of draft
In principle, HLL sketches can be merged, which means we could take high-frequency sketches and aggregate them together in response to certain queries. If cardinality estimation is added to Prometheus directly (prometheus/prometheus#12233), that would be possible in theory without any additional considerations, but until then I think it's perfectly fine to just output u64s at enough temporal resolutions to be useful and not allow estimating count-distinct over longer durations. My only suggestion here is to also add a monthly aggregation interval.

Note that my comment about hashing short IDs before cardinality estimation only really makes a difference if native cardinality estimation is supported in Prometheus directly, so I would also be satisfied with just a comment saying that if cardinality estimation is later supported, IDs should be hashed before giving them to Prometheus. For now, if we only publish the cardinality estimates resulting from the sketches, and the sketches themselves only reside in the mailroom's RAM, then there's no need for that additional hashing.
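For reference, the merge alluded to above is just an element-wise maximum over the two sketches' register arrays; this is a from-scratch illustration of that property, not the hyperloglogplus crate's API:

```rust
/// Merge two HLL sketches of identical precision by taking the
/// element-wise maximum of their registers. The merged sketch
/// estimates the cardinality of the union of the two input streams.
fn merge_registers(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "sketches must share a precision");
    a.iter().zip(b).map(|(x, y)| *x.max(y)).collect()
}

fn main() {
    // Toy 8-register sketches, e.g. two adjacent hourly buckets.
    let hour_a = [3u8, 0, 5, 1, 0, 2, 4, 0];
    let hour_b = [1u8, 2, 2, 6, 0, 2, 0, 7];

    // Merging is lossless with respect to the estimate: the result is
    // exactly the sketch the combined stream would have produced.
    let merged = merge_registers(&hour_a, &hour_b);
    assert_eq!(merged, vec![3, 2, 5, 6, 0, 2, 4, 7]);
}
```

Because the merge is associative and idempotent, a daily (or monthly) estimate could be served by folding the stored hourly sketches on demand, without double-counting IDs that appear in several hours.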
This is still a work in progress.

The current bottleneck is that the gauges are fixed snapshots, which doesn't let you query the number of unique IDs seen over an arbitrary window. The internal data to answer those queries exists (168 hourly sketches and 90 daily sketches are stored in memory), but it's not exposed yet.