Conversation
This PR addresses payjoin#1284. It counts the cardinality of short IDs over hourly, daily, and weekly time periods.
```rust
let hour = secs / 3600;
let day = secs / 86400;

self.hourly.entry(hour).or_insert_with(new_sketch).insert(&id.0);
```
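The bucketing arithmetic above can be sanity-checked with a small worked example (the timestamp is an arbitrary illustration, not a value from the PR):

```rust
fn main() {
    // Arbitrary example Unix timestamp (seconds since the epoch).
    let secs: u64 = 1_700_000_000;

    // Integer division assigns each timestamp to a fixed-width bucket.
    let hour = secs / 3600;        // hourly bucket index
    let day = secs / 86400;        // daily bucket index
    let week = secs / (7 * 86400); // weekly bucket index

    // A day bucket is exactly 24 hour buckets wide.
    assert_eq!(day, hour / 24);
    println!("hour={hour} day={day} week={week}");
}
```

All IDs seen in the same hour land in the same `hourly` entry, so each bucket holds one sketch covering exactly that interval.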
tl;dr - hashing `&id` with a keyed hash before insertion here would eliminate a theoretical leak.

Since we don't know what hashing HyperLogLogPlus does internally, I think there's a chance that an adversary with access to the metrics and a list of candidate short IDs could confirm that a subset of those short IDs was included, if they are lucky (i.e. those happen to be the extremal values the sketch retains).

To prevent this kind of leak, the short IDs could be hashed (with a keyed hash, where the key is secret) before insertion into the HLL sketch. The resulting keys would be an (almost) 1:1 mapping; some hash functions are bijective on u64, but I don't think that's really valuable here, since the chance of collisions between non-duplicate entries is essentially zero for the cardinality we expect.

So basically, by hashing before insertion with some keyed hash (or even a keyed cryptographic hash), we can ensure that even the complete HLL sketch, if published, will not leak information about the specific short IDs that were used during the relevant time period.

The overhead of even a cryptographic hash here should be negligible, so perhaps the safest approach is to compute a salted SHA-256 of the short ID or something like that, but I think a non-cryptographic keyed hash is sufficient: collision resistance against adversarial inputs subsumes not being able to reverse the hash, at least if the salt isn't leaked.

If HyperLogLogPlus's internal hashing is sufficient, perhaps we can control it, ensure that it is adequately keyed, and pin the dependency (to ensure there's no regression); that is arguably sufficient, but I think it's easier to reason that the cost of hashing one more time is basically zero and worth the peace of mind here.
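A minimal sketch of the idea, using std's `RandomState` (SipHash with a per-process random key) as the keyed hash; the helper name and `short_id` values are illustrative, not from the PR:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hash a short ID with a secret, process-local key before it ever
/// reaches the HLL sketch, so the sketch's extremal values cannot be
/// matched against a list of candidate short IDs.
fn keyed_hash(key: &RandomState, short_id: u64) -> u64 {
    let mut h = key.build_hasher();
    h.write_u64(short_id);
    h.finish()
}

fn main() {
    // One secret key for the whole process; never published.
    let key = RandomState::new();

    // The mapping is deterministic under a fixed key...
    assert_eq!(keyed_hash(&key, 42), keyed_hash(&key, 42));
    // ...and distinct inputs collide only with negligible probability,
    // so cardinality estimates are unaffected.
    assert_ne!(keyed_hash(&key, 1), keyed_hash(&key, 2));
}
```

The sketch would then store `keyed_hash(&key, id.0)` instead of `id.0`; as long as the key stays secret, a published sketch reveals nothing about any particular short ID.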
Concept ACK. Code looks good from a cursory read, I will review more thoroughly when it's out of draft
In principle, HLL sketches can be merged, which means we could take high-frequency sketches and aggregate them together in response to certain queries. If cardinality estimation is added to Prometheus directly (prometheus/prometheus#12233), that would be possible in theory without any additional considerations, but until then I think it's perfectly fine to just output u64s at enough temporal resolutions to be useful and not allow estimating count-distinct over longer durations. My only suggestion here is to also add a monthly aggregation interval.

Note that my comment about hashing short IDs before cardinality estimation only really makes a difference if native cardinality estimation is supported in Prometheus directly, so I would also be satisfied with just a comment saying that if cardinality estimation is later supported, IDs should be hashed before giving them to Prometheus. For now, if we only publish the cardinality estimates resulting from the sketches, and the sketches themselves only reside in the mailroom's RAM, then there's no need for that additional hashing.
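For reference, the merge alluded to above is just an element-wise maximum over the two sketches' register arrays; this is a from-scratch illustration of that property, not the hyperloglogplus crate's API:

```rust
/// Merge two HLL sketches of identical precision by taking the
/// element-wise maximum of their registers. The merged sketch
/// estimates the cardinality of the union of the two input streams.
fn merge_registers(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len(), "sketches must share a precision");
    a.iter().zip(b).map(|(x, y)| *x.max(y)).collect()
}

fn main() {
    // Toy 8-register sketches, e.g. two adjacent hourly buckets.
    let hour_a = [3u8, 0, 5, 1, 0, 2, 4, 0];
    let hour_b = [1u8, 2, 2, 6, 0, 2, 0, 7];

    // Merging is lossless with respect to the estimate: the result is
    // exactly the sketch the combined stream would have produced.
    let merged = merge_registers(&hour_a, &hour_b);
    assert_eq!(merged, vec![3, 2, 5, 6, 0, 2, 4, 7]);
}
```

Because the merge is associative and idempotent, a daily (or monthly) estimate could be served by folding the stored hourly sketches on demand, without double-counting IDs that appear in several hours.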
This is still a work in progress.

The current bottleneck is that the gauges are fixed snapshots, which doesn't let you query the number of unique IDs seen over an arbitrary window. The internal data to answer those queries exists (168 hourly sketches and 90 daily sketches are stored in memory), but it's not exposed yet.