了解 Count Sketch 数据结构和相关算法

发布于 2024-12-25 02:05:24 字数 628 浏览 0 评论 0原文

致力于了解 CountSketch 数据结构及其相关算法。它似乎是寻找流数据中常见元素的一个很好的工具,它的附加性质使得它具有一些有趣的属性,可以发现频率的巨大变化,也许类似于 Twitter 用于趋势主题的东西。

论文对于那些已经接触过的人来说有点难以理解暂时远离更多的学术方法,这里有上一篇文章确实帮助了一些,至少对我来说仍然留下了很多问题。

据我了解,Count Sketch 结构类似于布隆过滤器。然而,哈希函数的选择让我感到困惑。该结构是一个 N × M 表,具有 N 个散列函数,其中 M 个可能的值确定要更改的“桶”,每个 N 的另一个散列函数 s 是“成对独立”的,

散列是从通用散列族中选择的,说一下 h(x) = ((ax+b) % some_prime) % M 吗?

如果是这样,返回 +1 或 -1 的 s 哈希值是从哪里选择的?从其中一个桶中减去的原因是什么?

Working on wrapping my head around the CountSketch data structure and its associated algorithms. It seems to be a great tool for finding common elements in streaming data, and the additive nature of it makes for some fun properties with finding large changes in frequency, perhaps similar to what Twitter uses for trending topics.

The paper is a little difficult to understand for someone that has been away from more academic approaches for a while, and a previous post here did help some, for me at least it still left quite a few questions.

As I understand it, the Count Sketch structure is similar to a bloom filter. However the selection of hash functions has me confused. The structure is an N by M table with N hash functions with M possible values determining the "bucket" to alter, and another hash function s for each N that is "pairwise independent"

Are the hashes to be selected from a universal hashing family, say something of the h(x) = ((ax+b) % some_prime) % M?

And if so, where are the s hashes that return either +1 or -1 chosen from? And what is the reason for ever subtracting from one of the buckets?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

苍暮颜 2025-01-01 02:05:24

他们从存储桶中减去,以使由其他事件引起的加法/减法的平均效果为0。如果一半的时间我添加“foo”的计数,一半的时间我减去“foo”的计数,那么在期望中,“foo”的计数不会影响“bar”计数的估计。

选择像您描述的那样的通用哈希函数确实可行,但它对于理论而不是实践来说最重要。对您最喜欢的合理哈希函数加盐也可以,您只是无法使用一些固定哈希函数根据预期值有意义地编写证明。

They subtract from the buckets to make average effect of additions/subtractions caused by other occurrences to be 0. If half the time I add the count of 'foo', and half the time I subtract the count of 'foo', then in expectation, the count of 'foo' does not influence the estimate of the count for 'bar'.

Picking a universal hash function like you describe will indeed work, but it's mostly important for the theory rather than the practice. Salting your favorite reasonable hash function will work too, you just can't meaningfully write proofs based on the expected values using a few fixed hash functions.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文