当前位置：文江博客话题详情

解释计数草图算法

发布于 2024-11-26 01:41:22 字数 152 浏览 8 评论 0原文

有人可以解释一下计数草图算法是如何工作的吗？例如，我仍然无法弄清楚如何使用哈希值。我很难理解这篇论文。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

噩梦成真你也成魔 2024-12-03 01:41:22

该流算法实例化了以下框架。

找到一个随机流算法，其输出（作为随机变量）具有所需的期望，但通常方差较高（即噪声）。
要减少方差/噪声，请并行运行许多独立副本并合并其输出。

通常1比2更有趣。这个算法的2实际上有点不标准，但我只讨论1。

假设我们用三个计数器处理输入

a b c a b a .

，则不需要散列。

a: 3, b: 2, c: 1

然而，让我们假设我们只有一个。有八个可能的函数 h : {a, b, c} -> {+1，-1}。这是结果表。

 h  |
abc |  X = counter
----+--------------
+++ | +3 +2 +1 =  6
++- | +3 +2 -1 =  4
+-- | +3 -2 -1 =  0
+-+ | +3 -2 +1 =  2
--+ | -3 -2 +1 = -4
--- | -3 -2 -1 = -6
-+- | -3 +2 -1 = -2
-++ | -3 +2 +1 =  0

现在我们可以计算期望

            (6 + 4 + 0 + 2) - (-4 + -6 + -2 + 0)
E[h(a) X] = ------------------------------------ = 24/8 = 3
                             8

            (6 + 4 + -2 + 0) - (0 + 2 + -4 + -6)
E[h(b) X] = ------------------------------------ = 16/8 = 2
                             8

            (6 + 2 + -4 + 0) - (4 + 0 + -6 + -2)
E[h(c) X] = ------------------------------------ =  8/8 = 1 .
                             8

这里发生了什么？例如，对于 a，我们可以分解 X = Y + Z，其中 Y 是 a< 总和的变化/code>s，Z 是非 a 的总和。根据期望的线性，我们有

E[h(a) X] = E[h(a) Y] + E[h(a) Z] .

E[h(a) Y] 是一个总和，其中每次出现 a 都有一个项，即 h(a) ^2 = 1，因此E[h(a) Y]是a出现的次数。另一项 E[h(a) Z] 为零；即使给定 h(a)，每个其他哈希值也同样可能加或减 1，因此对期望的贡献为零。

事实上，哈希函数不需要是均匀随机的，这是一件好事：没有办法存储它。哈希函数成对独立就足够了（任何两个特定的哈希值都是独立的）。对于我们的简单示例，随机选择以下四个函数就足够了。

abc

+++
+--
-+-
--+

我将把新的计算留给你。

This streaming algorithm instantiates the following framework.

Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).
To reduce the variance/noise, run many independent copies in parallel and combine their outputs.

Usually 1 is more interesting than 2. This algorithm's 2 actually is somewhat nonstandard, but I'm going to talk about 1 only.

Suppose we're processing the input

a b c a b a .

With three counters, there's no need to hash.

a: 3, b: 2, c: 1

Let's suppose however that we have just one. There are eight possible functions h : {a, b, c} -> {+1, -1}. Here is a table of the outcomes.

 h  |
abc |  X = counter
----+--------------
+++ | +3 +2 +1 =  6
++- | +3 +2 -1 =  4
+-- | +3 -2 -1 =  0
+-+ | +3 -2 +1 =  2
--+ | -3 -2 +1 = -4
--- | -3 -2 -1 = -6
-+- | -3 +2 -1 = -2
-++ | -3 +2 +1 =  0

Now we can calculate expectations

            (6 + 4 + 0 + 2) - (-4 + -6 + -2 + 0)
E[h(a) X] = ------------------------------------ = 24/8 = 3
                             8

            (6 + 4 + -2 + 0) - (0 + 2 + -4 + -6)
E[h(b) X] = ------------------------------------ = 16/8 = 2
                             8

            (6 + 2 + -4 + 0) - (4 + 0 + -6 + -2)
E[h(c) X] = ------------------------------------ =  8/8 = 1 .
                             8

What's going on here? For a, say, we can decompose X = Y + Z, where Y is the change in the sum for the as, and Z is the sum for the non-as. By the linearity of expectation, we have

E[h(a) X] = E[h(a) Y] + E[h(a) Z] .

E[h(a) Y] is a sum with a term for each occurrence of a that is h(a)^2 = 1, so E[h(a) Y] is the number of occurrences of a. The other term E[h(a) Z] is zero; even given h(a), each other hash value is equally likely to be plus or minus one and so contributes zero in expectation.

In fact, the hash function doesn't need to be uniform random, and good thing: there would be no way to store it. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice of the following four functions suffices.

abc

+++
+--
-+-
--+

I'll leave the new calculations to you.

回复收藏 0 原文

哭泣的笑容 2024-12-03 01:41:22

Count Sketch 是一种概率数据结构，它允许您回答以下问题：

读取元素流a1< /code>, a2, a3, ..., an 其中可能有许多重复元素，您可以在以下位置找到以下问题的答案任何时间：到目前为止，您见过多少个 ai 元素？

只需维护从 ai 到到目前为止您所看到的元素的数量。记录新观测值的成本为 O(1)，检查给定元素的观测值计数也是如此。但是，存储此映射需要花费 O(n) 空间，其中 n 是不同元素的数量。

Count Sketch 将如何帮助您？与所有概率数据结构一样，您会牺牲确定性来换取空间。 Count Sketch 允许您选择两个参数：结果的准确性 (ε) 和错误估计的概率 (δ)。

为此，您需要选择一系列 d 成对独立的哈希函数。这些复杂的单词意味着它们不会经常发生冲突（事实上，如果两个哈希值都将值映射到空间[0, m]，那么冲突的概率大约为1/m^2< /em>）。每个哈希函数都将值映射到空间 [0, w]，因此您可以创建一个 d * w 矩阵。

当您读取该元素时，您将计算该元素的每个 d 哈希值并更新草图中的相应值。这部分对于计数草图和计数分钟草图是相同的。

Insomniac 很好地解释了 Count Sketch 的想法（计算期望值），所以我只想指出，使用 Count-min Sketch 一切都变得更加简单。您只需计算想要获取的值的 d 个哈希值并返回其中最小的值。令人惊讶的是，这提供了强大的准确性和概率保证，您可以在这里找到。

增加哈希函数的范围可以提高结果的准确性，同时增加哈希值的数量可以降低错误估计的概率：
ε = e/w 和δ=1/e^d。另一个有趣的事情是，该值总是被高估（如果您找到该值，它很可能大于实际值，但肯定不会小于）。

Count Sketch is a probabilistic data structure which allows you to answer the following question:

Reading a stream of elements a1, a2, a3, ..., an where there can be many repeated elements, you the answer to the following question at any time: How many ai elements have you seen so far?

You can clearly get an exact answer at any time just by maintaining the mapping from ai to the count of those elements you've seen so far. Recording new observations costs O(1), as does checking the observed count for a given element. However, it costs O(n) space to store this mapping, where n is the number of distinct elements.

How is Count Sketch is going to help you? As with all probabilistic data structures you sacrifice certainty for space. Count Sketch allows you to select two parameters: accuracy of the results (ε) and probability of bad estimate (δ).

To do this you select a family of d pairwise-independent hash functions. These complicated words mean that they do not collide too often (in fact, if both hashes map values onto space [0, m] then the probability of collision is approximately 1/m^2). Each of these hash functions maps the values to a space [0, w], so you create a d * w matrix.

When you read the element, you calculate each of d hashes of this element and update the corresponding values in the sketch. This part is the same for Count Sketch and Count-min Sketch.

Insomniac nicely explained the idea (calculating expected value) for Count Sketch, so I will just note that with Count-min Sketch everything is even simpler. You just calculate d hashes of the value you want to get and return the smallest of them. Surprisingly this provides a strong accuracy and probability guarantee, which you can find here.

Increasing the range of the hash functions increases the accuracy of results, while increasing the number of hashes decreases the probability of bad estimate:
ε = e/w and δ=1/e^d. Another interesting thing is that the value is always overestimated (if you found the value, it is most probably bigger than the real value, but surely not smaller).

回复收藏 0 原文

~没有更多了~