Shingleprinting在实践中如何运作?
我正在尝试使用 shingleprinting 来测量文档相似性。该过程涉及以下步骤:
- 创建一个 5-shingling 两个文档 D1、D2
- 用 64 位散列对每个 shingle 进行散列
- 选择从 0 到 2^64-1 的数字的随机排列并应用于 shingle 散列
- 对于每个文档,找到结果值中的最小值
- (如果它们匹配)将其视为正例,如果不将其视为负例,则
- 重复 3. 至 5. 几次
- 使用
positive_examples /total Examples
作为相似性度量
步骤 3 涉及生成非常随机的排列长序列。使用 Knuth-shuffle 似乎是不可能的。这有什么捷径吗?请注意,最终我们只需要结果排列中的一个元素。
I'm trying to use shingleprinting to measure document similarity. The process involves the following steps:
- Create a 5-shingling of the two documents D1, D2
- Hash each shingle with a 64-bit hash
- Pick a random permutation of the numbers from 0 to 2^64-1 and apply to shingle hashes
- For each document find the smallest of the resulting values
- If they match count it as a positive example, if not count it as a negative example
- Repeat 3. to 5. a few times
- Use
positive_examples / total examples
as the similarity measure
Step 3 involves generating a random permutation of a very long sequence. Using a Knuth-shuffle seems out of the question. Is there some shortcut for this? Note that in the end we need only a single element of the resulting permutation.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
警告:我对此并不百分百肯定,但我读过一些论文,我相信这就是它的工作原理。例如,在 Piotr Indyk 的“一个小型的近似最小独立的哈希函数族”中,他写道“在与 Altavista 集成的实现中,集合 H 被选择为成对独立的哈希函数族”。
在步骤 3 中,您实际上不需要对 [n](从 1 到 n 的整数)进行随机排列。事实证明,成对独立的哈希函数在实践中是有效的。所以你要做的就是选择一个成对独立的哈希函数 h。然后将 h 应用于每个 shingle 哈希值。您可以在步骤 4 中取这些值的最小值。
标准的成对独立哈希函数为 h(x) = ax + b (mod p),其中 a 和 b 是随机选择的,p 是素数。
参考文献: http://www.cs.princeton.edu/courses /archive/fall08/cos521/hash.pdf 和 http://people. csail.mit.edu/indyk/minwise99.ps
Warning: I'm not 100% positive about this, but I've read some of the papers and I believe this is how it works. For instance, in "A small approximately min-wise independent family of hash functions" by Piotr Indyk, he writes "In the implementation integrated with Altavista, the set H was chosen to be a pairwise independent family of hash functions."
In step 3, you don't actually need a random permutation on [n] (the integers from 1 to n). It turns out that a pairwise-independent hash function works in practice. So what you do is pick a pairwise-independent hash function h. And then apply h to each of the shingle hashes. You can take the min of those values in step 4.
A standard pairwise-independent hash function is h(x) = ax + b (mod p), where a and b are chosen randomly and p is a prime.
References: http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf and http://people.csail.mit.edu/indyk/minwise99.ps