哈希函数组合——碰撞风险是否显着降低?

发布于 2024-08-02 22:52:37 字数 379 浏览 10 评论 0原文

有谁知道通过组合哈希函数来降低冲突概率是否有真正的好处?我特别需要了解 32 位哈希,即结合 Adler32 和 CRC32。 基本上,adler32(crc32(data)) 产生的碰撞概率会比 crc32(data) 更小吗? 最后一条评论此处给出了一些支持合并的测试结果,但没有提及来源。 就我的目的而言,碰撞并不重要(即任务不涉及安全),但如果可能的话,我宁愿尽量减少概率。 PS:我刚刚开始进入奇妙的哈希世界,并阅读了大量相关内容。抱歉,如果我问了一个愚蠢的问题,我什至还没有获得正确的“哈希方言”,可能我对此的谷歌搜索也很糟糕。 谢谢。

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

我爱人 2024-08-09 22:52:37

像这样将它们串联起来是没有意义的。您正在将一个 32 位空间散列到另一个 32 位空间。

在第一步发生crc32碰撞的情况下,最终的结果仍然是碰撞。然后在 adler32 步骤中添加任何潜在的冲突。所以它不可能变得更好,只能是一样或更糟。

为了减少冲突,您可以尝试独立使用两个哈希值来创建 64 位输出空间:

adler32(data) << 32 | 32 crc32(data)

这样做是否有显着的好处,我不确定。

请注意,您提到的原始评论是独立存储哈希值:

无论你使用哪种算法
会有一些错误的机会
积极的一面。但是,您可以减少
这些机会有相当大的差距
通过使用两种不同的散列
算法。如果你要计算
并存储 CRC32 和
Alder32 对于每个 url,a 的几率
两个哈希同时发生冲突
对于任何给定的 url 对来说
减少。

当然,这意味着存储两倍
许多信息是其中的一部分
你原来的问题。然而,有
是存储两组哈希值的一种方式
数据,使其需要最少的
内存(10kb左右)同时给予
几乎相同的查找性能(15
微秒/查找与 5 相比
微秒)作为 Perl 的哈希值。

This doesn't make sense combining them in series like that. You are hashing one 32-bit space to another 32-bit space.

In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.

To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:

adler32(data) << 32 | crc32(data)

Whether there is significant benefit in doing that, I'm not sure.

Note that the original comment you referred to was storing the hashes independently:

Whichever algorithm you use there is
going to be some chance of false
positives. However, you can reduce
these chances by a considerable margin
by using two different hashing
algorithms. If you were to calculate
and store both the CRC32 and the
Alder32 for each url, the odds of a
simultaneous collision for both hashes
for any given pair of urls is vastly
reduced.

Of course that means storing twice as
much information which is a part of
your original problem. However, there
is a way of storing both sets of hash
data such that it requires minimal
memory (10kb or so) whilst giving
almost the same lookup performance (15
microsecs/lookup compared to 5
microsecs) as Perl's hashes.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文