Uniquely identifying URLs with one 64-bit number
This is basically a math problem, but it is very programming-related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good, fast hash function. Also, native support in MySQL would be nice.
EDIT: not quite a duplicate
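To make the setup concrete, here is a minimal sketch of "the first 64 bits of the MD5 hash" as an unsigned integer. The function name `md5_first64`, the UTF-8 encoding, and the big-endian byte order are assumptions chosen for illustration; the question does not specify them.

```python
import hashlib


def md5_first64(url: str) -> int:
    """Return the first 64 bits of the MD5 digest of `url` as an unsigned int.

    Assumptions (not from the question): the URL is encoded as UTF-8 and the
    first 8 digest bytes are read in big-endian order.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return int.from_bytes(digest[:8], "big")            # keep the first 8 bytes
```

With this byte order the value is the same as the first 16 hex characters of the digest read as a base-16 number, which in MySQL should correspond to something like `CONV(LEFT(MD5(url), 16), 16, 10)`.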
Comments (5)
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you should expect collisions once you reach roughly 2^32 URLs. More precisely, for n URLs well below that threshold, the probability of at least one collision is approximately n^2 / 2^65: about 2.7% for 1 billion URLs and about 0.03% for 100 million. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
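The XOR suggestion can be sketched as follows. The function name `md5_xor_fold64` and the UTF-8 / big-endian choices are illustrative assumptions, not something the answer specifies.

```python
import hashlib


def md5_xor_fold64(url: str) -> int:
    """Fold the 128-bit MD5 digest of `url` to 64 bits by XORing its halves.

    Assumptions: UTF-8 encoding and big-endian byte order for both halves.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes
    high = int.from_bytes(digest[:8], "big")   # first 64 bits
    low = int.from_bytes(digest[8:], "big")    # last 64 bits
    return high ^ low                          # every digest bit contributes
```

The result still fits in an unsigned 64-bit column; the only difference from plain truncation is that all 128 digest bits influence the value.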
You have tagged this as "birthday-paradox", so I think you know the answer already: it is the birthday-problem collision probability, where n is 1 billion in your case.
You will be a bit better off using something other than MD5, because MD5 has practical collision problems.
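For reference, one common form of the birthday approximation is worked through below. Which exact formula this answer had in mind is an assumption; the symbols N (number of possible hash values) and p(n) are introduced here for illustration, and the hash is assumed to be uniformly distributed over 64 bits.

```latex
% n items hashed uniformly into N = 2^{64} possible values
p(n) \;\approx\; 1 - e^{-\frac{n(n-1)}{2N}} \;\approx\; \frac{n^2}{2N}
\qquad (n \ll \sqrt{N})

% n = 10^9:
\frac{n^2}{2N} = \frac{10^{18}}{2^{65}} \approx 0.0271
\;\Rightarrow\; p \approx 1 - e^{-0.0271} \approx 2.7\%

% n = 10^8:
\frac{n^2}{2N} = \frac{10^{16}}{2^{65}} \approx 2.7 \times 10^{-4}
\;\Rightarrow\; p \approx 0.027\%
```

The same quantity n^2 / (2N) is also, to a good approximation, the expected number of colliding pairs.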
From what I see, you need a hash function with the following requirements,
This hash function survey may be useful for drilling down to the function most suitable for you.
I suggest trying out multiple functions from there and characterizing them for your likely input set (pick a few billion URLs that you think you will see).
You can actually generate another column, like the ones in that survey's test table, for your own test URL list, and use it to characterize and select from the existing hash functions or any new ones (more rows in that table) that you might want to check. They have MSVC++ source code to start with (see the ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
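A rough sketch of what such a characterization could look like in Python, using only standard-library hashes. The candidate set, the names `characterize` and `CANDIDATES`, and the `bits` parameter are all assumptions made for illustration; the survey's own functions and test columns are not reproduced here. Because a realistic sample will show no collisions at the full 64 bits, the sketch optionally masks each hash down to fewer bits so that differences in distribution become measurable.

```python
import hashlib
import time
import zlib


def md5_64(data: bytes) -> int:
    # First 64 bits of MD5, read big-endian (illustrative choice).
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def blake2b_64(data: bytes) -> int:
    # BLAKE2b configured to emit an 8-byte (64-bit) digest directly.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def crc32_32(data: bytes) -> int:
    # Only 32 bits wide; included as a fast, weaker baseline.
    return zlib.crc32(data)


# Hypothetical candidate table; add rows for any function you want to compare.
CANDIDATES = {"md5_64": md5_64, "blake2b_64": blake2b_64, "crc32_32": crc32_32}


def characterize(urls, candidates=CANDIDATES, bits=24):
    """Time each candidate and count collisions after masking to `bits` bits."""
    unique = [u.encode("utf-8") for u in set(urls)]  # duplicate URLs are not collisions
    mask = (1 << bits) - 1
    report = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        values = {fn(data) & mask for data in unique}
        elapsed = time.perf_counter() - start
        report[name] = {
            "seconds": elapsed,
            "collisions": len(unique) - len(values),
        }
    return report
```

For an ideal hash, the collision count at a given `bits` should be close to n^2 / 2^(bits+1) for n distinct URLs, as long as n is well below 2^(bits/2); a candidate that produces far more than that on your own URL sample is a poor fit.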
If you have 2^n hash possibilities, collisions become likely once you have about 2^(n/2) items: at exactly 2^(n/2) items the chance of at least one collision is roughly 39%, and it passes 50% at about 1.18 × 2^(n/2) items.
For example, if your hash is 64 bits, you have 2^64 hash possibilities, so you'd have a roughly 39% chance of a collision with 2^32 items in a collection, and a 50% chance with about 5 billion items.
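The threshold can be checked numerically with the standard birthday approximation. This is a small sketch that assumes a uniformly distributed hash; the function names are made up for illustration.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate P(at least one collision) among `items` uniform `bits`-bit hashes."""
    space = 2.0 ** bits
    # 1 - exp(-k(k-1)/(2N)); expm1 keeps precision when the result is tiny.
    return -math.expm1(-items * (items - 1) / (2.0 * space))


def items_for_probability(p: float, bits: int) -> float:
    """Approximate number of items at which the collision probability reaches `p`."""
    return math.sqrt(2.0 * (2.0 ** bits) * math.log(1.0 / (1.0 - p)))
```

Under this approximation, `collision_probability(2**32, 64)` comes out near 0.39, and `items_for_probability(0.5, 64)` is about 1.18 × 2^32, or roughly 5.06 billion items.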
Just by using a hash, there is always a chance of collisions. And you don't know beforehand whether collisions will happen once or twice, or even hundreds or thousands of times, in your list of URLs.
The probability is still just a probability. It's like throwing a die 10 or 100 times: what are the chances of getting all sixes? The probability says it is low, but it can still happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide whether collisions are acceptable or not.
...and collisions are acceptable, and hashes are still the right way to go; find a 64-bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)
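If you would rather measure than estimate, collisions in a concrete URL list can simply be detected. The sketch below is illustrative only: `hash64` and `find_collisions` are hypothetical names, the default hash is the "half-a-MD5" from the question, and the approach assumes the list fits in memory, which will not hold for a billion URLs without sharding or an on-disk index.

```python
import hashlib


def hash64(url: str) -> int:
    # "Half-a-MD5": first 64 bits of the digest, big-endian (illustrative choice).
    return int.from_bytes(hashlib.md5(url.encode("utf-8")).digest()[:8], "big")


def find_collisions(urls, hash_fn=hash64):
    """Return {hash: sorted distinct URLs} for every hash shared by 2+ distinct URLs."""
    seen = {}
    for url in urls:
        # Group URLs by hash; a set ignores repeated occurrences of the same URL.
        seen.setdefault(hash_fn(url), set()).add(url)
    return {h: sorted(group) for h, group in seen.items() if len(group) > 1}
```

Passing a different `hash_fn` makes it easy to compare a candidate 64-bit hash against the truncated MD5 on the same data.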