How safe is it to assume the uniqueness of part of a SHA1 hash?
I'm currently using SHA1 to somewhat shorten a URL:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA1 as a unique identifier, as GitHub apparently does for commits?
Comments (4)
To calculate the probability of a collision for a given hash length and number of hashes, see the birthday problem. I don't know how many hashes you are going to have, but here are some examples. 8 hexadecimal characters are 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000, for 10,000 hashes it's about 1/100, and for 100,000 hashes it's about 0.69 (roughly 2 in 3).
See the table in the Birthday attack article on Wikipedia to find a hash length that satisfies your needs. For example, if you want a collision to be less likely than 1/1,000,000,000 for a set of more than 100,000 hashes, use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes you are going to have and what probability of a collision you are willing to accept (because there is always some probability, even if it is insanely small).
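Figures like these can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/(2N)), where n is the number of hashes and N = 2^bits is the number of possible values. Here is a minimal Ruby sketch, assuming the truncated SHA1 output behaves like uniformly random bits. The method name collision_probability is made up for illustration.

```ruby
# Approximate probability that at least two of `count` uniformly random
# values collide when each value has `bits` bits (birthday approximation).
def collision_probability(count, bits)
  space = 2.0**bits                                # number of possible values
  exponent = count * (count - 1) / (2.0 * space)   # expected colliding pairs
  1.0 - Math.exp(-exponent)
end

# Print the probability for a few set sizes and hash lengths.
[[100, 32], [10_000, 32], [100_000, 32], [100_000, 64]].each do |count, bits|
  printf("%7d hashes, %2d bits: %.3g\n", count, bits, collision_probability(count, bits))
end
```

Plugging in your own expected number of URLs gives a quick feel for whether 8 hex characters are enough.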
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, for a total of 32 bits, or about 4.3 billion possible values. The number of hashes you can generate before a collision becomes likely is on the order of the square root of that maximum value, so about 65,536. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs anyway to translate a shortened URL back to the original, you basically just store each new URL in that table. If the URL is already present, you give out its existing number. Otherwise, you insert it and assign it a new number. Either way, you give that number to the user.
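The counter approach might look roughly like the following in-memory sketch. The class name CounterShortener and its methods are hypothetical, and a real service would presumably keep the table in a database instead of two Ruby hashes.

```ruby
# In-memory sketch of a counter-based shortener (hypothetical names).
class CounterShortener
  def initialize
    @id_by_url = {}   # original URL => assigned number
    @url_by_id = {}   # assigned number => original URL
    @next_id = 1
  end

  # Returns the existing number for a known URL, or assigns the next one.
  def shorten(url)
    @id_by_url[url] ||= begin
      id = @next_id
      @next_id += 1
      @url_by_id[id] = url
      id
    end
  end

  # Looks up the original URL for a number (nil if the number is unknown).
  def expand(id)
    @url_by_id[id]
  end
end

shortener = CounterShortener.new
first  = shortener.shorten("http://example.com/a")
second = shortener.shorten("http://example.com/b")
again  = shortener.shorten("http://example.com/a")
puts [first, second, again].inspect
```

Shortening the same URL twice returns the same number, and numbers can never collide. If shorter strings matter, the number could be rendered in a higher base, for example with Integer#to_s(36).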
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regard to the input (the output of a good hash function changes in about half of its bits in response to a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. @rsp points out the birthday problem, but given this, I'm not sure exactly how applicable it is, at least in its simple form. Also, it largely assumes that there are no other precautions in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
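As an illustration of the kind of precaution mentioned above, one option is to check each new identifier against the stored table and lengthen the prefix when it is already taken by a different URL. This is only a sketch under that assumption; the helper assign_short_id and the plain Hash used as a table are hypothetical, not something the question describes.

```ruby
require "digest/sha1"

# Hypothetical helper: picks the shortest prefix of the SHA1 hex digest
# (at least `min_length` characters) that is not already taken by a
# different URL in `table`, a Hash of short id => URL.
def assign_short_id(table, url, min_length = 8)
  digest = Digest::SHA1.hexdigest("salt-" + url)
  (min_length..digest.length).each do |length|
    candidate = digest[0, length]
    existing = table[candidate]
    if existing.nil? || existing == url
      table[candidate] = url   # free slot, or the same URL seen before
      return candidate
    end
  end
  raise "full digest collision for #{url}"
end

table = {}
puts assign_short_id(table, "http://example.com/")
```

With a check like this, a collision costs one extra character for the affected URL instead of silently mapping two URLs to the same identifier.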
If you use the binary output of SHA1 and Base64-encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities. Assuming that the number of inputs needed for a 50% probability of collision scales as 1.177*sqrt(N), a Base64-style encoding will require 256 times more inputs than the hex output before reaching a 50% chance of a collision.
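A small Ruby sketch of the two encodings side by side. The method names are made up for illustration, and the URL-safe alphabet ("-" and "_" in place of "+" and "/") is an assumption added here because the identifier ends up inside a URL.

```ruby
require "digest/sha1"

# 8 hex characters: the first 32 bits of the digest.
def hex_id(url)
  Digest::SHA1.hexdigest("salt-" + url)[0, 8]
end

# 8 URL-safe Base64 characters: the first 48 bits of the digest.
def base64_id(url)
  raw = Digest::SHA1.digest("salt-" + url)   # 20 raw bytes
  [raw].pack("m0").tr("+/", "-_")[0, 8]      # "m0" = Base64 without newlines
end

# Approximate number of inputs for a 50% collision chance: 1.177 * sqrt(N).
def inputs_for_half_collision(bits)
  1.177 * Math.sqrt(2.0**bits)
end

url = "http://example.com/some/long/path"
puts hex_id(url)
puts base64_id(url)
puts inputs_for_half_collision(32).round
puts inputs_for_half_collision(48).round
```

Under that approximation the 50% point moves from roughly 77,000 inputs for 8 hex characters to roughly 19.7 million for 8 Base64 characters, which is the factor of 256.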