处理应该是唯一的值上几乎不可能的冲突
有许多系统依赖于某些特定值的唯一性。 我们会想到任何使用 GUID 的东西(例如 Windows 注册表或其他数据库),但也会想到从对象创建散列来识别它的东西,因此需要该散列是唯一的。
哈希表通常不介意两个对象是否具有相同的哈希,因为哈希只是用于将对象分解为类别,因此在查找时,不会查找表中的所有对象,而只会查找同一类别中的那些对象(桶)必须与搜索对象进行比较以确定其身份。
然而,其他实现(似乎)取决于独特性。 我的例子(这就是我提出这个问题的原因)是 Mercurial 的修订 ID。 Mercurial 邮件列表中的条目正确指出
变更集哈希的几率 第一次意外碰撞 十亿次提交基本上为零。 但 如果发生这种情况我们会注意到的。 和 你会因为这样的人而出名 意外破坏了 SHA1。
但即使是最小的可能性也不意味着不可能。 现在,我不想解释为什么完全可以依赖唯一性(这已经讨论过 此处 例如)。 这对我来说非常清楚。
相反,我想知道(也许通过您自己工作中的示例):
是否有任何最佳实践来涵盖这些不可能的情况?
是否应该忽略它们,因为特别强的太阳风更有可能导致硬盘读取错误?
是否应该忽略它们,因为是否应该至少对它们进行测试,即使只是失败,并向用户显示“我放弃,你已经完成了不可能的事情”?
或者这些情况是否应该得到妥善处理?
对我来说,尤其是以下内容很有趣,尽管它们有些敏感:
如果你不处理这些情况,你会如何应对不听从概率的直觉?
如果你确实处理过它们,考虑到还有更多可能的情况你没有处理过,比如超非婆婆,你如何(对自己和他人)证明这项工作的合理性?
There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.
A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.
Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states
The odds of the changeset hash
colliding by accident in your first
billion commits is basically zero. But
we will notice if it happens. And
you'll get to be famous as the guy who
broke SHA1 by accident.
But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.
Rather, I'd like to know (maybe by means of examples from your own work):
Are there any best practices as to covering these improbable cases anyway?
Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?
Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?
Or should even these cases get handled gracefully?
For me, especially the following are interesting, although they are somewhat touchy-feely:
If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernonva?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
答案是您不是在测试是否发现偶然发生的 GUID 冲突。 您正在测试是否发现由于 GUID 代码中的错误而发生的 GUID 冲突,或者您违反了 GUID 代码所依赖的先决条件(或被某些攻击者诱骗违反),例如在 V1 中,MAC地址是唯一的,并且时间会向前推移。 两者中的任何一个都比基于超新星的错误更有可能。
但是,并非 GUID 代码的每个客户端都应该测试其正确性,尤其是在生产代码中。 这就是单元测试应该做的事情,因此要权衡错过实际使用会捕获但单元测试没有捕获的错误的成本,以及始终对库进行事后猜测的成本。
另请注意,GUID 仅在生成 GUID 的每个人都合作的情况下才起作用。 如果您的应用程序在您控制的计算机上生成 ID,那么您可能根本不需要 GUID - 本地唯一的 ID(例如递增计数器)可能会满足您的要求。 显然,Mercurial 无法使用它,因此它使用哈希值,但最终 SHA-1 将遭受产生冲突(或者更糟糕的是,原像)的攻击,并且它们必须进行更改。
如果您的应用程序在您无法控制的计算机(例如客户端)上生成非哈希“GUID”,那么就不用担心意外冲突了,您会担心恶意客户端试图对您的服务器进行 DOS 造成的故意冲突。 无论如何,保护自己免受这种情况的影响可能会保护您免受事故的影响。
这个问题的答案可能是“不”。 如果您可以像哈希表一样优雅地处理 GUID 冲突,那么为什么还要为 GUID 烦恼呢? “标识符”的全部意义在于,如果两个事物具有相同的 ID,那么它们就是相同的。 如果您不想以相同的方式对待它们,只需首先像哈希表一样将它们定向到存储桶中,然后使用不同的方案(如哈希)。
The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.
However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.
Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.
The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).
给定一个好的 128 位哈希,与给定随机输入的特定哈希值发生冲突的可能性是:
1 / 2 ** 128
,大约等于3 * 10 ** -39
。可以使用用于解释 生日问题。
其中
!
表示阶乘函数。 然后,我们可以绘制随着样本数量增加而没有发生碰撞的概率:随机概率随着样本数量的增加,SHA-1 发生冲突。 http://img21.imageshack.us/img21/9186/sha1collision.png
在
10**17
和10**18
之间我们开始看到哈希值与10**19
哈希值发生碰撞的可能性从 0.001% 到 0.14%,最后达到 13%。 因此,在一个拥有一百万、十亿条记录的系统中,依赖唯一性可能是不明智的(这样的系统是可以想象的),但在绝大多数系统中,冲突的概率非常小,您可以依赖哈希值的唯一性出于所有实际目的。现在,抛开理论不谈,碰撞更有可能是通过错误或某人攻击您的系统而引入您的系统,因此 onebyone 的答案提供了检查碰撞的充分理由,即使意外碰撞的可能性微乎其微(即也就是说,错误或恶意的可能性比意外碰撞的可能性要高得多)。
Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:
1 / 2 ** 128
which is approximately equal to3 * 10 ** -39
.The probability of seeing no collisions (
p
) givenn
samples can be computed using the logic used to explain the birthday problem.where
!
denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png
Between
10**17
and10**18
hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with10**19
hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).