Checking for a string match with a hash, without examining the entire string
I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?
I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)
To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.
Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.
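The lookup described in the question (hash first, then length, then a full comparison) can be sketched as follows. This is a minimal illustration, not the asker's actual code; the `StringCache` class and its method names are invented for the example, and the djb2 variant here wraps to 32 bits the way a C `unsigned int` implementation would:

```python
def djb2(s: str) -> int:
    """djb2 hash, truncated to 32 bits (as a C implementation would wrap)."""
    h = 5381
    for byte in s.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


class StringCache:
    """Cache keyed by string; each entry stores (length, key, value) under its hash."""

    def __init__(self):
        self._buckets = {}  # hash -> list of (length, key, value)

    def put(self, key, value):
        self._buckets.setdefault(djb2(key), []).append((len(key), key, value))

    def get(self, key):
        h = djb2(key)   # one pass over the input
        n = len(key)    # length is known as a side effect of that pass
        for length, stored_key, value in self._buckets.get(h, []):
            # Cheap rejections first: length check, then the full comparison.
            if length == n and stored_key == key:
                return value
        return None
```

The full `stored_key == key` comparison at the end is exactly the step the question asks whether it can be eliminated.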
2 Answers
No, and there can't be. Think about it: The hash has a finite length, but the strings do not. Say for argument's sake that the hash is 32-bits. Can you create more than 2 billion unique strings with the same length? Of course you can - you can create an infinite number of unique strings, so comparing the hashes is not enough to guarantee uniqueness. This argument scales to longer hashes.
Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.
Some of the algorithms used for cyclic redundancy checks have guarantees like if there's exactly one bit different then the CRC is guaranteed to be different over a certain run length of bits, but that only works for relatively short runs.
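The pigeonhole argument above can be made concrete by shrinking the hash. The sketch below (illustrative only) truncates djb2 to 8 bits, so there are only 256 possible hash values; any 257 distinct strings of the same length are therefore guaranteed to contain a collision. The same reasoning applies to a 32-bit hash, just with larger numbers:

```python
def djb2_8bit(s: str) -> int:
    """djb2 truncated to 8 bits: only 256 possible hash values."""
    h = 5381
    for byte in s.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h & 0xFF


# 257 distinct same-length strings can only land in 256 buckets,
# so at least two of them must share a hash (pigeonhole principle).
strings = [f"key{i:04d}" for i in range(257)]

seen = {}
collision = None
for s in strings:
    h = djb2_8bit(s)
    if h in seen:
        collision = (seen[h], s)
        break
    seen[h] = s
```

After the loop, `collision` holds two different strings of equal length with the same 8-bit hash, which is why hash equality alone can never prove string equality.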
You should be safe from collisions if you use a modern hashing function such as one of the Secure Hash Algorithm (SHA) variants.
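A sketch of that approach using Python's `hashlib` (an assumption for illustration; the question's code may be in another language). Comparing the fixed-size 32-byte digests is O(1) regardless of string length, though computing a digest still requires one pass over the whole input, and the guarantee is "collisions are computationally infeasible to find," not a mathematical impossibility:

```python
import hashlib


def digest(s: str) -> bytes:
    """SHA-256 digest of a string; always 32 bytes, whatever the input length."""
    return hashlib.sha256(s.encode("utf-8")).digest()


# Equal strings always produce equal digests; for distinct strings,
# no SHA-256 collision has ever been found.
a = digest("hello world")
b = digest("hello world")
c = digest("hello worle")
```

Here `a == b` holds and `a != c`, so a digest comparison can stand in for the full string comparison if "cryptographically infeasible" is an acceptable substitute for "impossible."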