Java 中用于文本字符串的良好 64 位哈希函数是什么?
我正在寻找一个散列函数,它:
- 很好地散列文本字符串(例如很少冲突)
- 用Java编写,并且广泛使用
- 优点:适用于多个字段(而不是我连接它们并应用散列)在连接的字符串上)
- 额外:有一个 128 位变体。
- 优点:不占用 CPU 资源。
I'm looking for a hash function that:
- Hashes textual strings well (e.g. few collisions)
- Is written in Java, and widely used
- Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
- Bonus: Has a 128-bit variant.
- Bonus: Not CPU intensive.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
为什么不使用默认
String.hashCode()
的long
变体(其中一些非常聪明的人肯定会努力提高其效率 - 更不用说成千上万的开发人员的眼睛已经看过这段代码)?如果您正在寻找更多位,您可能可以使用BigInteger
编辑:
正如我在对 @brianegge 的答案的评论中提到的,超过 32 位的哈希值没有太多用例,并且很可能没有一个超过 64 位的哈希值:
要组合多个字段的哈希值,只需
执行异或将一乘以质数并将它们相加:小质数在那里以避免交换值的相同哈希码,即{'foo',' bar'} 和 {'bar','foo'} 不相等,应该有不同的哈希码。 XOR 是不好的,因为如果两个值相等它会返回 0。因此,{'foo','foo'} 和 {'bar','bar'} 将具有相同的哈希码。
Why don't you use a
long
variant of the defaultString.hashCode()
(where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?If you're looking for even more bits, you could probably use aBigInteger
Edit:
As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:
To combine the hash for several fields, simply
do an XORmultiply one with a prime and add them:The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.
今天(2018)的答案。 SipHash。
它将比这里的大多数答案快得多,并且质量明显高于所有答案。
Guava 库有一个: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
An answer for today (2018). SipHash.
It will be much faster than most of the answers here, and significantly higher quality than all of them.
The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
创建 SHA-1 哈希,然后屏蔽最低 64 位。
Create an SHA-1 hash and then mask out the lowest 64bits.
是的,前 32 位将为 0,但在遇到哈希冲突问题之前,您可能会耗尽硬件资源。 String 中的 hashCode 非常高效且经过充分测试。
更新
我认为上面的内容满足了可能工作的最简单的事情,但是,我同意 @sfussenegger 扩展现有 String hashCode 的想法。
除了为字符串提供良好的 hashCode 之外,您可能还需要考虑在实现中重新散列哈希代码。如果您的存储被其他开发人员使用,或与其他类型一起使用,这可以帮助分发您的密钥。例如,Java的HashMap是基于二次幂长度的哈希表,因此它添加了这个功能来保证低位得到充分分布。
Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.
Update
I think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.
In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.
为什么不使用 CRC64 多项式。这些是相当高效和优化的,以确保所有位都被计数并分布在结果空间上。
如果你用谷歌搜索“CRC64 Java”,网上有很多可用的实现
Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.
There are plenty of implementations available on the net if you google "CRC64 Java"
做这样的事情:
DataOutputStream 让你编写原语和字符串并将它们作为字节输出。将 ByteArrayOutputStream 包装在其中将使您写入字节数组,它与 MessageDigest< 很好地集成/a>.您可以从此处。
最后 BigInteger 将使您将输出字节转换为更易于使用的数字。 MD5 和 SHA1 算法都会生成 128 位哈希值,因此如果您需要 64 位哈希值,则可以直接截断。
SHA1 应该可以很好地散列几乎所有内容,并且很少发生冲突(它是 128 位)。这适用于 Java,但我不确定它是如何实现的。实际上可能相当快。它适用于我的实现中的多个字段:只需将它们全部推送到
DataOutputStream
上即可。您甚至可以使用反射和注释来完成此操作(也许@HashComponent(order=1)
来显示哪些字段进入哈希以及以什么顺序)。它有一个 128 位变体,我想您会发现它使用的 CPU 没有您想象的那么多。我已经使用这样的代码来获取巨大数据集(现在可能有数十亿个对象)的哈希值,以便能够将它们分片到许多后端存储中。它应该可以满足您的任何需要。请注意,我认为您可能只想调用
MessageDigest.getInstance()
一次,然后从那时起调用clone()
:IIRC 克隆速度要快得多。Do something like this:
DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.
Finally BigInteger will let you turn the output bytes into an easier-to-use number. The MD5 and SHA1 algorithms both produce 128-bit hashes, so if you need 64 you can just truncate.
SHA1 should hash almost anything well, and with infrequent collisions (it's 128-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the
DataOutputStream
and you're good to go. You could even do it with reflection and annotations (maybe@HashComponent(order=1)
to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call
MessageDigest.getInstance()
once and thenclone()
from then on: IIRC the cloning is a lot faster.反转字符串以获得另一个 32 位哈希码,然后将两者组合:
这是伪代码;
String.reverse()
方法不存在,需要通过其他方式实现。Reverse the string to get another 32-bit hashcode and then combine the two:
This is pseudocode; the
String.reverse()
method doesn't exist and will need to be implemented some other way.你看看 Apache commons lang< /a>?
但对于 64 位(和 128),您需要一些技巧:Joshua Bloch 所著的《Effective Java》一书中列出的规则可帮助您轻松创建 64 位哈希(只需使用 long 而不是 int)。对于 128 位,你需要额外的技巧......
Do you look at Apache commons lang?
But for 64 bit (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create 64 bit hash easy (just use long instead of int). For 128 bit you need additional hacks...
我建议使用 mzHash64,这是一个非常简单、快速的函数,其碰撞次数非常接近理想的通用哈希函数
来源: https://github.com/matteo65/mzHash64
32位版本:https://github.com /matteo65/mzHash32
I suggest using mzHash64, a very simple, fast function with a number of collisions very close to an ideal Universal Hash Function
Source: https://github.com/matteo65/mzHash64
32 bit version: https://github.com/matteo65/mzHash32
免责声明:如果您希望有效地散列单个自然语言单词,则适用此解决方案。它对于散列较长的文本或包含非字母字符的文本效率很低。
我不知道有什么函数,但这里有一个可能有帮助的想法:
然后,您可以使用剩余的 12 位对字符串长度(或其模值)进行编码,以进一步减少冲突,或使用传统哈希函数生成 12 位 hashCode。
假设您的输入仅是文本,我可以想象这会导致很少的冲突,并且计算成本低廉(O(n))。 与迄今为止的其他解决方案不同,此方法考虑了问题域以减少冲突 - 它基于《Programming Pearls》中描述的 Anagram Detector(请参阅 此处)。
DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.
I'm not aware of a function but here's an idea that might help:
You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.
Assuming your input is text-only I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike other solutions so far this approach takes the problem domain into account to reduce collisions - It is based off the Anagram Detector described in Programming Pearls (see here).