Java 中用于文本字符串的良好 64 位哈希函数是什么?

发布于 2024-08-10 07:19:22 字数 195 浏览 5 评论 0原文

我正在寻找一个散列函数,它:

  1. 很好地散列文本字符串(例如很少冲突)
  2. 用Java编写,并且广泛使用
  3. 优点:适用于多个字段(而不是我连接它们并应用散列)在连接的字符串上)
  4. 额外:有一个 128 位变体。
  5. 优点:不占用 CPU 资源。

I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

呆橘 2024-08-17 07:19:22

为什么不使用默认 String.hashCode()long 变体(其中一些非常聪明的人肯定会努力提高其效率 - 更不用说成千上万的开发人员的眼睛已经看过这段代码)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

如果您正在寻找更多位,您可能可以使用BigInteger
编辑:

正如我在对 @brianegge 的答案的评论中提到的,超过 32 位的哈希值没有太多用例,并且很可能没有一个超过 64 位的哈希值:

我可以想象一个分布在数十台服务器上的巨大哈希表,可能存储数百亿个映射。对于这种情况,@brianegge 在这里仍然有一个有效的观点:32 位允许 2^32(约 43 亿)个不同的哈希键。假设有一个强大的算法,您应该仍然会遇到很少的冲突。使用 64 位(18,446,744,073 亿个不同的密钥),无论您需要什么疯狂的场景,您都一定可以节省成本。不过,考虑 128 位密钥(340,282,366,920,938,463,463,374,607,4310 亿个可能的密钥)的用例几乎是不可能的。

要组合多个字段的哈希值,只需执行异或将一乘以质数并将它们相加:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

小质数在那里以避免交换值的相同哈希码,即{'foo',' bar'} 和 {'bar','foo'} 不相等,应该有不同的哈希码。 XOR 是不好的,因为如果两个值相等它会返回 0。因此,{'foo','foo'} 和 {'bar','bar'} 将具有相同的哈希码。

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger
Edit:

As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bit allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bit (18,446,744,073 billion different keys) your certainly save, regardless of whatever crazy scenario you need it for. Thinking of usecases for 128 bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hash for several fields, simply do an XOR multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.

娇纵 2024-08-17 07:19:22

今天(2018)的答案。 SipHash。

它将比这里的大多数答案快得多,并且质量明显高于所有答案。

Guava 库有一个: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--

An answer for today (2018). SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--

半城柳色半声笛 2024-08-17 07:19:22

创建 SHA-1 哈希,然后屏蔽最低 64 位。

Create an SHA-1 hash and then mask out the lowest 64bits.

埖埖迣鎅 2024-08-17 07:19:22
long hash = string.hashCode();

是的,前 32 位将为 0,但在遇到哈希冲突问题之前,您可能会耗尽硬件资源。 String 中的 hashCode 非常高效且经过充分测试。

更新
我认为上面的内容满足了可能工作的最简单的事情,但是,我同意 @sfussenegger 扩展现有 String hashCode 的想法。

除了为字符串提供良好的 hashCode 之外,您可能还需要考虑在实现中重新散列哈希代码。如果您的存储被其他开发人员使用,或与其他类型一起使用,这可以帮助分发您的密钥。例如,Java的HashMap是基于二次幂长度的哈希表,因此它添加了这个功能来保证低位得到充分分布。

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update
I think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
澜川若宁 2024-08-17 07:19:22

为什么不使用 CRC64 多项式。这些是相当高效和优化的,以确保所有位都被计数并分布在结果空间上。

如果你用谷歌搜索“CRC64 Java”,网上有很多可用的实现

Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java"

猫性小仙女 2024-08-17 07:19:22

做这样的事情:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123l;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream 让你编写原语和字符串并将它们作为字节输出。将 ByteArrayOutputStream 包装在其中将使您写入字节数组,它与 MessageDigest< 很好地集成/a>.您可以从此处。

最后 BigInteger 将使您将输出字节转换为更易于使用的数字。 MD5 和 SHA1 算法都会生成 128 位哈希值,因此如果您需要 64 位哈希值,则可以直接截断。

SHA1 应该可以很好地散列几乎所有内容,并且很少发生冲突(它是 128 位)。这适用于 Java,但我不确定它是如何实现的。实际上可能相当快。它适用于我的实现中的多个字段:只需将它们全部推送到 DataOutputStream 上即可。您甚至可以使用反射和注释来完成此操作(也许 @HashComponent(order=1) 来显示哪些字段进入哈希以及以什么顺序)。它有一个 128 位变体,我想您会发现它使用的 CPU 没有您想象的那么多。

我已经使用这样的代码来获取巨大数据集(现在可能有数十亿个对象)的哈希值,以便能够将它们分片到许多后端存储中。它应该可以满足您的任何需要。请注意,我认为您可能只想调用 MessageDigest.getInstance() 一次,然后从那时起调用 clone():IIRC 克隆速度要快得多。

Do something like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123l;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.

Finally BigInteger will let you turn the output bytes into an easier-to-use number. The MD5 and SHA1 algorithms both produce 128-bit hashes, so if you need 64 you can just truncate.

SHA1 should hash almost anything well, and with infrequent collisions (it's 128-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.

I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.

苏璃陌 2024-08-17 07:19:22

反转字符串以获得另一个 32 位哈希码,然后将两者组合:

String s = "astring";
long upper = ( (long) s.hashCode() ) << 32;
long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE );
long hash64 = upper + lower;

这是伪代码; String.reverse() 方法不存在,需要通过其他方式实现。

Reverse the string to get another 32-bit hashcode and then combine the two:

String s = "astring";
long upper = ( (long) s.hashCode() ) << 32;
long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE );
long hash64 = upper + lower;

This is pseudocode; the String.reverse() method doesn't exist and will need to be implemented some other way.

苯莒 2024-08-17 07:19:22

你看看 Apache commons lang< /a>?

但对于 64 位(和 128),您需要一些技巧:Joshua Bloch 所著的《Effective Java》一书中列出的规则可帮助您轻松创建 64 位哈希(只需使用 long 而不是 int)。对于 128 位,你需要额外的技巧......

Do you look at Apache commons lang?

But for 64 bit (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create 64 bit hash easy (just use long instead of int). For 128 bit you need additional hacks...

青朷 2024-08-17 07:19:22

我建议使用 mzHash64,这是一个非常简单、快速的函数,其碰撞次数非常接近理想的通用哈希函数

public static long mzHash64(byte[] data, int start, int length, long seed) {    
    long hash = 0xD45E69F901E72147L ^ seed;

    for(int i = 0; i < length; i++)
        hash = 0x3631754B22FF2D5CL * (i + data[start + i]) ^ (hash << 2) ^ (hash >>> 2);

    return hash;
}

public static long mzHash64(String str) {
    byte[] data = str.getBytes();
    return mzHash64(data, 0, data.length, 0);
}

来源: https://github.com/matteo65/mzHash64

32位版本:https://github.com /matteo65/mzHash32

I suggest using mzHash64, a very simple, fast function with a number of collisions very close to an ideal Universal Hash Function

public static long mzHash64(byte[] data, int start, int length, long seed) {    
    long hash = 0xD45E69F901E72147L ^ seed;

    for(int i = 0; i < length; i++)
        hash = 0x3631754B22FF2D5CL * (i + data[start + i]) ^ (hash << 2) ^ (hash >>> 2);

    return hash;
}

public static long mzHash64(String str) {
    byte[] data = str.getBytes();
    return mzHash64(data, 0, data.length, 0);
}

Source: https://github.com/matteo65/mzHash64

32 bit version: https://github.com/matteo65/mzHash32

草莓酥 2024-08-17 07:19:22

免责声明:如果您希望有效地散列单个自然语言单词,则适用此解决方案。它对于散列较长的文本或包含非字母字符的文本效率很低。

我不知道有什么函数,但这里有一个可能有帮助的想法:

  • 专用 64 位中的 52 位来表示哪些字母出现在细绳。例如,如果存在“a”,则设置位[0],对于“b”设置位1,对于“A”设置位[26]。这样,只有包含完全相同的一组字母的文本才会具有相同的“签名”。

然后,您可以使用剩余的 12 位对字符串长度(或其模值)进行编码,以进一步减少冲突,或使用传统哈希函数生成 12 位 hashCode。

假设您的输入仅是文本,我可以想象这会导致很少的冲突,并且计算成本低廉(O(n))。 与迄今为止的其他解决方案不同,此方法考虑了问题域以减少冲突 - 它基于《Programming Pearls》中描述的 Anagram Detector(请参阅 此处)。

DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.

I'm not aware of a function but here's an idea that might help:

  • Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' set bit 1, for 'A' set bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".

You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.

Assuming your input is text-only I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike other solutions so far this approach takes the problem domain into account to reduce collisions - It is based off the Anagram Detector described in Programming Pearls (see here).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文