What is a good hash function to use in an interview for integers and strings?

Posted on 2024-11-08 17:15:19

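I have come across interview situations where I needed a hash function for integers or for strings. Which ones should we choose in such cases? I have got this wrong before, because I ended up choosing functions that generated a lot of collisions, but hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations, so that the interviewer is at least satisfied with your approach for integer or string inputs? Which functions would be adequate for both kinds of input in an "interview situation"?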


旧梦荧光笔 2024-11-15 17:15:19

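This is from Effective Java, page 33:

1. Store some constant nonzero value, say 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
  a. Compute an int hash code c for the field:
    i. If the field is a boolean, compute (f ? 1 : 0).
    ii. If the field is a byte, char, short, or int, compute (int) f.
    iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
    iv. If the field is a float, compute Float.floatToIntBits(f).
    v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
    vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
    vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
  b. Combine the hash code c computed in step 2.a into result as follows:
      result = 31 * result + c;
3. Return result.
4. When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.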
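
As a rough illustration of the recipe quoted above, here is a minimal sketch applied to a made-up class. The class name Measurement and all of its fields are hypothetical and chosen only to exercise the boolean, int, long, double, object-reference and array rules; nothing here comes from the book itself.

```java
import java.util.Arrays;

// Hypothetical value class, used only to illustrate the recipe.
final class Measurement {
    private final boolean valid;
    private final int id;
    private final long timestamp;
    private final double value;
    private final String label;   // may be null
    private final int[] samples;  // every element is significant

    Measurement(boolean valid, int id, long timestamp, double value,
                String label, int[] samples) {
        this.valid = valid;
        this.id = id;
        this.timestamp = timestamp;
        this.value = value;
        this.label = label;
        this.samples = samples;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Measurement)) return false;
        Measurement m = (Measurement) o;
        return valid == m.valid
                && id == m.id
                && timestamp == m.timestamp
                && Double.compare(value, m.value) == 0
                && (label == null ? m.label == null : label.equals(m.label))
                && Arrays.equals(samples, m.samples);
    }

    @Override
    public int hashCode() {
        int result = 17;                                        // nonzero start value
        result = 31 * result + (valid ? 1 : 0);                 // boolean
        result = 31 * result + id;                              // int
        result = 31 * result + (int) (timestamp ^ (timestamp >>> 32)); // long
        long bits = Double.doubleToLongBits(value);             // double -> long
        result = 31 * result + (int) (bits ^ (bits >>> 32));
        result = 31 * result + (label == null ? 0 : label.hashCode()); // reference
        result = 31 * result + Arrays.hashCode(samples);        // array
        return result;
    }
}
```

Every field used in equals also feeds hashCode, which is what keeps equal instances on equal hash codes.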


再见回来 2024-11-15 17:15:19

You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.

If it's for use in hashed data structures like hash maps, you want it to be as simple as possible (fast to execute) while avoiding collisions (the most common values should map to different hash values). A good example is an integer hashing to itself - this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are designed primarily so that it is hard to reverse the hash function or to find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation), then you can usually adapt a pseudo-random number generator to create these. My personal favourite is:

public static final int hash(int a) {
    // One round of shift-and-XOR mixing (an xorshift-style step).
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
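
To make the function above a little more concrete, here is a small sketch of how its output might be reduced to a bucket index. The class name XorShiftDemo and the helper bucketOf are hypothetical, and the choice of Math.floorMod is my assumption about how a caller would want to handle negative hash values.

```java
final class XorShiftDemo {
    // Same three shift-and-XOR steps as the function above.
    static int hash(int a) {
        a ^= (a << 13);
        a ^= (a >>> 17);
        a ^= (a << 5);
        return a;
    }

    // Hypothetical helper: map a key to a bucket index in [0, buckets).
    // floorMod is used because the hash may be negative.
    static int bucketOf(int key, int buckets) {
        return Math.floorMod(hash(key), buckets);
    }

    public static void main(String[] args) {
        for (int key = 0; key < 4; key++) {
            System.out.println(key + " -> " + hash(key)
                    + " -> bucket " + bucketOf(key, 16));
        }
    }
}
```

One thing worth knowing if you quote this in an interview: an input of 0 stays 0, because every step only XORs shifted copies of the value into itself.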

If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:

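public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        // Mix in the element's hash, then rotate so that element order
        // affects the result. Assumes no element is null.
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}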
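
For strings specifically, another combining scheme that is easy to recall is the multiply-by-31 polynomial, which is also how java.lang.String.hashCode is documented to work. The sketch below is only an illustration of that scheme; the class name PolyStringHash is hypothetical.

```java
final class PolyStringHash {
    // Polynomial hash: h = 31 * h + c for each character, starting from 0.
    // Integer overflow simply wraps around, which is fine for hashing.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc") == "abc".hashCode());
    }
}
```

Like the rotate-and-XOR version, this is order-sensitive, so "ab" and "ba" get different values.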

Note the above is not cryptographically secure, but it will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)
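
For the security case, you would not write the function yourself. A minimal sketch, assuming SHA-256 is an acceptable choice for the situation, using the JDK's MessageDigest (the class name Sha256Demo and the helper names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha256Demo {
    // Returns the 32-byte SHA-256 digest of the UTF-8 encoding of the input.
    static byte[] sha256(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK; rethrown as unchecked.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Hypothetical helper: lowercase hex rendering of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(sha256("abc")));
    }
}
```

This is far slower than the integer tricks above, which is why it is a poor fit for hash tables.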


欢烬 2024-11-15 17:15:19

For integers, I usually go with k % p, where p is the size of the hash table and is a prime number; for strings I use hashCode from the String class. Is this sufficient for an interview with a major tech company? – phoenix 2 days ago

Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port, etc.
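
To illustrate that dependence, here is a small sketch that counts how many distinct buckets a set of keys lands in when the key itself is used as the hash. The class name BucketCountDemo, the method names and the example keys are all hypothetical. It uses Math.floorMod rather than % because in Java k % p is negative for negative k.

```java
final class BucketCountDemo {
    // Bucket index in [0, buckets), also for negative keys.
    static int bucketOf(int key, int buckets) {
        return Math.floorMod(key, buckets);
    }

    // Number of distinct buckets used by the given keys.
    static int distinctBuckets(int[] keys, int buckets) {
        boolean[] used = new boolean[buckets];
        int count = 0;
        for (int key : keys) {
            int b = bucketOf(key, buckets);
            if (!used[b]) {
                used[b] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] keys = {0, 8, 16, 24, 32, 40, 48, 56}; // multiples of 8
        System.out.println("7 buckets: " + distinctBuckets(keys, 7));
        System.out.println("8 buckets: " + distinctBuckets(keys, 8));
    }
}
```

With a prime bucket count the multiples of 8 spread out, but if the table switches to 8 buckets they all land in the same one.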

Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm: for any two input keys that differ by as little as one bit, each bit of the output should have about a 50/50 chance of flipping. I found that quite counter-intuitive at first, because many of the hash functions I first saw used bit shifts and XOR, where a flipped input bit usually flipped just one output bit (typically in another bit position). The idea that one input bit should affect many output bits was a small revelation when I read it in one of Knuth's books. With this knowledge you are at least capable of testing and assessing specific implementations, regardless of how they work.
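
One way to test a candidate function against that ideal is to flip each input bit in turn and count how many output bits change. The sketch below is an assumption about how such a check could look, not something taken from Knuth; the names AvalancheCheck and averageFlippedBits are hypothetical. For a 32-bit output the ideal average is around 16.

```java
import java.util.function.IntUnaryOperator;

final class AvalancheCheck {
    // Average number of output bits that change when a single input bit
    // is flipped, taken over all 32 bit positions of every sample.
    static double averageFlippedBits(IntUnaryOperator hash, int[] samples) {
        long flipped = 0;
        for (int x : samples) {
            int base = hash.applyAsInt(x);
            for (int bit = 0; bit < 32; bit++) {
                int changed = hash.applyAsInt(x ^ (1 << bit));
                flipped += Integer.bitCount(base ^ changed);
            }
        }
        return (double) flipped / (samples.length * 32L);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 12345, -1};
        // The identity function changes exactly one output bit per input bit.
        System.out.println(averageFlippedBits(x -> x, samples));
    }
}
```

A result close to 1 means the function only moves bits around, which is the weakness described above.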

One approach I'll mention, because it achieves this ideal and is easy to remember, is to use each byte of the input to index a table of random ints. Its memory usage may make it slower than arithmetic approaches (or faster, depending on the hardware). For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value; indeed it is "perfect" if inputs are randomly scattered through the int values (rather than, say, incrementing; see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables, bit-shifting the values, and so on.
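
A minimal sketch of that table lookup for 24-bit RGB values follows. The class name RgbTabulationHash and the seeded java.util.Random used to fill the table are my assumptions; any source of random ints would serve.

```java
import java.util.Random;

final class RgbTabulationHash {
    // One row of 256 random ints per input byte (r, g, b).
    private final int[][] table = new int[3][256];

    RgbTabulationHash(long seed) {
        Random rng = new Random(seed);
        for (int[] row : table) {
            for (int i = 0; i < row.length; i++) {
                row[i] = rng.nextInt();
            }
        }
    }

    // Hash of a 24-bit RGB value packed as 0xRRGGBB.
    int hash(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return table[0][r] ^ table[1][g] ^ table[2][b];
    }

    public static void main(String[] args) {
        RgbTabulationHash h = new RgbTabulationHash(42L);
        System.out.println(Integer.toHexString(h.hash(0x123456)));
    }
}
```

The table must stay fixed for the lifetime of the hash table that uses it, otherwise the same key would hash to different values.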

All that said, you can sometimes do better than this randomising approach in specific cases where you know the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than the load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code, it picked ten letters evenly spaced along the text to use as inputs.
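
Both ideas can be sketched briefly. The code below is only an illustration under my own assumptions: the class and method names are hypothetical, the bucket count is assumed to be a power of two, and sampledHash is not the Visual C++ implementation, just a guess at the general idea of sampling at most ten evenly spaced characters.

```java
final class KnownPatternDemo {
    // Pass-through bucket index; assumes buckets is a power of two.
    static int passThroughBucket(int key, int buckets) {
        return key & (buckets - 1);
    }

    // Number of distinct buckets used by the contiguous keys from..to.
    static int distinctBuckets(int from, int to, int buckets) {
        boolean[] used = new boolean[buckets];
        int count = 0;
        for (int key = from; key <= to; key++) {
            int b = passThroughBucket(key, buckets);
            if (!used[b]) {
                used[b] = true;
                count++;
            }
        }
        return count;
    }

    // Quick-and-mediocre string hash: combines at most ten evenly spaced
    // characters, so characters between the samples are ignored.
    static int sampledHash(String s) {
        int step = Math.max(1, (s.length() + 9) / 10); // ceil(length / 10)
        int h = 0;
        for (int i = 0; i < s.length(); i += step) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(1, 100, 128)); // keys 1..100, 128 buckets
        System.out.println(sampledHash("hello") == "hello".hashCode());
    }
}
```

The trade-off shows up with long strings that differ only between the sampled positions: they collide, which is the "mediocre" part.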
