A good hash function to use in an interview for integers and strings?
I have come across situations in interviews where I needed to use a hash function for integers or for strings. Which ones should we choose in such situations? I have gone wrong here because I end up choosing functions that generate a lot of collisions, yet good hash functions tend to be mathematical enough that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both kinds of input in an "interview situation"?
Comments (3)
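Here is a simple recipe from Effective Java, page 33:

1. Store some constant nonzero value, say, 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows:
      result = 31 * result + c;
3. Return result.
4. When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.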
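As a rough illustration of the recipe, here is a minimal sketch of a class whose hashCode follows those steps field by field. The class name Reading and all of its fields are hypothetical, chosen only so that each kind of field (boolean, int, long, double, nullable reference, array) appears once.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical value class used only to illustrate the recipe.
final class Reading {
    private final boolean valid;
    private final int id;
    private final long timestamp;
    private final double value;
    private final String label;   // may be null
    private final int[] samples;

    Reading(boolean valid, int id, long timestamp, double value,
            String label, int[] samples) {
        this.valid = valid;
        this.id = id;
        this.timestamp = timestamp;
        this.value = value;
        this.label = label;
        this.samples = samples.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Reading)) return false;
        Reading other = (Reading) o;
        return valid == other.valid
                && id == other.id
                && timestamp == other.timestamp
                && Double.compare(value, other.value) == 0
                && Objects.equals(label, other.label)
                && Arrays.equals(samples, other.samples);
    }

    @Override
    public int hashCode() {
        int result = 17;                                               // nonzero start
        result = 31 * result + (valid ? 1 : 0);                        // boolean
        result = 31 * result + id;                                     // int
        result = 31 * result + (int) (timestamp ^ (timestamp >>> 32)); // long
        long bits = Double.doubleToLongBits(value);                    // double -> long
        result = 31 * result + (int) (bits ^ (bits >>> 32));
        result = 31 * result + (label == null ? 0 : label.hashCode()); // reference
        result = 31 * result + Arrays.hashCode(samples);               // array
        return result;
    }
}
```

Two Reading objects built from the same arguments compare equal and return the same hash code, which is the property the recipe's final step asks you to check.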
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer - this is the standard hashCode() implementation in java.lang.Integer.
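For a quick whiteboard version of this, a minimal sketch might look like the following. The class and method names are made up for illustration; the string loop uses the multiply-by-31 polynomial that the Java documentation specifies for String.hashCode(), which should be adequate for most hash-table questions.

```java
// Hypothetical helper class; the names are illustrative only.
final class SimpleHashes {
    // Identity hash for an int, matching what Integer.hashCode() returns.
    static int hashInt(int value) {
        return value;
    }

    // Polynomial string hash: h = 31 * h + nextChar, starting from 0.
    // int overflow simply wraps around, which is fine for hashing.
    static int hashString(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashInt(42));
        System.out.println(hashString("interview"));
    }
}
```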
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
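In Java you would normally not write such a function yourself but call the platform's implementation. The sketch below assumes SHA-256 is an acceptable choice (the text above does not name a specific algorithm) and uses java.security.MessageDigest; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical wrapper class around the standard MessageDigest API.
final class Sha256Example {
    // Returns the SHA-256 digest of the UTF-8 bytes of input as lowercase hex.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```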
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
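The code that followed this sentence is not present here. As one possible illustration of the idea, and not necessarily the function the answer had in mind, a single step of a 32-bit xorshift generator can be used as a cheap integer scrambler. The shift amounts 13, 17 and 5 are the commonly used xorshift32 triple; the class name is hypothetical.

```java
// Hypothetical class: one xorshift32 step used as a quick integer hash.
final class XorShiftHash {
    static int hash(int a) {
        a ^= a << 13;
        a ^= a >>> 17;   // unsigned shift, so no sign bits are smeared in
        a ^= a << 5;
        return a;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            System.out.println(i + " -> " + hash(i));
        }
    }
}
```

One caveat with this particular sketch: 0 always maps to 0, so it is worth special-casing or offsetting the input if zero keys are common.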
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
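The example that followed is missing here, so the following is only a sketch of the rotate-and-XOR idea under stated assumptions: the rotation distance of 13 and the class and method names are arbitrary choices for illustration. Integer.rotateLeft is part of the standard library.

```java
// Hypothetical class sketching a rotate-then-XOR hash combiner.
final class RotateXorHash {
    // Rotate the running hash so earlier parts land on different bits,
    // then XOR in the hash of the next part.
    static int combine(int hashSoFar, int partHash) {
        return Integer.rotateLeft(hashSoFar, 13) ^ partHash;
    }

    // Applies the combiner to each character of a string in order.
    static int hashString(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = combine(h, s.charAt(i));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashString("ab"));
        System.out.println(hashString("ba"));
    }
}
```

Because of the rotation, the result depends on the order of the parts, so "ab" and "ba" hash differently, which a plain XOR of the parts would not achieve.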
Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port, etc.
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys differing by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit-shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
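That kind of test is easy to sketch. The code below flips each input bit in turn and measures what fraction of output bits change on average; an ideal hash should score close to 0.5. The class name, the sample count and the seed are hypothetical, and the mixer used for comparison is the 32-bit finalizer from MurmurHash3, picked here only as an assumed example of a well-mixing function.

```java
import java.util.Random;
import java.util.function.IntUnaryOperator;

// Hypothetical test harness for the "one input bit flips about half the output bits" property.
final class AvalancheCheck {
    // Assumed example mixer: the 32-bit finalizer from MurmurHash3.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Average fraction of the 32 output bits that flip when one input bit flips.
    static double flipFraction(IntUnaryOperator hash, int samples, long seed) {
        Random random = new Random(seed);
        long flipped = 0;
        for (int s = 0; s < samples; s++) {
            int key = random.nextInt();
            int base = hash.applyAsInt(key);
            for (int bit = 0; bit < 32; bit++) {
                int changed = hash.applyAsInt(key ^ (1 << bit));
                flipped += Integer.bitCount(base ^ changed);
            }
        }
        return flipped / (samples * 32.0 * 32.0);
    }

    public static void main(String[] args) {
        System.out.println("identity: " + flipFraction(x -> x, 1000, 42L));
        System.out.println("mixer:    " + flipFraction(AvalancheCheck::mix, 1000, 42L));
    }
}
```

The identity hash flips exactly one output bit per flipped input bit, so it scores 1/32, while a function with good mixing should land near one half.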
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great sizeof(int)-sized hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values, etc.

All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies.

Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....
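The table-lookup approach for RGB values described above can be sketched in Java roughly as follows. The class name, the packing of the colour as 0xRRGGBB in an int, and the fixed seed used to fill the tables are all assumptions made for this illustration.

```java
import java.util.Random;

// Hypothetical class: hash a 24-bit RGB value by XORing three table lookups.
final class RgbTableHash {
    // One table of 256 random ints per input byte (r, g, b).
    static final int[][] TABLE = new int[3][256];

    static {
        Random random = new Random(12345L); // arbitrary fixed seed for repeatability
        for (int[] row : TABLE) {
            for (int i = 0; i < row.length; i++) {
                row[i] = random.nextInt();
            }
        }
    }

    // rgb is assumed to be packed as 0xRRGGBB; any higher bits are ignored.
    static int hash(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return TABLE[0][r] ^ TABLE[1][g] ^ TABLE[2][b];
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(hash(0x336699)));
    }
}
```

Changing any one input byte swaps in an unrelated random int for that byte's table entry, which is why each output bit ends up with roughly an even chance of flipping.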