Is it possible to combine hash codes for private members to generate a new hash code?
I have an object for which I want to generate a unique hash (by overriding GetHashCode()), but I want to avoid overflows or anything unpredictable.
The code should be the result of combining the hash codes of a small collection of strings.
The hash codes will be used to generate a cache key, so ideally they should be unique. However, the number of possible values being hashed is small, so I THINK probability is in my favour here.
Would something like this be sufficient, AND is there a better way of doing this?
int hash = 0;
foreach (string item in collection) {
    hash += item.GetHashCode() / collection.Count;
}
return hash;
EDIT: Thanks for the answers so far.
@Jon Skeet: No, order is not important.
I guess this is almost another question, but since I am using the result to generate a cache key (a string), would it make sense to use a cryptographic hash function like MD5, or just use the string representation of this int?
The fundamentals pointed out by Marc and Jon are not bad, but they are far from optimal in terms of how evenly the results are distributed. Sadly, the 'multiply by primes' approach copied by so many people from Knuth is not the best choice: in many cases better distribution can be achieved by functions that are cheaper to calculate (though the difference is very slight on modern hardware). In fact, throwing primes into many aspects of hashing is no panacea.
If this data is used for significantly sized hash tables, I recommend reading Bret Mulvey's excellent study and explanation of various modern (and not so modern) hashing techniques, handily done in C#.
Note that the behaviour of various hash functions with strings is heavily biased by whether the strings are short (roughly speaking, how many characters are hashed before the bits begin to overflow) or long.
One of the simplest and easiest to implement is also one of the best: the Jenkins One-at-a-Time hash.
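As a minimal sketch of the Jenkins One-at-a-Time hash in safe C# (not the code originally posted with this answer): the class name `OneAtATime` and its method names are assumptions made for illustration, and feeding each UTF-16 code unit as two bytes, low byte first, is one possible choice rather than a requirement of the algorithm.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class OneAtATime
{
    // Per-byte mixing step of the One-at-a-Time hash.
    public static uint AddByte(uint h, byte b)
    {
        unchecked
        {
            h += b;
            h += h << 10;
            h ^= h >> 6;
            return h;
        }
    }

    // Mixes every UTF-16 code unit of a string, low byte first (an assumed byte order).
    public static uint AddString(uint h, string s)
    {
        foreach (char c in s)
        {
            h = AddByte(h, (byte)(c & 0xFF));
            h = AddByte(h, (byte)(c >> 8));
        }
        return h;
    }

    // Final avalanche step; the 32-bit state is reinterpreted as a signed int.
    public static int Finish(uint h)
    {
        unchecked
        {
            h += h << 3;
            h ^= h >> 11;
            h += h << 15;
            return (int)h;
        }
    }

    // Order-sensitive hash over a collection of strings.
    public static int Hash(IEnumerable<string> items)
    {
        uint h = 0;
        foreach (string item in items)
        {
            h = AddString(h, item);
        }
        return Finish(h);
    }
}
```

A `GetHashCode()` override could return `OneAtATime.Hash(collection)`. Because the state is threaded through every item, the result depends on the order of the items.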
You can then use this like so:
You can merge multiple different types like so:
If you only have access to the field as an object with no knowledge of the internals you can simply call GetHashCode() on each one and combine that value like so:
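A rough sketch of that idea, assuming the fields are available only as `object` references: each field's `GetHashCode()` result is fed through the One-at-a-Time mixing steps four bytes at a time. The name `ObjectHashCombiner` and the choice to treat `null` as 0 are assumptions for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class ObjectHashCombiner
{
    public static int Combine(IEnumerable<object> fields)
    {
        unchecked
        {
            uint h = 0;
            foreach (object field in fields)
            {
                // Assumption: a null field contributes the value 0.
                uint code = field == null ? 0u : (uint)field.GetHashCode();

                // Feed the four bytes of the hash code, lowest byte first.
                for (int shift = 0; shift < 32; shift += 8)
                {
                    h += (code >> shift) & 0xFFu;
                    h += h << 10;
                    h ^= h >> 6;
                }
            }

            // Final avalanche step.
            h += h << 3;
            h ^= h >> 11;
            h += h << 15;
            return (int)h;
        }
    }
}
```

The quality of the combined value still depends on each field's own `GetHashCode()` implementation; this only mixes whatever those calls return.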
Sadly you can't do sizeof(T) so you must do each struct individually.
If you wish to use reflection you can construct on a per type basis a function which does structural identity and hashing on all fields.
If you wish to avoid unsafe code then you can use bit masking techniques to pull out individual bits from ints (and chars if dealing with strings) with not too much extra hassle.
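The bit-masking route might look something like the sketch below, which pulls bytes out of `int`, `long` and `char` values with shifts and masks instead of pointers, so several field types can be merged into one hash. The struct name `OneAtATimeBuilder` and its members are hypothetical, and the low-byte-first order is an assumption.

```csharp
using System;

// Hypothetical mutable builder, named for illustration only.
public struct OneAtATimeBuilder
{
    private uint h;

    // Per-byte mixing step of the One-at-a-Time hash.
    public void Add(byte b)
    {
        unchecked
        {
            h += b;
            h += h << 10;
            h ^= h >> 6;
        }
    }

    // Extracts the four bytes of an int with shifts and masks.
    public void Add(int value)
    {
        uint v = unchecked((uint)value);
        for (int shift = 0; shift < 32; shift += 8)
        {
            Add((byte)((v >> shift) & 0xFFu));
        }
    }

    // Extracts the eight bytes of a long the same way.
    public void Add(long value)
    {
        ulong v = unchecked((ulong)value);
        for (int shift = 0; shift < 64; shift += 8)
        {
            Add((byte)((v >> shift) & 0xFFul));
        }
    }

    // Each char contributes its low byte, then its high byte.
    public void Add(string value)
    {
        foreach (char c in value)
        {
            Add((byte)(c & 0xFF));
            Add((byte)(c >> 8));
        }
    }

    // Applies the final avalanche to a copy of the state, so more data can still be added.
    public int ToHashCode()
    {
        unchecked
        {
            uint r = h;
            r += r << 3;
            r ^= r >> 11;
            r += r << 15;
            return (int)r;
        }
    }
}
```

Because this is a mutable struct, it should be kept in a local variable and not copied between calls, otherwise the additions are applied to a copy.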
Hashes aren't meant to be unique - they're just meant to be well distributed in most situations. They're just meant to be consistent. Note that overflows shouldn't be a problem.
Just adding isn't generally a good idea, and dividing certainly isn't. Here's the approach I usually use:
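A minimal sketch of one common order-sensitive pattern: multiply a running value by a small prime and add each item's hash code. The seed 17, the multiplier 31, the null handling, and the name `CombineOrdered` are assumptions for illustration and are not taken from the text above.

```csharp
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class HashCombiner
{
    public static int CombineOrdered(IEnumerable<string> items)
    {
        // unchecked: arithmetic overflow wraps around instead of throwing.
        unchecked
        {
            int hash = 17; // assumed seed
            foreach (string item in items)
            {
                // Assumption: a null item contributes 0.
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            }
            return hash;
        }
    }
}
```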
If you're otherwise in a checked context, you might want to deliberately make it unchecked.
Note that this assumes that order is important, i.e. that { "a", "b" } should be different from { "b", "a" }. Please let us know if that's not the case.
There is nothing wrong with this approach as long as the members whose hashcodes you are combining follow the rules of hash codes. In short ...
If the order of the items is not important (i.e. {"a","b"} is the same as {"b","a"}) then you can use exclusive or to combine the hash codes:
[Edit: As Mark pointed out in a comment to a different answer, this has the drawback of also giving collections like {"a"} and {"a","b","b"} the same hash code.]
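The exclusive-or combination could be sketched as follows; the name `UnorderedHash` is hypothetical. Since XOR is commutative and any value XORed with itself is 0, the sketch shows both the order independence and the duplicate-item drawback noted in the edit.

```csharp
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class UnorderedHash
{
    public static int Combine(IEnumerable<string> items)
    {
        int hash = 0;
        foreach (string item in items)
        {
            // XOR is commutative, so the order of the items does not matter.
            hash ^= item.GetHashCode();
        }
        return hash;
    }
}
```

Here `Combine(new[] { "a", "b" })` equals `Combine(new[] { "b", "a" })`, and `Combine(new[] { "a" })` equals `Combine(new[] { "a", "b", "b" })` because the two "b" hash codes cancel each other out.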
If the order is important, you can instead multiply by a prime number and add:
(When you multiply you will sometimes get an overflow that is ignored, but by multiplying with a prime number you lose a minimum of information. If you instead multiplied with a number like 16, you would lose four bits of information each time, so after eight items the hash code from the first item would be completely gone.)
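The information-loss argument can be checked with a small experiment. This sketch uses plain `int` values in place of real hash codes so that the outcome is reproducible, and the name `MultiplierDemo` is hypothetical. With a multiplier of 16, the first value has been shifted out of the 32-bit result once eight more items have been folded in; with 31 it still influences the result.

```csharp
// Hypothetical demo, named for illustration only.
public static class MultiplierDemo
{
    public static int Combine(int[] codes, int multiplier)
    {
        // unchecked: overflow wraps around, as in the default C# context.
        unchecked
        {
            int hash = 0;
            foreach (int code in codes)
            {
                hash = hash * multiplier + code;
            }
            return hash;
        }
    }
}
```

For two nine-element arrays that differ only in their first element, `Combine(a, 16)` and `Combine(b, 16)` are equal, because the first element ends up multiplied by 16^8 = 2^32, which is 0 in 32-bit arithmetic. `Combine(a, 31)` and `Combine(b, 31)` differ, since 31^8 is odd and therefore invertible modulo 2^32.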