为什么 ValueType.GetHashCode() 是这样实现的?

发布于 2024-09-25 10:09:23 字数 920 浏览 6 评论 0原文

来自 ValueType.cs

**Action: Our algorithm for returning the hashcode is a little bit complex. We look 
**        for the first non-static field and get it's hashcode.  If the type has no 
**        non-static fields, we return the hashcode of the type. We can't take the
**        hashcode of a static member because if that member is of the same type as 
**        the original type, we'll end up in an infinite loop.

今天,当我使用 KeyValuePair 作为字典中的键(它存储 xml 属性名称(枚举)和它的值(字符串))时,我被这个问题困扰了,并且期望它根据其所有字段计算其哈希码,但根据实现,它只考虑了 Key 部分。

示例(来自 Linqpad 的 c/p):

void Main()
{
    var kvp1 = new KeyValuePair<string, string>("foo", "bar");
    var kvp2 = new KeyValuePair<string, string>("foo", "baz");

    // true
    (kvp1.GetHashCode() == kvp2.GetHashCode()).Dump();
}

我猜第一个非静态字段意味着声明顺序中的第一个字段,无论出于何种原因更改源中的变量顺序时,这也可能会导致麻烦,并且相信它不会在语义上更改代码。

From ValueType.cs

**Action: Our algorithm for returning the hashcode is a little bit complex. We look 
**        for the first non-static field and get it's hashcode.  If the type has no 
**        non-static fields, we return the hashcode of the type. We can't take the
**        hashcode of a static member because if that member is of the same type as 
**        the original type, we'll end up in an infinite loop.

I got bitten by this today when I was using a KeyValuePair as a key in a Dictionary (it stored xml attribute name (enum) and it's value (string)), and expected for it to have it's hashcode computed based on all its fields, but according to implementation it only considered the Key part.

Example (c/p from Linqpad):

void Main()
{
    var kvp1 = new KeyValuePair<string, string>("foo", "bar");
    var kvp2 = new KeyValuePair<string, string>("foo", "baz");

    // true
    (kvp1.GetHashCode() == kvp2.GetHashCode()).Dump();
}

The first non-static field I guess means the first field in declaratin order, which could also cause trouble when changing variable order in source for whatever reason, and believing it doesn't change the code semantically.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

池予 2024-10-02 10:09:23

ValueType.GetHashCode() 的实际实现与注释不太匹配。它有两个版本的算法,快速和慢速。它首先检查结构是否包含引用类型的任何成员以及字段之间是否有任何填充。填充是结构体值中的空白区域,是在 JIT 编译器对齐字段时创建的。包含 bool 和 int (3 个字节)的结构中有填充,但包含 int 和 int 时没有填充,它们紧密地组合在一起。

在没有引用和填充的情况下,它可以实现快速版本,因为结构值中的每一位都是属于字段值的位。它一次只是异或 4 个字节。您将获得考虑所有成员的“良好”哈希代码。 .NET 框架中的许多简单结构类型都以这种方式运行,例如 Point 和 Size。

如果测试失败,它就会执行慢速版本,道德上相当于反思。这就是你得到的,你的 KeyValuePair<>包含参考文献。正如评论所说,这个只检查第一个候选字段。这无疑是一种性能优化,避免浪费太多时间。

是的,令人讨厌的细节并不广为人知。当有人注意到他们的收集代码很烂时,通常会发现这一点。

另一个令人痛苦的细节是:快速版本有一个错误,当结构包含十进制类型的字段时,该错误会发生字节。值 12m 和 12.0m 在逻辑上相等,但它们不具有相同的位模式。 GetHashCode() 会说它们不相等。哎哟。

The actual implementation of ValueType.GetHashCode() doesn't quite match the comment. It has two versions of the algorithm, fast and slow. It first checks if the struct contains any members of a reference type and if there is any padding between the fields. Padding is empty space in a structure value, created when the JIT compiler aligns the fields. There's padding in a struct that contains bool and int (3 bytes) but no padding when it contains int and int, they fit snugly together.

Without a reference and without padding, it can do the fast version since every bit in the structure value is a bit that belongs to a field value. It simply xors 4 bytes at a time. You'll get a 'good' hash code that considers all the members. Many simple structure types in the .NET framework behave this way, like Point and Size.

Failing that test, it does the slow version, the moral equivalent of reflection. That's what you get, your KeyValuePair<> contains references. And this one only checks the first candidate field, like the comment says. This is surely a perf optimization, avoiding burning too much time.

Yes, nasty detail and not that widely known. It is usually discovered when somebody notices that their collection code sucks mud.

One more excruciating detail: the fast version has a bug that bytes when the structure contains a field of a type decimal. The values 12m and 12.0m are logically equal but they don't have the same bit pattern. GetHashCode() will say that they are not equal. Ouch.

一笑百媚生 2024-10-02 10:09:23

更新:这个答案(部分)是a的基础我写的博客文章更详细地介绍了 GetHashcode 的设计特征。感谢您提出有趣的问题!


我没有实施它,也没有与实施它的人交谈。但我可以指出一些事情。

(在继续之前,请注意,这里我专门谈论哈希码是为了平衡哈希表,其中表的内容是由非敌对用户选择的。哈希码用于数字签名、冗余检查或当一些用户对表提供者发起拒绝服务攻击时确保哈希表的良好性能超出了本讨论的范围。)

首先,正如 Jon 正确指出的那样,给定的算法确实实现了 GetHashCode 所需的契约。对于您的目的来说,这可能不是最佳选择,但它是合法的。 要求是比较相等的事物具有相等的哈希码。

那么除了这份合同之外,还有哪些“最好的东西”呢?一个好的哈希码实现应该是:

1)快速。非常快!请记住,哈希码的首要目的是快速在哈希表中找到一个相对空的槽。如果哈希码的 O(1) 计算实际上比简单查找所需的 O(n) 时间慢,那么哈希码解决方案就是净损失。

2) 对于给定的输入分布,在 32 位整数空间中分布良好。整数之间的分布越差,哈希表就越像简单的线性查找。

那么,考虑到这两个相互冲突的目标,您将如何为任意值类型制定哈希算法呢?任何花在保证良好分布的复杂哈希算法上的时间都是浪费的。

一个常见的建议是“对所有字段进行散列,然后将所得散列码异或在一起”。但这是回避问题的;仅当输入本身分布非常均匀且彼此不相关时,对两个 32 位整数进行异或才能得到良好的分布,而这是不太可能的情况:

// (Updated example based on good comment!)
struct Control
{
    string name;
    int x;
    int y;
}

x 和 y 在 32 的整个范围内均匀分布的可能性是多少位整数?非常低。它们并且彼此接近的可能性要大得多,在这种情况下,将它们的哈希码异或在一起会使事情更糟,而不是< em>更好。将彼此接近的整数异或在一起会将大部分位清零。

此外,字段数量为 O(n)!具有许多小字段的值类型将花费相对较长的时间来计算哈希码。

基本上我们所处的情况是用户自己没有提供哈希码实现;要么他们不关心,要么他们不希望这种类型被用作哈希表中的键。鉴于您没有关于该类型的任何语义信息,那么最好的做法是什么?最好的办法就是速度快并且大多数时候都能得到好的结果。

大多数时候,两个不同的结构体实例在其大部分字段中会有所不同,而不仅仅是一个字段,因此只需选择其中一个并希望它是一个不同的似乎是合理的。

大多数时候,两个不同的结构实例在其字段中会有一些冗余,因此将许多字段的哈希值组合在一起可能会减少而不是增加哈希值中的熵,即使它消耗了计算哈希值的时间。哈希算法是为了保存而设计的。

将此与 C# 中匿名类型的设计进行比较。对于匿名类型,我们确实知道该类型很可能被用作表的键。我们确实知道匿名类型的实例之间很可能存在冗余(因为它们是笛卡尔积或其他连接的结果)。因此,我们确实将所有字段的哈希码合并为一个哈希码。如果由于计算的哈希码数量过多而导致性能不佳,您可以自由使用自定义名义类型而不是匿名类型。

UPDATE: This answer was (in part) the basis of a blog article I wrote which goes into more details about the design characteristics of GetHashcode. Thanks for the interesting question!


I didn't implement it and I haven't talked to the people who did. But I can point out a few things.

(Before I go on, note that here I am specifically talking about hash codes for the purposes of balancing hash tables where the contents of the table are chosen by non-hostile users. The problems of hash codes for digital signing, redundancy checking, or ensuring good performance of a hash table when some of the users are mounting denial-of-service attacks against the table provider are beyond the scope of this discussion.)

First, as Jon correctly notes, the given algorithm does implement the required contract of GetHashCode. It might be sub-optimal for your purposes, but it is legal. All that is required is that things that compare equal have equal hash codes.

So what are the "nice to haves" in addition to that contract? A good hash code implementation should be:

1) Fast. Very fast! Remember, the whole point of the hash code in the first place is to rapidly find a relatively empty slot in a hash table. If the O(1) computation of the hash code is in practice slower than the O(n) time taken to do the lookup naively then the hash code solution is a net loss.

2) Well distributed across the space of 32 bit integers for the given distribution of inputs. The worse the distribution across the ints, the more like a naive linear lookup the hash table is going to be.

So, how would you make a hash algorithm for arbitrary value types given those two conflicting goals? Any time you spend on a complex hash algorithm that guarantees good distribution is time poorly spent.

A common suggestion is "hash all of the fields and then XOR together the resulting hash codes". But that is begging the question; XORing two 32 bit ints only gives good distribution when the inputs themselves are extremely well-distributed and not related to each other, and that is an unlikely scenario:

// (Updated example based on good comment!)
struct Control
{
    string name;
    int x;
    int y;
}

What is the likelihood that x and y are well-distributed over the entire range of 32 bit integers? Very low. Odds are much better that they are both small and close to each other, in which case xoring their hash codes together makes things worse, not better. xoring together integers that are close to each other zeros out most of the bits.

Furthermore, this is O(n) in the number of fields! A value type with a lot of small fields would take a comparatively long time to compute the hash code.

Basically the situation we're in here is that the user didn't provide a hash code implementation themselves; either they don't care, or they don't expect this type to ever be used as a key in a hash table. Given that you have no semantic information whatsoever about the type, what's the best thing to do? The best thing to do is whatever is fast and gives good results most of the time.

Most of the time, two struct instances that differ will differ in most of their fields, not just one of their fields, so just picking one of them and hoping that it's the one that differs seems reasonable.

Most of the time, two struct instances that differ will have some redundancy in their fields, so combining the hash values of many fields together is likely to decrease, not increase, the entropy in the hash value, even as it consumes the time that the hash algorithm is designed to save.

Compare this with the design of anonymous types in C#. With anonymous types we do know that it is highly likely that the type is being used as a key to a table. We do know that it is highly likely that there will be redundancy across instances of anonymous types (because they are results of a cartesian product or other join). And therefore we do combine the hash codes of all of the fields into one hash code. If that gives you bad performance due to the excess number of hash codes being computed, you are free to use a custom nominal type rather than the anonymous type.

时光倒影 2024-10-02 10:09:23

即使字段顺序发生变化,它仍然应该遵守 GetHashCode 的约定:在该进程的生命周期内,相同的值将具有相同的哈希码。

特别是:

  • 不相等的值不必具有不相等的哈希码
  • 哈希码不必在各个进程之间保持一致(您可以更改实现、重建,并且一切都应该仍然有效 - 您不应该坚持下去哈希码,基本上)

现在我并不是说 ValueType 的实现是一个好主意 - 它会以各种方式导致性能下降......但我不认为它实际上被破坏

It should still obey the contract of GetHashCode even if the field order changes: equal values will have equal hash codes, within the lifetime of that process.

In particular:

  • Non-equal values don't have to have non-equal hash codes
  • Hash codes don't have to be consistent across processes (you can change an implementation, rebuild, and everything should still work - you shouldn't be persisting hash codes, basically)

Now I'm not saying that ValueType's implementation is a great idea - it will cause performance suckage in various ways... but I don't think it's actually broken.

冧九 2024-10-02 10:09:23

嗯,任何 GetHashCode() 的实现都有利有弊。这些当然是我们在实现自己的方案时权衡的因素,但是对于 ValueType.GetHashCode() 来说,有一个特别的困难,因为他们没有太多关于实际细节的信息。具体类型将是。当然,当我们创建一个抽象类或一个旨在作为类的基础的类时,这种情况经常发生在我们身上,这些类会在状态方面添加更多内容,但在这些情况下,我们有一个明显的解决方案,即仅使用默认实现object.GetHashCode() 除非派生类愿意在那里重写它。

对于 ValueType.GetHashCode(),他们没有这种奢侈,因为值类型和引用类型之间的主要区别是,尽管谈论堆栈与堆的实现细节很流行,但事实上对于值类型,等价与值相关,而对于对象类型,等价与身份相关(即使对象通过重写 Equals()GetHashCode() 定义了不同形式的等价。 code> 引用相等的概念仍然存在并且仍然有用,

因此,对于 Equals() 方法,其实现是显而易见的;检查两个对象是否是同一类型,如果是则检查。另外,所有字段都是相等的(实际上,在某些情况下有一个按位比较的优化,但这是基于相同基本思想的优化)。

对于 GetHashCode() 该怎么办?他们可以做的一件事是在每个字段上进行某种先乘后加或先移位后异或。这可能会给出一个相当好的哈希码,但如果有很多字段,可能会很昂贵(不要介意,不建议使用具有很多字段的值类型,实现者必须考虑到它们仍然可以,而且确实甚至有时它可能是有意义的,尽管老实说我无法想象它既有意义又对散列它有意义的时候)。如果他们知道某些字段在实例之间很少有不同,他们就可以忽略这些字段,并且仍然拥有相当好的哈希码,同时速度也相当快。最后,他们可以忽略大多数字段,并希望他们不忽略的字段在大多数情况下的值会有所不同。他们选择了后者最极端的版本。

(当没有实例字段时做什么是另一回事,也是一个很好的选择,这样的值类型等于同一类型的所有其他实例,并且它们有一个与之匹配的哈希码)。

因此,如果您对第一个字段相同的大量值进行哈希处理(或者返回相同的哈希码),那么这种实现会很糟糕,但其他实现在其他情况下会很糟糕(Mono 将所有字段的哈希码异或在一起,在你的情况下更好,在其他情况下更糟)。

更改字段顺序的问题并不重要,因为散列码非常明确地声明为仅在进程的生命周期内保持有效,并且不适合大多数可以持久保存的情况(在某些缓存情况下可能很有用)如果代码更改后未正确找到内容也没有什么坏处)。

所以,虽然不是很好,但没有什么是完美的。这表明,在使用对象作为键时,必须始终考虑“平等”含义的两面。在您的情况下很容易修复它:

public class KVPCmp<TKey, TValue> : IEqualityComparer<KeyValuePair<TKey, TValue>>, IEqualityComparer
{
  bool IEqualityComparer.Equals(object x, object y)
  {
      if(x == null)
        return y == null;
      if(y == null)
        return false;
      if(!(x is KeyValuePair<TKey, TValue>) || !(y is KeyValuePair<TKey, TValue>))
        throw new ArgumentException("Comparison of KeyValuePairs only.");
      return Equals((KeyValuePair<TKey, TValue>) x, (KeyValuePair<TKey, TValue>) y);
  }
  public bool Equals(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
  {
      return x.Key.Equals(y.Key) && x.Value.Equals(y.Value);
  }
  public int GetHashCode(KeyValuePair<TKey, TValue> obj)
  {
      int keyHash = obj.GetHashCode();
      return ((keyHash << 16) | (keyHash >> 16)) ^ obj.Value.GetHashCode();
  }
  public int GetHashCode(object obj)
  {
      if(obj == null)
        return 0;
      if(!(obj is KeyValuePair<TKey, TValue>))
       throw new ArgumentException();
      return GetHashCode((KeyValuePair<TKey, TValue>)obj);
  }
}

在创建字典时使用它作为比较器,一切都应该很好(您实际上只需要通用比较器方法,但保留其余部分没有任何害处,有时可能会很有用)。

Well, there are pros and cons to any implementation of GetHashCode(). These are of course the things we weigh up when implementing our own, but in the case of ValueType.GetHashCode() there is a particular difficulty in that they don't have much information on what the actual details of the concrete type will be. Of course, this often happens to us when we are creating an abstract class or one intended to be the base of classes that will add a lot more in terms of state, but in those cases we have an obvious solution of just using the default implementation of object.GetHashCode() unless a derived class cares to override it there.

With ValueType.GetHashCode() they don't have this luxury as the primary difference between a value type and a reference type is, despite the popularity of talking about implementation details of stack vs. heap, the fact that for a value type equivalence relates to value while for an object type equivalence relates to identity (even when an object defines a different form of equivalence by overriding Equals() and GetHashCode() the concept of reference-equality still exists and is still useful.

So, for the Equals() method the implementation is obvious; check the two objects are the same type, and if it is then check also that all fields are equal (actually there's an optimisation which does a bitwise comparison in some cases, but that's an optimisation on the same basic idea).

What to do for GetHashCode()? There simply is no perfect solution. One thing they could do is some sort of mult-then-add or shift-then-xor on every field. That would likely give a pretty good hashcode, but could be expensive if there were a lot of fields (never mind that its not recommended to have value-types that have lots of fields, the implementer has to consider that they still can, and indeed there might even be times when it makes sense, though I honestly can't imagine a time where it both makes sense and it also makes sense to hash it). If they knew some fields were rarely different between instances they could ignore those fields and still have a pretty good hashcode, while also being quite fast. Finally, they can ignore most fields, and hope that the one(s) they don't ignore vary in value most of the time. They went for the most extreme version of the latter.

(The matter of what is done when there is no instance fields is another matter and a pretty good choice, such value types are equal to all other instances of the same type, and they've a hashcode that matches that).

So, it's an implementation that sucks if you are hashing lots of values where the first field is the same (or otherwise returns the same hashcode), but other implementations would suck in other cases (Mono goes for xoring all the fields' hashcodes together, better in your case, worse in others).

The matter of changing field order doesn't matter, as hashcode is quite clearly stated as only remaining valid for the lifetime of a process and not being suitable for most cases where they could be persisted beyond that (can be useful in some caching situations where it doesn't hurt if things aren't found correctly after a code change).

So, not great, but nothing would be perfect. It goes to show that one must always consider both sides of what "equality" means when using an object as a key. It's easily fixed in your case with:

public class KVPCmp<TKey, TValue> : IEqualityComparer<KeyValuePair<TKey, TValue>>, IEqualityComparer
{
  bool IEqualityComparer.Equals(object x, object y)
  {
      if(x == null)
        return y == null;
      if(y == null)
        return false;
      if(!(x is KeyValuePair<TKey, TValue>) || !(y is KeyValuePair<TKey, TValue>))
        throw new ArgumentException("Comparison of KeyValuePairs only.");
      return Equals((KeyValuePair<TKey, TValue>) x, (KeyValuePair<TKey, TValue>) y);
  }
  public bool Equals(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
  {
      return x.Key.Equals(y.Key) && x.Value.Equals(y.Value);
  }
  public int GetHashCode(KeyValuePair<TKey, TValue> obj)
  {
      int keyHash = obj.GetHashCode();
      return ((keyHash << 16) | (keyHash >> 16)) ^ obj.Value.GetHashCode();
  }
  public int GetHashCode(object obj)
  {
      if(obj == null)
        return 0;
      if(!(obj is KeyValuePair<TKey, TValue>))
       throw new ArgumentException();
      return GetHashCode((KeyValuePair<TKey, TValue>)obj);
  }
}

Use this as your comparator when creating your dictionary, and all should be well (you only need the generic comparator methods really, but leaving the rest in does no harm and can be useful to have sometimes).

以为你会在 2024-10-02 10:09:23

谢谢大家非常非常信息丰富的答案。我知道这个决定一定有一些理由,但我希望能更好地记录下来。我无法使用框架的 v4,因此没有 Tuple<>,这是我决定搭载 KeyValuePair 结构的主要原因。但我想没有捷径可走,我必须自己走捷径。再次感谢大家。

Thank you all on very, very informative answers. I knew that there had to be some rationale in that decision but I wish it was documented better. I'm not able to use v4 of the framework so there is no Tuple<>, and that was the main reason I decided to piggyback on KeyValuePair struct. But I guess there is no cutting corners and I'll have to roll my own. Once more, thank you all.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文