当前位置：文江博客话题详情

为什么 C# 不实现集合的 GetHashCode？

发布于 2024-09-02 12:20:30 字数 240 浏览 16 评论 0原文

我正在将一些东西从 Java 移植到 C#。在 Java 中，ArrayList 的 hashcode 取决于其中的项。在 C# 中，我总是从 List 中获取相同的哈希码...

这是为什么？

对于我的某些对象，哈希码需要不同，因为它们的列表属性中的对象使对象不相等。我希望哈希码对于对象的状态始终是唯一的，并且仅当对象相等时才等于另一个哈希码。我错了吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

江南月 2024-09-09 12:20:30

为了正常工作，哈希码必须是不可变的——对象的哈希码必须永远改变。

如果对象的哈希码确实发生更改，则包含该对象的任何字典都将停止工作。

由于集合不是不可变的，因此它们无法实现 GetHashCode。
相反，它们继承默认的 GetHashCode，它为对象的每个实例返回一个（希望）唯一的值。（通常基于内存地址）

回复收藏 0 原文

愿与i 2024-09-09 12:20:30

哈希码必须取决于所使用的相等性的定义，以便如果 A == B 则 A.GetHashCode() == B.GetHashCode() （但不一定是相反的情况） ; A.GetHashCode() == B.GetHashCode() 并不必然产生 A == B)。

默认情况下，值类型的相等性定义基于其值，引用类型的相等性定义基于其标识（即，默认情况下引用类型的实例仅等于其自身），因此默认的哈希码为值类型取决于它包含的字段的值*，而对于引用类型，它取决于标识。事实上，由于我们理想地希望不相等对象的哈希码不同，尤其是低位（最有可能影响重新哈希的值），因此我们通常想要两个等效但不相等的对象具有不同的哈希值。

由于对象将保持与自身相等，因此还应该清楚的是，即使对象发生了变化，GetHashCode() 的默认实现也将继续具有相同的值（即使对象的身份也不会发生变化）可变对象）。

现在，在某些情况下引用类型（或值类型）重新定义相等性。字符串就是一个例子，例如 "ABC" == "AB" + "C"。尽管比较了两个不同的字符串实例，但它们被认为是相等的。在这种情况下，必须重写GetHashCode()，以便该值与定义相等性的状态相关（在这种情况下，是包含的字符序列）。

虽然对不可变的类型执行此操作更常见，但由于多种原因，GetHashCode() 不依赖于不变性。相反，面对可变性，GetHashCode() 必须保持一致 - 更改我们用于确定散列的值，散列必须相应地更改。但请注意，如果我们使用这个可变对象作为使用哈希的结构的键，这是一个问题，因为改变对象会更改它应该存储的位置，而不是将其移动到该位置（这也是如此集合中对象的位置取决于其值的任何其他情况 - 例如，如果我们对列表进行排序，然后更改列表中的一项，则列表将不再排序）。然而，这并不意味着我们必须只在字典和哈希集中使用不可变对象。相反，它意味着我们不能改变这种结构中的对象，而使其不可变是保证这一点的明确方法。

事实上，在很多情况下，在这样的结构中存储可变对象是可取的，只要我们在这段时间内不改变它们，就可以了。由于我们没有不变性带来的保证，因此我们希望以另一种方式提供它（例如，在集合中花费很短的时间并且只能从一个线程访问）。

因此，键值的不变性是可能的情况之一，但通常只是一个想法。不过，对于定义哈希码算法的人来说，他们并不认为任何这种情况总是一个坏主意（他们甚至不知道当对象存储在这样的结构中时发生了突变）；他们需要实现在对象当前状态上定义的哈希码，无论在给定点调用它是否好。因此，例如，不应在可变对象上记忆哈希码，除非在每次变异时都清除了记忆。（无论如何，记忆散列通常是一种浪费，因为重复命中相同对象散列码的结构将有自己的记忆）。

现在，在当前的情况下，ArrayList 在基于身份的默认相等情况下进行操作，例如：

ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
  a.Add(i);
  b.Add(i);
}
return a == b;//returns false

现在，这实际上是一件好事。为什么？好吧，你怎么知道上面我们要认为 a 等于 b 呢？我们可能会这样做，但在其他情况下也有很多充分的理由不这样做。

更重要的是，将平等从基于身份重新定义为基于价值，比从基于价值到基于身份重新定义要容易得多。最后，对于许多对象来说，有不止一种基于值的相等定义（典型的情况是对字符串相等的不同观点），因此甚至没有一个唯一有效的定义。例如：

ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
  c.Add(i);
}

如果我们上面考虑了a == b，那么我们是否也应该考虑a == c？答案取决于我们在使用的平等定义中所关心的内容，因此框架无法知道所有情况的正确答案是什么，因为并非所有情况都一致。

现在，如果我们确实关心给定情况下基于价值的平等，我们有两个非常简单的选择。第一个是子类化并覆盖相等性：

public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
  /*.. most methods left out ..*/
  public Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
  {
    if(other == null)
      return false;
    if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
      return true;
    if(Count != other.Count)
      return false;
    for(int i = 0; i != Count; ++i)
      if(this[i] != other[i])
        return false;
    return true;
  }
  public override bool Equals(object other)
  {
    return Equals(other as ValueEqualList);
  }
  public override int GetHashCode()
  {
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

这假设我们总是希望以这种方式处理此类列表。我们还可以针对给定情况实现 IEqualityComparer：

public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
  public bool Equals(ArrayList x, ArrayList y)
  {
    if(ReferenceEquals(x, y))
      return true;
    if(x == null || y == null || x.Count != y.Count)
      return false;
    for(int i = 0; i != x.Count; ++i)
      if(x[i] != y[i])
        return false;
    return true;
  }
  public int GetHashCode(ArrayList obj)
  {
    int res = 0x2D2816FE;
    foreach(var item in obj)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

总而言之：

引用类型的默认相等定义仅取决于标识。
大多数时候，我们都想要这样。
当定义该类的人决定这不是想要的时，他们可以覆盖此行为。
当使用该类的人再次想要不同的相等定义时，他们可以使用 IEqualityComparer 和 IEqualityComparer，以便字典、哈希图、哈希集等使用它们的平等的概念。
当一个对象是基于哈希的结构的关键时，改变它是灾难性的。不变性可用于确保这种情况不会发生，但不是强制性的，也不总是可取的。

总而言之，该框架为我们提供了很好的默认值和详细的覆盖可能性。

*在结构中存在小数的情况下存在错误，因为在某些情况下，当安全时，结构会使用快捷方式，而在其他情况下则不然，但是，当包含小数的结构是一种情况时，快捷方式会被使用。 cut 不安全，它被错误地识别为安全的情况。

Hashcodes must depend upon the definition of equality being used so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the inverse; A.GetHashCode() == B.GetHashCode() does not entail A == B).

By default, the equality definition of a value type is based on its value, and of a reference type is based on it's identity (that is, by default an instance of a reference type is only equal to itself), hence the default hashcode for a value type is such that it depends on the values of the fields it contains* and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.

Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).

Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).

While it is more common to do this with types that also are immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability - change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.

Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).

Hence immutability of key values is one of those cases where something is possible, but generally a idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it in a given point is good or not. Hence for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutate. (It's generally a waste to memoise hashes anyway, as structures that hit the same objects hashcode repeatedly will have their own memoisation of it).

Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:

ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
  a.Add(i);
  b.Add(i);
}
return a == b;//returns false

Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.

What's more, it's much easier to redefine equality from identity-based to value-based, than from value-based to identity-based. Finally, there are more than one value-based definitions of equality for many objects (classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:

ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
  c.Add(i);
}

If we considered a == b above, should we consider a == c aslo? The answer depends on just what we care about in the definition of equality we are using, so the framework could't know what the right answer is for all cases, since all cases don't agree.

Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:

public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
  /*.. most methods left out ..*/
  public Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
  {
    if(other == null)
      return false;
    if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
      return true;
    if(Count != other.Count)
      return false;
    for(int i = 0; i != Count; ++i)
      if(this[i] != other[i])
        return false;
    return true;
  }
  public override bool Equals(object other)
  {
    return Equals(other as ValueEqualList);
  }
  public override int GetHashCode()
  {
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:

public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
  public bool Equals(ArrayList x, ArrayList y)
  {
    if(ReferenceEquals(x, y))
      return true;
    if(x == null || y == null || x.Count != y.Count)
      return false;
    for(int i = 0; i != x.Count; ++i)
      if(x[i] != y[i])
        return false;
    return true;
  }
  public int GetHashCode(ArrayList obj)
  {
    int res = 0x2D2816FE;
    foreach(var item in obj)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

In summary:

The default equality definition of a reference type is dependant upon identity alone.
Most of the time, we want that.
When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so their that dictionaries, hashmaps, hashsets, etc. use their concept of equality.
It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used of ensure this doesn't happen, but is not compulsory, nor always desirable.

All in all, the framework gives us nice defaults and detailed override possibilities.

*There is a bug in the case of a decimal within a struct, because there is a short-cut used in some cases with stucts when it is safe and not othertimes, but while a struct containing a decimal is one case when the short-cut is not safe, it is incorrectly identified as a case where it is safe.

回复收藏 0 原文