为什么 C# 不实现集合的 GetHashCode?
我正在将一些东西从 Java 移植到 C#。在 Java 中,ArrayList
的 hashcode
取决于其中的项。在 C# 中,我总是从 List
中获取相同的哈希码...
这是为什么?
对于我的某些对象,哈希码需要不同,因为它们的列表属性中的对象使对象不相等。我希望哈希码对于对象的状态始终是唯一的,并且仅当对象相等时才等于另一个哈希码。我错了吗?
I am porting something from Java to C#. In Java the hashcode
of a ArrayList
depends on the items in it. In C# I always get the same hashcode from a List
...
Why is this?
For some of my objects the hashcode needs to be different because the objects in their list property make the objects non-equal. I would expect that a hashcode is always unique for the object's state and only equals another hashcode when the object is equal. Am I wrong?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
为了正常工作,哈希码必须是不可变的——对象的哈希码必须永远改变。
如果对象的哈希码确实发生更改,则包含该对象的任何字典都将停止工作。
由于集合不是不可变的,因此它们无法实现
GetHashCode
。相反,它们继承默认的 GetHashCode,它为对象的每个实例返回一个(希望)唯一的值。 (通常基于内存地址)
In order to work correctly, hashcodes must be immutable – an object's hash code must never change.
If an object's hashcode does change, any dictionaries containing the object will stop working.
Since collections are not immutable, they cannot implement
GetHashCode
.Instead, they inherit the default
GetHashCode
, which returns a (hopefully) unique value for each instance of an object. (Typically based on a memory address)哈希码必须取决于所使用的相等性的定义,以便如果
A == B
则A.GetHashCode() == B.GetHashCode()
(但不一定是相反的情况) ;A.GetHashCode() == B.GetHashCode()
并不必然产生A == B
)。默认情况下,值类型的相等性定义基于其值,引用类型的相等性定义基于其标识(即,默认情况下引用类型的实例仅等于其自身),因此默认的哈希码为值类型取决于它包含的字段的值*,而对于引用类型,它取决于标识。事实上,由于我们理想地希望不相等对象的哈希码不同,尤其是低位(最有可能影响重新哈希的值),因此我们通常想要两个等效但不相等的对象具有不同的哈希值。
由于对象将保持与自身相等,因此还应该清楚的是,即使对象发生了变化,
GetHashCode()
的默认实现也将继续具有相同的值(即使对象的身份也不会发生变化)可变对象)。现在,在某些情况下引用类型(或值类型)重新定义相等性。字符串就是一个例子,例如
"ABC" == "AB" + "C"
。尽管比较了两个不同的字符串实例,但它们被认为是相等的。在这种情况下,必须重写GetHashCode()
,以便该值与定义相等性的状态相关(在这种情况下,是包含的字符序列)。虽然对不可变的类型执行此操作更常见,但由于多种原因,
GetHashCode()
不依赖于不变性。相反,面对可变性,GetHashCode()
必须保持一致 - 更改我们用于确定散列的值,散列必须相应地更改。但请注意,如果我们使用这个可变对象作为使用哈希的结构的键,这是一个问题,因为改变对象会更改它应该存储的位置,而不是将其移动到该位置(这也是如此集合中对象的位置取决于其值的任何其他情况 - 例如,如果我们对列表进行排序,然后更改列表中的一项,则列表将不再排序)。然而,这并不意味着我们必须只在字典和哈希集中使用不可变对象。相反,它意味着我们不能改变这种结构中的对象,而使其不可变是保证这一点的明确方法。事实上,在很多情况下,在这样的结构中存储可变对象是可取的,只要我们在这段时间内不改变它们,就可以了。由于我们没有不变性带来的保证,因此我们希望以另一种方式提供它(例如,在集合中花费很短的时间并且只能从一个线程访问)。
因此,键值的不变性是可能的情况之一,但通常只是一个想法。不过,对于定义哈希码算法的人来说,他们并不认为任何这种情况总是一个坏主意(他们甚至不知道当对象存储在这样的结构中时发生了突变);他们需要实现在对象当前状态上定义的哈希码,无论在给定点调用它是否好。因此,例如,不应在可变对象上记忆哈希码,除非在每次变异时都清除了记忆。 (无论如何,记忆散列通常是一种浪费,因为重复命中相同对象散列码的结构将有自己的记忆)。
现在,在当前的情况下,ArrayList 在基于身份的默认相等情况下进行操作,例如:
现在,这实际上是一件好事。为什么?好吧,你怎么知道上面我们要认为 a 等于 b 呢?我们可能会这样做,但在其他情况下也有很多充分的理由不这样做。
更重要的是,将平等从基于身份重新定义为基于价值,比从基于价值到基于身份重新定义要容易得多。最后,对于许多对象来说,有不止一种基于值的相等定义(典型的情况是对字符串相等的不同观点),因此甚至没有一个唯一有效的定义。例如:
如果我们上面考虑了
a == b
,那么我们是否也应该考虑a == c
?答案取决于我们在使用的平等定义中所关心的内容,因此框架无法知道所有情况的正确答案是什么,因为并非所有情况都一致。现在,如果我们确实关心给定情况下基于价值的平等,我们有两个非常简单的选择。第一个是子类化并覆盖相等性:
这假设我们总是希望以这种方式处理此类列表。我们还可以针对给定情况实现 IEqualityComparer:
总而言之:
IEqualityComparer
和IEqualityComparer
,以便字典、哈希图、哈希集等使用它们的平等的概念。总而言之,该框架为我们提供了很好的默认值和详细的覆盖可能性。
*在结构中存在小数的情况下存在错误,因为在某些情况下,当安全时,结构会使用快捷方式,而在其他情况下则不然,但是,当包含小数的结构是一种情况时,快捷方式会被使用。 cut 不安全,它被错误地识别为安全的情况。
Hashcodes must depend upon the definition of equality being used so that if
A == B
thenA.GetHashCode() == B.GetHashCode()
(but not necessarily the inverse;A.GetHashCode() == B.GetHashCode()
does not entailA == B
).By default, the equality definition of a value type is based on its value, and of a reference type is based on it's identity (that is, by default an instance of a reference type is only equal to itself), hence the default hashcode for a value type is such that it depends on the values of the fields it contains* and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.
Since an object will remain equal to itself, it should also be clear that this default implementation of
GetHashCode()
will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example
"ABC" == "AB" + "C"
. Though there are two different instances of string compared, they are considered equal. In this caseGetHashCode()
must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).While it is more common to do this with types that also are immutable, for a variety of reasons,
GetHashCode()
does not depend upon immutability. Rather,GetHashCode()
must remain consistent in the face of mutability - change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).
Hence immutability of key values is one of those cases where something is possible, but generally a idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it in a given point is good or not. Hence for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutate. (It's generally a waste to memoise hashes anyway, as structures that hit the same objects hashcode repeatedly will have their own memoisation of it).
Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:
Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.
What's more, it's much easier to redefine equality from identity-based to value-based, than from value-based to identity-based. Finally, there are more than one value-based definitions of equality for many objects (classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:
If we considered
a == b
above, should we considera == c
aslo? The answer depends on just what we care about in the definition of equality we are using, so the framework could't know what the right answer is for all cases, since all cases don't agree.Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:
This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:
In summary:
IEqualityComparer<T>
andIEqualityComparer
so their that dictionaries, hashmaps, hashsets, etc. use their concept of equality.All in all, the framework gives us nice defaults and detailed override possibilities.
*There is a bug in the case of a decimal within a struct, because there is a short-cut used in some cases with stucts when it is safe and not othertimes, but while a struct containing a decimal is one case when the short-cut is not safe, it is incorrectly identified as a case where it is safe.
是的,你错了。在 Java 和 C# 中,相等意味着具有相同的哈希码,但反之则不一定成立。
有关详细信息,请参阅 GetHashCode。
Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.
See GetHashCode for more information.
哈希码不可能在大多数非平凡类的所有变体中都是唯一的。在 C# 中,列表相等的概念与 Java 中的不同(请参阅 此处),因此哈希码实现也不相同 - 它反映了 C# 列表相等性。
It is not possible for a hashcode to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same - it mirrors the C# List equality.
你只错了一部分。当您认为相等的哈希码意味着相等的对象时,您绝对错了,但是相等的对象必须具有相同的哈希码,这意味着如果哈希码不同,则对象也会不同。
You're only partly wrong. You're definitely wrong when you think that equal hashcodes means equal objects, but equal objects must have equal hashcodes, which means that if the hashcodes differ, so do the objects.
核心原因是性能和人性 - 人们倾向于认为哈希是快速的东西,但它通常需要至少遍历一次对象的所有元素。
示例:如果您使用字符串作为哈希表中的键,则每个查询的复杂度都是 O(|s|) - 使用 2 倍长的字符串,您的成本将至少是原来的两倍。想象一下,它是一棵完整的树(只是列表的列表) - 哎呀:-)
如果完整的深度哈希计算是集合上的标准操作,那么很大比例的程序员会不经意地使用它,然后归咎于框架和虚拟机速度慢。 对于像完全遍历这样昂贵的事情,程序员必须意识到其复杂性至关重要。实现这一点的唯一方法是确保您必须编写自己的。这也是一个很好的威慑:-)
另一个原因是更新战术。动态计算和更新哈希与每次都进行完整计算需要根据具体情况进行判断。
不变性只是学术上的逃避 - 人们使用散列作为更快地检测更改的一种方式(例如文件散列),并且还对不断变化的复杂结构使用散列。除了 101 条基础知识之外,哈希还有更多用途。 关键是,复杂对象的哈希所使用的内容必须根据具体情况进行判断。
使用对象的地址(实际上是一个句柄,因此在 GC 后它不会改变) )作为哈希,实际上是任意可变对象的哈希值保持相同的情况:-) C# 这样做的原因是它很便宜,并且再次促使人们计算自己的哈希值。
The core reasons are performance and human nature - people tend to think about hashes as something fast but it normally requires traversing all elements of an object at least once.
Example: If you use a string as a key in a hash table every query has complexity O(|s|) - use 2x longer strings and it will cost you at least twice as much. Imagine that it was a full blown tree (just a list of lists) - oops :-)
If full, deep hash calculation was a standard operation on a collection, enormous percentage of progammers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as full traversal it is crucial that a programmer has to be aware of the complexity. The only was to achieve that is to make sure that you have to write your own. It's a good deterrent as well :-)
Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on concrete case in hand.
Immutabilty is just an academic cop out - people do hashes as a way of detecting a change faster (file hashes for example) and also use hashes for complex structures which change all the time. Hash has many more uses beyong the 101 basics. The key is again that what to use for a hash of a complex object has to be a judgement call on a case by case basis.
Using object's address (actually a handle so it doesn't change after GC) as a hash is actually the case where the hash value remains the same for arbitrary mutable object :-) The reason C# does it is that it's cheap and again nudges people to calculate their own.
为什么太哲学了。创建辅助方法(可能是扩展方法)并根据需要计算哈希码。可能是 XOR 元素的哈希码
Why is too philosophical. Create helper method (may be extension method) and calculate hashcode as you like. May be XOR elements' hashcodes