What is the reason behind Enum.hashCode()?
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's POV.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible": e.g., for an enum containing up to 16 elements and a HashMap of size 16, there would for sure be no collisions (sure, using an EnumMap is better, but sometimes that is not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
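The collision argument above can be checked with a small sketch. Since Enum.hashCode() is final, the proposed formula is written here as a static helper; the names StableEnumHash, Hex, stableHash and bucket are hypothetical, and the bucket computation assumes the spreading step used by OpenJDK's HashMap (h ^ (h >>> 16)), which is an implementation detail rather than a documented guarantee.

```java
import java.util.HashSet;
import java.util.Set;

final class StableEnumHash {
    // Hypothetical 16-constant enum, used only for illustration.
    enum Hex { H0, H1, H2, H3, H4, H5, H6, H7, H8, H9, HA, HB, HC, HD, HE, HF }

    // The hash proposed in the question. getDeclaringClass() is used instead of
    // getClass() (a deviation from the question) so that constants with their
    // own class bodies still share the same class name.
    static int stableHash(Enum<?> e) {
        return e.ordinal() ^ e.getDeclaringClass().getName().hashCode();
    }

    // Bucket index in a power-of-two table, assuming OpenJDK's HashMap spreading.
    static int bucket(int hash, int tableSize) {
        int spread = hash ^ (hash >>> 16);
        return spread & (tableSize - 1);
    }

    // Counts how many different buckets the 16 constants land in.
    static int distinctBuckets(int tableSize) {
        Set<Integer> seen = new HashSet<>();
        for (Hex h : Hex.values()) {
            seen.add(bucket(stableHash(h), tableSize));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(16)); // prints 16: every constant gets its own bucket
    }
}
```

The ordinals 0 to 15 differ only in their four lowest bits, and XOR with a per-class constant keeps them distinct, which is why all 16 buckets are used.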
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
- PROS
  - simplicity
- CONS
  - speed
  - more collisions (for any size of a HashMap)
  - non-determinism, which propagates to other objects, making them unusable for
    - deterministic simulations
    - ETag computation
    - hunting down bugs that depend e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, except maybe the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each instance would be even faster, although negligibly.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with hashCode performance in general. Even once I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
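To make the link between Enum.hashCode() and System.identityHashCode concrete, here is a minimal sketch. It only shows that the two return the same value for an enum constant; it does not measure speed. The class EnumIdentityHash and the enum Color are hypothetical names.

```java
final class EnumIdentityHash {
    // Hypothetical enum, used only for illustration.
    enum Color { RED, GREEN, BLUE }

    // Enum.hashCode() delegates to Object.hashCode(), which is the identity hash.
    static boolean usesIdentityHash(Enum<?> e) {
        return e.hashCode() == System.identityHashCode(e);
    }

    public static void main(String[] args) {
        for (Color c : Color.values()) {
            // The numeric value may differ from one JVM run to the next.
            System.out.println(c + " -> " + c.hashCode() + ", identity hash: " + usesIdentityHash(c));
        }
    }
}
```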
Comments (7)
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize, you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't enums have one?" Well, to begin with, you can have several distinct String instances representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by writing a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time was longer with your algorithm than with the current one, due to its additional complexity.
I asked the same question because I did not see this one: why does hashCode() in Enum refer to the Object hashCode() implementation instead of ordinal()?
I ran into this as a problem when defining my own hash function for an object that uses an enum's hashCode as one of its components. When checking the values in a Set of such objects returned by a function, I checked them in an order that I expected to stay the same: since I define the hashCode myself, I expected the elements to land in the same nodes of the tree. But since the hashCode returned by an enum changes from one start of the JVM to the next, this assumption was wrong, and the test would fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing hashCode for their objects realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
P.S. This was too big for a comment :)
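The workaround described in this answer might look like the following sketch. The class Card and the enum Suit are hypothetical; the point is only that the composite hash uses suit.ordinal() rather than suit.hashCode(), so it no longer depends on the identity hash.

```java
final class DeterministicCompositeHash {
    // Hypothetical enum and value class, used only for illustration.
    enum Suit { CLUBS, HEARTS, SPADES }

    static final class Card {
        private final Suit suit;
        private final int rank;

        Card(Suit suit, int rank) {
            this.suit = suit;
            this.rank = rank;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Card)) {
                return false;
            }
            Card other = (Card) o;
            return suit == other.suit && rank == other.rank;
        }

        @Override
        public int hashCode() {
            // ordinal() is the same in every run; suit.hashCode() is not.
            return 31 * suit.ordinal() + rank;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Card(Suit.HEARTS, 5).hashCode()); // prints 36 (31 * 1 + 5)
    }
}
```

One trade-off to keep in mind: ordinal() changes if the constants are reordered, so suit.name().hashCode() is an alternative when the hash has to survive such edits.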
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instances of the same enum constant within a single VM: not with reflection, and not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hash code is its address, since no other object can occupy the same address at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM, in memory, all references to the constant point to the same object, whatever its address is).
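The single-instance property can be observed with a small sketch: every way of looking up a constant yields the very same object, so == comparisons succeed. The class EnumSingleInstance and the enum Level are hypothetical names.

```java
final class EnumSingleInstance {
    // Hypothetical enum, used only for illustration.
    enum Level { LOW, HIGH }

    // Looks the constant up in three different ways and compares by reference.
    static boolean lookupsReturnSameInstance() {
        Level byName = Level.valueOf("HIGH");
        Level byGenericLookup = Enum.valueOf(Level.class, "HIGH");
        Level byArray = Level.values()[Level.HIGH.ordinal()];
        return byName == Level.HIGH
                && byGenericLookup == Level.HIGH
                && byArray == Level.HIGH;
    }

    public static void main(String[] args) {
        System.out.println(lookupsReturnSameInstance()); // prints true
    }
}
```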
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this, you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() is guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object [1] to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
[1] I thought it was clear enough: an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine for implementing it like this is the requirement for hashCode() and equals() to be consistent, together with the design goal that enums should be simple to use and compile-time constant (so they can be used as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default, reference-based Object.hashCode() behavior.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
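The serialization behaviour described here can be tried out with a round trip through Java's built-in object streams inside a single JVM. This is only a sketch; the class EnumSerializationRoundTrip, the enum Planet and the helper roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

final class EnumSerializationRoundTrip {
    // Hypothetical enum, used only for illustration.
    enum Planet { MERCURY, VENUS, EARTH }

    // Serializes a value to a byte array and reads it back.
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(Planet.EARTH);
        // The stream carries the constant's name; reading resolves it to the existing instance.
        System.out.println(copy == Planet.EARTH); // prints true
    }
}
```

Because the deserialized reference is the existing constant, equals(), "==" and hashCode() all behave as they would for the original.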