Note: The following SO questions are related, but neither they nor the linked resources seem to fully answer my questions, particularly in relation to implementing equality tests for collections of objects.
Background
NSObject provides default implementations of -hash (which returns the address of the instance, like (NSUInteger)self) and -isEqual: (which returns NO unless the addresses of the receiver and the parameter are identical). These methods are designed to be overridden as necessary, but the documentation makes it clear that you should provide both or neither. Further, if -isEqual: returns YES for two objects, then the result of -hash for those objects must be the same. If not, problems can ensue when objects that should be the same (such as two string instances for which -compare: returns NSOrderedSame) are added to a Cocoa collection or compared directly.
Context
I develop CHDataStructures.framework, an open-source library of Objective-C data structures. I have implemented a number of collections, and am currently refining and enhancing their functionality. One of the features I want to add is the ability to compare one collection for equality with another.
Rather than comparing only memory addresses, these comparisons should consider the objects present in the two collections (including ordering, if applicable). This approach has quite a precedent in Cocoa, and generally uses a separate method, including the following:
I want to make my custom collections robust to tests of equality, so they may safely (and predictably) be added to other collections, and allow others (like an NSSet) to determine whether two collections are equal/equivalent/duplicates.
Problems
An -isEqualTo...: method works great on its own, but classes which define these methods usually also override -isEqual: to invoke [self isEqualTo...:] if the parameter is of the same class (or perhaps subclass) as the receiver, or [super isEqual:] otherwise. This means the class must also define -hash such that it will return the same value for disparate instances that have the same contents.
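As a rough illustration of the pattern just described, here is a minimal sketch. DemoList and -isEqualToDemoList: are hypothetical names made up for this example (they are not CHDataStructures classes), and the count-based -hash is only one choice that satisfies the rule, not a recommendation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical ordered collection, used only for illustration.
@interface DemoList : NSObject
- (void)addObject:(id)object;
- (BOOL)isEqualToDemoList:(DemoList *)other;
@end

@implementation DemoList {
    NSMutableArray *_items;  // backing store, in insertion order
}

- (instancetype)init {
    if ((self = [super init])) {
        _items = [[NSMutableArray alloc] init];
    }
    return self;
}

- (void)addObject:(id)object {
    [_items addObject:object];
}

// Typed comparison: same elements in the same order.
- (BOOL)isEqualToDemoList:(DemoList *)other {
    if (other == nil) return NO;
    return [_items isEqualToArray:other->_items];
}

// Generic entry point: use the typed method for compatible classes,
// otherwise fall back to NSObject's pointer comparison.
- (BOOL)isEqual:(id)object {
    if (object == self) return YES;
    if ([object isKindOfClass:[DemoList class]]) {
        return [self isEqualToDemoList:object];
    }
    return [super isEqual:object];
}

// Instances that -isEqual: treats as equal always have the same count,
// so this hash agrees with -isEqual: (at the cost of more collisions).
- (NSUInteger)hash {
    return [_items count];
}
@end
```

With both methods overridden, two separately built DemoList instances holding equal elements collapse to a single member when added to an NSSet.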
In addition, Apple's documentation for -hash stipulates the following: (emphasis mine)
"If a mutable object is added to a collection that uses hash values to determine the object's position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object's internal state information or you must make sure the object's internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)"
Edit: I definitely understand why this is necessary and totally agree with the reasoning — I mentioned it here to provide additional context, and skirted the topic of why it's the case for the sake of brevity.
All of my collections are mutable, and the hash will have to consider at least some of the contents, so the only option here is to consider it a programming error to mutate a collection stored in another collection. (My collections all adopt NSCopying, so collections like NSDictionary can successfully make a copy to use as a key, etc.)
It makes sense for me to implement -isEqual: and -hash, since (for example) an indirect user of one of my classes may not know the specific -isEqualTo...: method to call, or even care whether two objects are instances of the same class. They should be able to call -isEqual: or -hash on any variable of type id and get the expected result.
Unlike -isEqual: (which has access to two instances being compared), -hash must return a result "blindly", with access only to the data within a particular instance. Since it can't know what the hash is being used for, the result must be consistent for all possible instances that should be considered equal/identical, and must always agree with -isEqual:. (Edit: This has been debunked by the answers below, and it certainly makes life easier.) Further, writing good hash functions is non-trivial: guaranteeing uniqueness is a challenge, especially when you only have an NSUInteger (32/64 bits) in which to represent it.
Questions
- Are there best practices when implementing equality comparisons and -hash for collections?
- Are there any peculiarities to plan for in Objective-C and Cocoa-esque collections?
- Are there any good approaches for unit testing -hash with a reasonable degree of confidence?
- Any suggestions on implementing -hash to agree with -isEqual: for collections containing elements of arbitrary types? What pitfalls should I know about? (Edit: Not as problematic as I first thought: as @kperryua points out, "equal -hash values do not imply -isEqual:".)
Edit: I should have clarified that I'm not confused about how to implement -isEqual: or -isEqualTo...: for collections; that's straightforward. I think my confusion stemmed mainly from (mistakenly) thinking that -hash MUST return a different value if -isEqual: returns NO. Having done cryptography in the past, I was thinking that hashes for different values MUST be different. However, the answers below made me realize that a "good" hash function is really about minimizing bucket collisions and chaining for collections that use -hash. While unique hashes are preferable, they are not a strict requirement.
I think trying to come up with some generally useful hash function that will generate unique hash values for collections is an exercise in futility. U62's suggestion of combining the hashes of all the contents will not scale well, as it makes the hash function O(n). Hash functions should really be O(1) to ensure good performance, otherwise the purpose of the hash is defeated. (Consider the common Cocoa construct of plists, which are dictionaries containing arrays and other dictionaries, potentially ad nauseam. Attempting to take the hash of the top-level dictionary of a large plist would be excruciatingly slow if the collections' hash functions were O(n).)
My suggestion would be not to worry a great deal about a collection's hash. As you stated, -isEqual: implies equal -hash values. On the other hand, equal -hash values do not imply -isEqual:. That fact gives you a lot of leeway to create a simple hash.
If you're really worried about collisions though (and you have proof in concrete measurements of real-world situations that confirm it is something to be worried about), you could still follow U62's advice to some degree. For example, you could take the hash of, say, the first and/or last element in the collection, and combine that with, say, the -count of the collection. That should be enough to provide a decent hash.
I hope that answers at least one of your questions.
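For what it's worth, the cheap hash suggested above (the count combined with the hashes of the first and last elements) might be sketched like this. DemoCheapHash is a hypothetical helper written for this example, and the multiplier 31 is an arbitrary choice rather than anything Cocoa prescribes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: a constant-time hash for an ordered collection.
// It looks at no more than two elements, however large the array is.
static NSUInteger DemoCheapHash(NSArray *items) {
    NSUInteger count = [items count];
    NSUInteger hash = count;
    if (count > 0) {
        // Unsigned arithmetic wraps around, so overflow is harmless here.
        hash = hash * 31u + [[items objectAtIndex:0] hash];
        hash = hash * 31u + [[items lastObject] hash];
    }
    return hash;
}
```

Two arrays that are equal element by element share their count, first element and last element, so they get the same value as long as the elements' own -hash agrees with their -isEqual:. Arrays that differ only in the middle will collide, which is allowed.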
As for No. 1: Implementing -isEqual: is pretty cut and dried. You enumerate the contents, and check isEqual: on each of the elements.
There is one thing to be careful of that may affect what you decide to do for your collections' -hash functions. Clients of your collections must also understand the rules governing -isEqual: and -hash. If you use the contents' -hash in your collection's -hash, your collection will break if the contents' isEqual: and -hash don't agree. It's the client's fault, of course, but that's another argument against basing your -hash off of the collection's contents.
No. 2 is kind of vague. Not sure what you have in mind there.
Two collections should be considered equal if they contain the same elements and, if the collections are ordered, the elements are in the same order.
On the subject of hashes for collections, it should be enough to combine the hashes of the elements in some way (XOR them or modulo-add them). Note that while the rules state that two objects that are equal according to -isEqual: need to return the same hash, the converse does not hold: although uniqueness of hashes is desirable, it is not necessary for the correctness of the solution. Thus an ordered collection need not take account of the order of the elements.
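A minimal sketch of the XOR combination mentioned above, assuming every element's -hash agrees with its -isEqual:. DemoXorHash is a hypothetical helper name, not an existing Cocoa function.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: XOR together the hashes of all elements.
// XOR is commutative, so the result does not depend on element order.
// Caveat: a value that appears twice cancels itself out (x ^ x == 0).
static NSUInteger DemoXorHash(id<NSFastEnumeration> collection) {
    NSUInteger hash = 0;
    for (id element in collection) {
        hash ^= [element hash];
    }
    return hash;
}
```

Because the order is ignored, the same helper gives the same value for an ordered and an unordered collection with the same members. It visits every element, so its cost grows with the size of the collection.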
The excerpt from the Apple documentation is a necessary restriction, by the way. An object could not maintain the same hash value under mutation while also ensuring that objects with the same value have the same hash. That applies to the simplest of objects as well as to collections. Of course, a change in an object's hash usually only matters when the object is inside a container that uses the hash to organise its elements. The upshot of all this is that mutable collections shouldn't mutate when placed inside another container, but then neither should any object that has a true hash function.
I have done some investigation into the default -hash implementation of NSArray and NSMutableArray and (unless I have misunderstood something) it seems like Apple do not follow their own rules:
Here is my test code
The output is:
So it seems like the default implementation of the -hash method on both NSArray and NSMutableArray is the count of the array, and it doesn't care whether it's inside a collection or not.
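The test code and output referred to above are not reproduced here. As a stand-in, a hypothetical experiment along the same lines might look like the sketch below. DemoLogArrayHashes is an invented name, and the values it prints depend on the Foundation implementation in use, so no output is claimed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical experiment: log an array's hash before and after it is
// mutated while it is a member of an NSSet.
static void DemoLogArrayHashes(void) {
    NSMutableArray *array = [NSMutableArray arrayWithObjects:@"a", @"b", nil];
    NSLog(@"before: count=%lu hash=%lu",
          (unsigned long)[array count], (unsigned long)[array hash]);

    NSSet *set = [NSSet setWithObject:array];
    [array addObject:@"c"];  // mutation while the array is inside the set

    NSLog(@"after:  count=%lu hash=%lu",
          (unsigned long)[array count], (unsigned long)[array hash]);
    // If the hash changed, the set may no longer be able to find the array.
    NSLog(@"set still contains array: %d", (int)[set containsObject:array]);
}
```

Comparing the logged hash with the logged count on each line is one way to check the claim on a given system.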