译文 · 原文: Friday Q&A 2010-06-18: Implementing Equality and Hashing · 作者 Mike Ash
原文:https://www.mikeash.com/pyblog/friday-qa-2010-06-18-implementing-equality-and-hashing.html 发布:2010-06-18 作者:Mike Ash 译者:MiMo(mimo-v2.5-pro);代码块保留英文原样
欢迎回到周五问答的延期刊物。由于 WWDC 的推迟,本次发布推迟了一周,但现在终于又到了分享的时候。本周,我将探讨 Cocoa 中相等性(equality)和哈希(hashing)的实现,这个话题由一位读者提出。
相等性
对象相等性是一个基础概念,在各种场景中都会用到。在 Cocoa 中,它通过 isEqual: 方法实现。即使像 [array indexOfObject:] 这样简单的操作也会用到它,因此让你的对象支持它非常重要。
它如此重要,以至于 Cocoa 实际上在 NSObject 上为我们提供了默认实现。这个默认实现只是比较指针。换句话说,一个对象只等于它自己,永远不会等于另一个对象。该实现在功能上等同于:
- (BOOL)isEqual: (id)other { return self == other; }自定义相等性的实现(Implementing Custom Equality)
有时你需要更深层的相等性实现。对象(尤其是通常所称的” 值对象”(value object))很常见的情况是:它们虽然作为不同的对象存在,但在逻辑上却是相等的。例如:
// use mutable strings because that guarantees distinct objects NSMutableString *s1 = [NSMutableString stringWithString: @"Hello, world"]; NSMutableString *s2 = [NSMutableString stringWithFormat: @"%@, %@", @"Hello", @"world"]; BOOL equal = [s1 isEqual: s2]; // gives you YES! MyClass *c1 = ...; MyClass *c2 = ...; BOOL equal = [c1 isEqual: c2];测试对象的相等性在大多数情况下相当直接。收集你的类的相关属性,并对它们全部进行相等性测试。如果其中任何一个不相等,就返回 NO。否则,返回 YES。
这里有一个微妙之处,即你对象的所属类(class)本身也是需要测试的一个重要属性。将一个 MyClass 实例与一个 NSString 实例进行相等性测试是完全合法的,但该比较结果永远不应该返回 YES(当然,除非 MyClass 是 NSString 的子类)。
一个没那么微妙的点是,确保你只测试那些对相等性真正重要的属性。像缓存(caches)这样不影响对象外部可见值(externally-visible value)的状态不应被测试。
假设你的类是这样的:
@interface MyClass : NSObject { int _length; char *_data; NSString *_name; NSMutableDictionary *_cache; } - (BOOL)isEqual: (id)other { return ([other isKindOfClass: [MyClass class]] && [other length] == _length && memcmp([other data], _data, _length) == 0 && [[other name] isEqual: _name]) // note: no comparison of _cache }如果你熟悉哈希表的工作原理,可能会想跳过接下来的一两段。
哈希表(hash table)本质上是一个带有特殊索引机制的大数组。对象被放入数组中,其索引对应于它的哈希值(hash)。哈希值本质上是根据对象的属性生成的一个伪随机数。其设计意图是使索引足够随机,以降低两个对象拥有相同哈希值的可能性,同时又要保证该值完全可重现。当插入一个对象时,其哈希值用于确定存放位置。当查找一个对象时,其哈希值用于确定查找位置。
更正式地说,一个对象的哈希值被定义为:如果两个对象相等,那么它们具有相同的哈希值。请注意,反之则不成立,也不可能成立:两个对象可以拥有相同的哈希值却并不相等。你应尽量避免这种情况,因为当两个不相等的对象拥有相同的哈希值(称为冲突(collision))时,哈希表就必须采取特殊措施来处理,这会降低速度。然而,可以证明完全避免冲突是不可能的。
在 Cocoa 中,哈希功能通过 hash 方法来实现,其方法签名如下:
- (NSUInteger)hash; - (NSUInteger)hash { return (NSUInteger)self; }实现自定义哈希
由于 hash 的语义特性,如果你重写了 isEqual:,那么你必须同时重写 hash。如果不这样做,就会面临这样的风险:两个对象在相等判断下是相等的,但它们的哈希值(hash)却不同。如果你将这样的对象用于字典(dictionary)、集合(set)或其他使用哈希表(hash table)的容器中,那么就可能会引发意想不到的问题。
由于对象的哈希值定义与其相等判断的定义紧密相关,因此 hash 方法的实现也应紧密跟随 isEqual: 的实现逻辑。
这里的一个例外是,在定义 hash 时,没有必要包含你对象所属的类。这基本上是 isEqual: 中的一个保障措施,确保后续的相等性检查在用于不同类型的对象时仍然有意义。仅凭哈希不同的属性并使用不同的数学方法来组合它们,你的哈希值就很可能与另一个类的哈希值截然不同。
生成属性哈希值 测试属性的相等性通常是直接的,但为它们计算哈希值则并非总是如此。如何对一个属性进行哈希,取决于它是什么类型的对象。
对于一个数值型属性,其哈希值可以直接就是该数值本身。
对于对象类型的属性,你可以给该对象发送 hash 消息,然后使用它返回的值。
对于类似数据类型的属性,你需要使用某种哈希算法来生成哈希值。你可以使用 CRC32,或者甚至是像 MD5 这样完全大材小用的算法。另一种方法 —— 虽然速度稍慢但使用起来更方便 —— 是将数据包装成 NSData 对象并请求其哈希值,这本质上是将计算工作转交给了 Cocoa 框架。在上面的例子中,你可以这样计算 _data 的哈希值:
[[NSData dataWithBytes: _data length: _length] hash]最简单的方法是直接相加它们,或者利用按位异或(bitwise xor)的特性。然而,这会损害你哈希值的唯一性(hash’s uniqueness),因为这些操作是可交换的(symmetric),这意味着不同属性之间的区分度会丢失。举个例子,考虑一个包含姓和名的对象,其哈希实现如下:
- (NSUInteger)hash { return [_firstName hash] ^ [_lastName hash]; }如何最好地合并哈希值(hashes)是一个复杂的主题,没有唯一的答案。然而,任何一种不对称的合并方式都是一个良好的开端。我喜欢在使用异或(xor)的基础上加入位旋转(bitwise rotation)来合并它们:
#define NSUINT_BIT (CHAR_BIT * sizeof(NSUInteger)) #define NSUINTROTATE(val, howmuch) ((((NSUInteger)val) << howmuch) | (((NSUInteger)val) >> (NSUINT_BIT - howmuch)))
- (NSUInteger)hash { return NSUINTROTATE([_firstName hash], NSUINT_BIT / 2) ^ [_lastName hash]; } - (NSUInteger)hash { NSUInteger dataHash = [[NSData dataWithBytes: _data length: _length] hash]; return NSUINTROTATE(dataHash, NSUINT_BIT / 2) ^ [_name hash]; }子类化注意事项
在子类化一个实现了自定义相等性(equality)与哈希(hashing)的类时必须格外小心。特别是,你的子类不应暴露任何新的、影响相等性判断的属性(properties)。如果确实暴露了,那么它必须不与任何父类(superclass)的实例比较为相等。
要理解原因,可以考虑对姓 / 名类进行子类化,并添加一个生日属性,同时将其纳入相等性计算。然而,在与父类实例比较相等性时又无法包含它,因此其相等性方法将会是这样的:
- (BOOL)isEqual: (id)other { // if the superclass doesn't like it then we're not equal if(![super isEqual: other]) return NO;
// if it's not an instance of the subclass, then trust the superclass // it's equal there, so we consider it equal here if(![other isKindOfClass: [MySubClass class]]) return YES;
// it's an instance of the subclass, the superclass properties are equal // so check the added subclass property return [[other birthday] isEqual: _birthday]; }现在考虑一个” John Smith” 子类的实例,生日为 6/7/1994,我称之为 C。C 不等于 B,这符合我们的预期。C 等于 A,这也符合预期。但现在有一个问题。A 既等于 B 又等于 C,但 B 和 C 彼此不相等!这打破了相等运算符(equality operator)的标准传递性(transitivity),并导致极其意想不到的结果。
一般来说,这不应该是个大问题。如果你的子类添加了影响对象相等性(object equality)的属性,这可能表明你的层次结构存在设计问题。与其用奇怪的 isEqual:(方法实现)实现来绕过它,不如考虑重新设计你的类层次结构(class hierarchy)。
关于字典的说明
如果你想将对象用作 NSDictionary(字典类)中的键,你需要实现哈希(hashing)和相等性(equality),但你还需要实现 - copyWithZone:(方法实现)。这样做的技巧超出了今天文章的范围,但你应该意识到在这种情况下你需要做更多工作。
结论 Cocoa 提供了相等性(equality)和哈希处理(hashing)的默认实现,这对许多对象都适用。但如果你希望即使对象在内存中是不同的实例也被视为相等,就需要额外做些工作。幸运的是,这并不困难,一旦你实现了这些方法,你的类就能与许多 Cocoa 集合类无缝配合工作。
本周的内容就到这里。两周后再见,届时将带来另一期内容。在那之前,请继续发送你希望讨论的主题。周五问答(Friday Q & A)由用户的投稿驱动,所以如果你对某个主题有想法,请发送给我们。
Original (English)
Source: https://www.mikeash.com/pyblog/friday-qa-2010-06-18-implementing-equality-and-hashing.html
Welcome back to a late edition of Friday Q&A. WWDC pushed the schedule back one week, but it’s finally time for another one. This week, I’m going to discuss the implementation of equality and hashing in Cocoa, a topic suggested by a reader.
Equality Object equality is a fundamental concept that gets used all over the place. In Cocoa, it’s implemented with the isEqual: method. Something as simple as [array indexOfObject:] will use it, so it’s important that your objects support it.
It’s so important that Cocoa actually gives us a default implementation of it on NSObject. The default implementation just compares pointers. In other words, an object is only equal to itself, and is never equal to another object. The implementation is functionally identical to:
- (BOOL)isEqual: (id)other { return self == other; }Implementing Custom Equality Sometimes you need a deeper implementation of equality. It’s common for objects, typically what you might refer to as a “value object”, to be distinct from another object but be logically equal to it. For example:
// use mutable strings because that guarantees distinct objects NSMutableString *s1 = [NSMutableString stringWithString: @"Hello, world"]; NSMutableString *s2 = [NSMutableString stringWithFormat: @"%@, %@", @"Hello", @"world"]; BOOL equal = [s1 isEqual: s2]; // gives you YES! MyClass *c1 = ...; MyClass *c2 = ...; BOOL equal = [c1 isEqual: c2];Testing for equality is fairly straightforward most of the time. Gather up the relevant properties of your class, and test them all for equality. If any of them are not equal, then return NO. Otherwise, return YES.
One subtle point with this is that the class of your object is an important property to test as well. It’s perfectly valid to test a MyClass for equality with an NSString, but that comparison should never return YES (unless MyClass is a subclass of NSString, of course).
A somewhat less subtle point is to ensure that you only test properties that are actually important to equality. Things like caches that do not influence your object’s externally-visible value should not be tested.
Let’s say your class looks like this:
@interface MyClass : NSObject { int _length; char *_data; NSString *_name; NSMutableDictionary *_cache; } - (BOOL)isEqual: (id)other { return ([other isKindOfClass: [MyClass class]] && [other length] == _length && memcmp([other data], _data, _length) == 0 && [[other name] isEqual: _name]) // note: no comparison of _cache }If you’re familiar with how hash tables work, you may want to skip the next paragraph or two.
A hash table is basically a big array with special indexing. Objects are placed into an array with an index that corresponds to their hash. The hash is essentially a pseudorandom number generated from the object’s properties. The idea is to make the index random enough to make it unlikely for two objects to have the same hash, but have it be fully reproducible. When an object is inserted, the hash is used to determine where it goes. When an object is looked up, its hash is used to determine where to look.
In more formal terms, the hash of an object is defined such that two objects have an identical hash if they are equal. Note that the reverse is not true, and can’t be: two objects can have an identical hash and not be equal. You want to try to avoid this as much as possible, because when two unequal objects have the same hash (called a collision) then the hash table has to take special measures to handle this, which is slow. However, it’s provably impossible to avoid it completely.
In Cocoa, hashing is implemented with the hash method, which has this signature:
- (NSUInteger)hash; - (NSUInteger)hash { return (NSUInteger)self; }Implementing Custom Hashing Because of the semantics of hash, if you override isEqual: then you must override hash. If you don’t, then you risk having two objects which are equal but which don’t have the same hash. If you use these objects in a dictionary, set, or something else which uses a hash table, then hilarity will ensue.
Because the definition of the object’s hash follows equality so closely, the implementation of hash likewise closely follows the implementation of isEqual:.
An exception to this is that there’s no need to include your object’s class in the definition of hash. That’s basically a safeguard in isEqual: to ensure the rest of the check makes sense when used with a different object. Your hash is likely to be very different from the hash of a different class simply by virtue of hashing different properties and using different math to combine them.
Generating Property Hashes Testing properties for equality is usually straightforward, but hashing them isn’t always. How you hash a property depends on what kind of object it is.
For a numeric property, the hash can simply be the numeric value.
For an object property, you can send the object the hash method, and use what it returns.
For data-like properties, you’ll want to use some sort of hash algorithm to generate the hash. You can use CRC32, or even something totally overkill like MD5. Another approach, somewhat less speedy but easy to use, is to wrap the data in an NSData and ask it for its hash, essentially offloading the work onto Cocoa. In the above example, you could compute the hash of _data like so:
[[NSData dataWithBytes: _data length: _length] hash]The easiest way is to simply add them together, or use the bitwise xor property. However, this can hurt your hash’s uniqueness, because these operations are symmetric, meaning that the separation between different properties gets lost. As an example, consider an object which contains a first and last name, with the following hash implementation:
- (NSUInteger)hash { return [_firstName hash] ^ [_lastName hash]; }How to best combine hashes is a complicated subject without any single answer. However, any asymmetric way of combining the values is a good start. I like to use a bitwise rotation in addition to the xor to combine them:
#define NSUINT_BIT (CHAR_BIT * sizeof(NSUInteger)) #define NSUINTROTATE(val, howmuch) ((((NSUInteger)val) << howmuch) | (((NSUInteger)val) >> (NSUINT_BIT - howmuch)))
- (NSUInteger)hash { return NSUINTROTATE([_firstName hash], NSUINT_BIT / 2) ^ [_lastName hash]; } - (NSUInteger)hash { NSUInteger dataHash = [[NSData dataWithBytes: _data length: _length] hash]; return NSUINTROTATE(dataHash, NSUINT_BIT / 2) ^ [_name hash]; }A Note on Subclassing You have to be careful when subclassing a class which implements custom equality and hashing. In particular, your subclass should not expose any new properties which equality is dependent upon. If it does, then it must not compare equal with any instances of the superclass.
To see why, consider a subclass of the first/last name class which includes a birthday, and includes that as part of its equality computation. It can’t include it when comparing equality with an instance of the superclass, though, so its equality method would look like this:
- (BOOL)isEqual: (id)other { // if the superclass doesn't like it then we're not equal if(![super isEqual: other]) return NO;
// if it's not an instance of the subclass, then trust the superclass // it's equal there, so we consider it equal here if(![other isKindOfClass: [MySubClass class]]) return YES;
// it's an instance of the subclass, the superclass properties are equal // so check the added subclass property return [[other birthday] isEqual: _birthday]; }Now consider an instance of the subclass for “John Smith” with a birthday of 6/7/1994, which I’ll call C. C is not equal to B, which is what we expect. C is equal to A, also expected. But now there’s a problem. A equals both B and C, but B and C do not equal each other! This breaks the standard transitivity of the equality operator, and leads to extremely unexpected results.
In general this should not be a big problem. If your subclass adds properties which influence object equality, that’s probably an indication of a design problem in your hierarchy anyway. Rather than working around it with weird implementations of isEqual:, consider redesigning your class hierarchy.
A Note on Dictionaries If you want to use your object as a key in an NSDictionary, you need to implement hashing and equality, but you also need to implement -copyWithZone:. Techniques for doing that are beyond the scope of today’s post, but you should be aware that you need to go a little bit further in that case.
Conclusion Cocoa provides default implementations of equality and hashing which work for many objects, but if you want your objects to be considered equal even when they’re distinct objects in memory, you have to do a bit of extra work. Fortunately, it’s not difficult to do, and once you implement them, your class will work seamlessly with many Cocoa collection classes.
That’s it for this week. Check back in two more weeks for another edition. Until then, keep sending me your requests for topics. Friday Q&A is driven by user submissions, so if you have an idea for a topic to cover here, please send it in.