How do I generate a unique hash code for an object, based on its contents?
I need to generate a unique hash code for an object, based on its contents, e.g. DateTime(2011,06,04) should equal DateTime(2011,06,04).
- I cannot use .GetHashCode() because it might generate the same hash code for objects with different contents.
- I cannot use .GetId from ObjectIDGenerator as it generates a different ID for objects with the same contents.
- If the object contains other sub-objects, it needs to recursively check these.
- It needs to work on collections.
The reason I need to write this? I'm writing a caching layer using PostSharp.
Update
I think I may have been asking the wrong question. As Jon Skeet pointed out, to be on the safe side, I need as many unique combinations in the cache key as there are combinations of potential data in the object. Therefore, the best solution might be to build up a long string that encodes the public properties for the object, using reflection. The objects are not too large so this is very quick and efficient:
- It's efficient to construct the cache key (just convert the public properties of the object into a big string).
- It's efficient to check for a cache hit (compare two strings).
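
The approach described in the update can be sketched roughly as follows. This is only an illustration under assumptions: the class name `CacheKeyBuilder` and its formatting rules are made up for this example, it reads public instance properties only, and it has no guard against cyclic object graphs.

```csharp
using System;
using System.Collections;
using System.Globalization;
using System.Linq;
using System.Reflection;
using System.Text;

public static class CacheKeyBuilder
{
    // Hypothetical helper: encodes a value and, recursively, its public
    // instance properties into one string that can serve as a cache key.
    public static string Build(object value)
    {
        var sb = new StringBuilder();
        Append(sb, value);
        return sb.ToString();
    }

    private static void Append(StringBuilder sb, object value)
    {
        if (value == null) { sb.Append("null"); return; }

        Type type = value.GetType();
        if (value is string s)
        {
            // The length prefix keeps adjacent strings from running together.
            sb.Append('"').Append(s.Length).Append(':').Append(s).Append('"');
        }
        else if (value is DateTime dt)
        {
            sb.Append(dt.Ticks).Append('/').Append(dt.Kind);
        }
        else if (type.IsPrimitive || type.IsEnum || value is decimal || value is Guid)
        {
            sb.Append(Convert.ToString(value, CultureInfo.InvariantCulture));
        }
        else if (value is IEnumerable items)
        {
            sb.Append('[');
            foreach (object item in items) { Append(sb, item); sb.Append(','); }
            sb.Append(']');
        }
        else
        {
            // Any other type: type name plus its readable public properties,
            // sorted by name so the key does not depend on reflection order.
            sb.Append(type.FullName).Append('{');
            var props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
                            .Where(p => p.CanRead && p.GetIndexParameters().Length == 0)
                            .OrderBy(p => p.Name, StringComparer.Ordinal);
            foreach (PropertyInfo p in props)
            {
                sb.Append(p.Name).Append('=');
                Append(sb, p.GetValue(value));
                sb.Append(';');
            }
            sb.Append('}');
        }
    }
}
```

With this sketch, `CacheKeyBuilder.Build(new DateTime(2011, 6, 4))` returns the same string on every call, and a different string for any other date, so a cache hit is a plain string comparison.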
From a comment:
That seems like an unusual requirement but since that's your requirement, let's do the math.
Let's suppose you make a billion unique objects a year -- thirty per second -- for 10 trillion trillion trillion years. That's 10^49 unique objects you're creating. Working out the math is quite easy; the probability of at least one hash collision in that time is above one in 10^18 when the bit size of the hash is less than 384.
Therefore you'll need at least a 384 bit hash code to have the level of uniqueness that you require. That's a convenient size, being 12 int32s. If you're going to be making more than 30 objects a second or want the probability to be less than one in 10^18 then more bits will be necessary.
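
The 384-bit figure can be checked with the usual birthday approximation. The working below is a sketch: it takes the answer's count of n = 10^49 objects as given and assumes the b-bit hash values are uniformly distributed.

```latex
p \approx \frac{n^{2}}{2^{\,b+1}}, \qquad n = 10^{49}
\;\Longrightarrow\; \log_2 n^{2} = 98 \log_2 10 \approx 325.5

p \le 10^{-18} \approx 2^{-59.8}
\;\Longleftrightarrow\; b + 1 \ge 325.5 + 59.8 = 385.3
\;\Longleftrightarrow\; b \ge 384.3
```

So a hash of about 384 bits puts the collision probability at roughly one in 10^18, and anything shorter pushes it above that.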
Why do you have such stringent requirements?
Here's what I would do if I had your stated requirements. The first problem is to convert every possible datum into a self-describing sequence of bits. If you have a serialization format already, use that. If not, invent one that can serialize all possible objects that you are interested in hashing.
Then, to hash the object, serialize it into a byte array and then run the byte array through the SHA-384 or SHA-512 hashing algorithm. That will produce a professional-crypto-grade 384 or 512 bit hash that is believed to be unique even in the face of attackers trying to force collisions. That many bits should be more than enough to ensure low probability of collision in your ten trillion trillion trillion year timeframe.
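
A minimal sketch of the serialize-then-hash step might look like this. The class name `ContentHash` is made up for the example, and JSON via `System.Text.Json` is only a stand-in for the serialization format. JSON does not record type names, so two different types with identical property names and values would produce the same digest. A truly self-describing format would need to encode the type as well.

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

public static class ContentHash
{
    // Serializes the value to UTF-8 JSON and returns the SHA-384 digest as hex.
    public static string Sha384Hex(object value)
    {
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(value, value.GetType());
        using (SHA384 sha = SHA384.Create())
        {
            byte[] digest = sha.ComputeHash(payload);   // 48 bytes = 384 bits
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

The result is a 96-character hex string. Two objects that serialize to the same bytes always get the same digest.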
If you need to create a unique hash code, then you're basically talking about a number which can represent as many states as your type can have. For DateTime that means taking the Ticks value and the DateTimeKind, I believe.
You may be able to get away with assuming that the top two bits of the Ticks property are going to be zero, and using those to store the kind. That means you're okay up until the year 7307 as far as I can tell.
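
The answer's own snippet is not preserved on this page. A minimal sketch of the idea it describes, with the hypothetical names `DateTimeKey` and `ToKey`, could be:

```csharp
using System;

public static class DateTimeKey
{
    // Packs Kind into the top two bits and Ticks into the low 62 bits.
    public static ulong ToKey(DateTime when)
    {
        ulong kind = (ulong)(int)when.Kind;   // Unspecified = 0, Utc = 1, Local = 2
        return (kind << 62) | (ulong)when.Ticks;
    }
}
```

In this particular layout the two fields do not overlap for any valid DateTime, because `DateTime.MaxValue.Ticks` (3,155,378,975,999,999,999) is below 2^62.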
You are not talking about a hash code here, you need a number representation of your state - for that to be unique it might have to be incredibly large depending on your object structure.
Why don't you use a regular hashcode instead, and handle collisions by actually comparing the objects? That seems to be the most reasonable approach.
An addition to BrokenGlass' answer, which I have voted up and consider to be correct:
Using the GetHashCode/Equals method means that if two objects hash to the same value, you'll be relying on their Equals implementation to tell you if they are equivalent.
Unless these objects override Equals (which would practically mean that they implement IEquatable<T>, where T is their type), the default implementation of Equals is going to do a reference comparison. This in turn means that your cache would mistakenly yield a miss for objects which are "equal" in the business sense but have been constructed independently.
Consider the usage model for your cache carefully, because if you end up using it for classes that are not IEquatable<T>, and in a manner where you expect to be checking non-reference-equal objects for equality, the cache will turn out to be completely useless.
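
To make the difference concrete, here is a small sketch with two made-up key types, `ReportKey` and `PlainKey`. They hold the same data, but only the first overrides equality. A `Dictionary` uses the hash code to find a bucket and then calls `Equals` to settle collisions, which is the "regular hash code plus comparison" approach suggested above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache key type with content-based equality.
public sealed class ReportKey : IEquatable<ReportKey>
{
    public ReportKey(string customer, DateTime day) { Customer = customer; Day = day; }
    public string Customer { get; }
    public DateTime Day { get; }

    public bool Equals(ReportKey other) =>
        !(other is null) && Customer == other.Customer && Day == other.Day;

    public override bool Equals(object obj) => Equals(obj as ReportKey);

    // Collisions are allowed here; Dictionary falls back to Equals to resolve them.
    public override int GetHashCode()
    {
        unchecked { return ((Customer?.GetHashCode() ?? 0) * 397) ^ Day.GetHashCode(); }
    }
}

// Same shape, but without overrides: equality is by reference.
public sealed class PlainKey
{
    public PlainKey(string customer, DateTime day) { Customer = customer; Day = day; }
    public string Customer { get; }
    public DateTime Day { get; }
}
```

A `Dictionary<ReportKey, string>` finds an entry when probed with a separately constructed but equal `ReportKey`. A `Dictionary<PlainKey, string>` misses in the same situation, which is the failure mode described above.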
It's quite normal for a hash code to have collisions. If your hash code has a fixed length (32 bits in the case of the standard .NET hash code), then you're bound to have collisions with any values whose range is bigger than this (e.g. 64 bits for long; n*64 bits for an array of n longs etc).
In fact, for any hash code with a finite length of N bits, there will always be collisions in any collection of more than 2^N distinct values.
What you're asking for isn't feasible in the general case.
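
The pigeonhole argument is easy to see with a deliberately tiny hash. The sketch below uses a made-up 8-bit hash (`Hash8`), which has only 2^8 = 256 possible outputs. The same reasoning applies to the 2^32 outputs of a standard .NET hash code.

```csharp
using System;
using System.Collections.Generic;

public static class Pigeonhole
{
    // Toy 8-bit hash: whatever the formula, there are only 256 possible results.
    public static byte Hash8(int value) => unchecked((byte)(value * 31));

    // Returns true as soon as two values share a hash.
    public static bool HasCollision(IEnumerable<int> values)
    {
        var seen = new HashSet<byte>();
        foreach (int v in values)
        {
            if (!seen.Add(Hash8(v))) return true;
        }
        return false;
    }
}
```

Feeding it the 256 values 0 to 255 happens to produce no collision with this particular formula, but any 257 distinct values must collide, whichever 8-bit hash is used.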
We had exactly the same requirement and here is the function I came up with. This is what works well for types of objects we need to cache
So for example if we have something like this
Cache key generated by method above will be
You can calculate, for example, an MD5 sum (or something similar) from the object serialized to JSON.
If you want only some properties to matter, you can create an anonymous object containing just those properties and hash that instead.
I use this to check whether someone has tampered with my database, which stores license-based data. You can also append a seed to the JSON string to make the hash harder to reproduce.
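
A minimal sketch of this idea follows. The answer does not say which JSON library it used, so `System.Text.Json` is an assumption here, and the names `JsonFingerprint` and `Md5Hex` are made up for the example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class JsonFingerprint
{
    // MD5 hex digest of the JSON form of a value, with an optional seed appended.
    public static string Md5Hex(object value, string seed = "")
    {
        string json = JsonSerializer.Serialize(value, value.GetType()) + seed;
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(json));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

To hash only selected properties, pass an anonymous object, for example `JsonFingerprint.Md5Hex(new { order.Id, order.Total })` for some hypothetical `order`. MD5 is fine for spotting accidental changes, but it is not collision-resistant against a deliberate attacker.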
Would this extension method suit your purposes? If the object is a value type, it just returns its hash code. Otherwise, it recursively gets the value of each property and combines them into a single hash.
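
The extension method itself is not preserved on this page. A rough sketch matching the description, with the hypothetical name `GetDeepHashCode`, might look like the following. The sketch also treats strings as leaf values and walks collections item by item, which the description does not mention, and it has no protection against cyclic references.

```csharp
using System;
using System.Collections;
using System.Reflection;

public static class DeepHashExtensions
{
    // Combines the hash codes of an object's public properties, recursively.
    public static int GetDeepHashCode(this object value)
    {
        if (value == null) return 0;
        Type type = value.GetType();

        // Value types and strings are treated as leaves with their own hash code.
        if (type.IsValueType || value is string) return value.GetHashCode();

        int hash = 17;
        if (value is IEnumerable items)
        {
            // Collections: fold in each item, so item order matters.
            foreach (object item in items)
                hash = unchecked(hash * 31 + item.GetDeepHashCode());
            return hash;
        }

        // Other reference types: fold in each readable public property,
        // sorted by name so the result does not depend on reflection order.
        PropertyInfo[] props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance);
        Array.Sort(props, (a, b) => string.CompareOrdinal(a.Name, b.Name));
        foreach (PropertyInfo p in props)
        {
            if (!p.CanRead || p.GetIndexParameters().Length > 0) continue;
            hash = unchecked(hash * 31 + p.GetValue(value).GetDeepHashCode());
        }
        return hash;
    }
}
```

The result is still a 32-bit value, so different contents can collide. On .NET Core and later, `string.GetHashCode()` is also randomized per process, so hashes built this way should not be stored or compared across processes.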
Some of the answers here serialize to JSON and generate an MD5 hash from that. This works most of the time, except when you have collections and the item order is different. The same object could generate different hashes because of the collection order difference.
The solution I came up with is to serialize to JSON (using Newtonsoft Json.NET) and order any child collections by hashing each of the items and sorting by that hash. This gives us a deterministic serialized representation that we can generate a hash from.
There might be some scenarios I'm not fully accounting for, but this works for the nested collections of complex objects for most common scenarios.
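
The author's code is not preserved on this page. The sketch below shows the same general idea with two substitutions that should be read as assumptions: it uses `System.Text.Json` rather than Newtonsoft Json.NET, and it sorts array items by their canonical JSON text rather than by a per-item hash. The name `OrderInsensitiveHash` is made up, and SHA-256 is an arbitrary choice of digest.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class OrderInsensitiveHash
{
    // Hashes the canonical JSON form of a value with SHA-256.
    public static string Compute(object value)
    {
        string json = JsonSerializer.Serialize(value, value.GetType());
        using (JsonDocument doc = JsonDocument.Parse(json))
        using (SHA256 sha = SHA256.Create())
        {
            string canonical = Canonicalize(doc.RootElement);
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // Rebuilds the JSON text with every array's items sorted, so collections
    // holding the same items in a different order serialize identically.
    private static string Canonicalize(JsonElement element)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                IEnumerable<string> members = element.EnumerateObject()
                    .Select(p => JsonSerializer.Serialize(p.Name) + ":" + Canonicalize(p.Value));
                return "{" + string.Join(",", members) + "}";
            case JsonValueKind.Array:
                IEnumerable<string> items = element.EnumerateArray()
                    .Select(Canonicalize)
                    .OrderBy(s => s, StringComparer.Ordinal);
                return "[" + string.Join(",", items) + "]";
            default:
                return element.GetRawText();
        }
    }
}
```

Note that this treats every array as unordered. If the order of some collection is meaningful for your cache, two genuinely different objects would get the same hash.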
Generic Extension Method