如何计算字符串列表的良好哈希码?
背景:
- 我有一个简短的字符串列表。
- 字符串的数量并不总是相同,但几乎总是“少数”的数量级。
- 我们的数据库将这些字符串存储在第二个规范化表中。
- 这些字符串一旦写入,就永远改变。到数据库。
我们希望能够在查询中快速匹配这些字符串,而不会因为进行大量连接而影响性能。
因此,我正在考虑将所有这些字符串的哈希码存储在主表中并将其包含在我们的索引中,因此只有当哈希码匹配时数据库才会处理连接。
那么如何获得一个好的哈希码呢?我可以:
- 将所有字符串的哈希码一起异
- 或 异或将每个字符串后的结果相乘(例如乘以 31)
- 将所有字符串放在一起,然后获取哈希码
- 其他方式
那么人们怎么想?
最后,我只是连接字符串并计算连接的哈希码,因为它很简单并且工作得很好。
(如果您关心的话,我们正在使用 .NET 和 SqlServer)
错误!错误!
引用 Eric Lippert 的 GetHashCode 指南和规则
文档 System.String.GetHashCode 注释 特别是两个相同的 字符串可以有不同的哈希码 在不同版本的 CLR 中,以及 事实上他们确实如此。不存储字符串 数据库中的哈希值并期望它们 永远一样,因为他们 不会的。
因此 String.GetHashcode() 不应该用于此目的。
Background:
- I have a short list of strings.
- The number of strings is not always the same, but are nearly always of the order of a “handful”
- In our database will store these strings in a 2nd normalised table
- These strings are never changed once they are written to the database.
We wish to be able to match on these strings quickly in a query without the performance hit of doing lots of joins.
So I am thinking of storing a hash code of all these strings in the main table and including it in our index, so the joins are only processed by the database when the hash code matches.
So how do I get a good hashcode? I could:
- Xor the hash codes of all the string together
- Xor with multiply the result after each string (say by 31)
- Cat all the string together then get the hashcode
- Some other way
So what do people think?
In the end I just concatenate the strings and compute the hashcode for the concatenation, as it is simple and worked well enough.
(If you care we are using .NET and SqlServer)
Bug!, Bug!
Quoting from Guidelines and rules for GetHashCode by Eric Lippert
The documentation for
System.String.GetHashCode notes
specifically that two identical
strings can have different hash codes
in different versions of the CLR, and
in fact they do. Don't store string
hashes in databases and expect them to
be the same forever, because they
won't be.
So String.GetHashcode() should not be used for this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
标准的java实践,就是简单的写
Standard java practise, is to simply write
我认为没有理由不连接字符串并计算连接的哈希码。
打个比方,假设我想计算一个内存块的 MD5 校验和,我不会将该块分割成更小的块并为它们计算单独的 MD5 校验和,然后用一些临时方法将它们组合起来。
I see no reason not to concatenate the strings and compute the hashcode for the concatenation.
As an analogy, say that I wanted to compute a MD5 checksum for a memory block, I wouldn't split the block up into smaller pieces and compute individual MD5 checksums for them and then combine them with some ad hoc method.
您的第一个选项唯一的不便是
(String1, String2)
生成与(String2, String1)
相同的哈希码。如果这不是问题(例如,因为您有修复订单),那就没问题。“将所有字符串放在一起,然后获取哈希码”对我来说似乎更自然、更安全。
更新:正如评论所指出的,这有一个缺点,即列表 ("x", "yz") 和 ("xy","z") 会给出相同的哈希值。为了避免这种情况,您可以使用不能出现在字符串内部的字符串分隔符来连接字符串。
如果字符串很大,您可能更愿意对每个字符串进行散列,对散列码进行分类并重新散列结果。更多的CPU,更少的内存。
Your first option has the only inconvenience of
(String1, String2)
producing the same hashcode of(String2, String1)
. If that's not a problem (eg. because you have a fix order) it's fine."Cat all the string together then get the hashcode" seems the more natural and secure to me.
Update: As a comment points out, this has the drawback that the list ("x", "yz") and ("xy","z") would give the same hash. To avoid this, you could join the strings with a string delimiter that cannot appear inside the strings.
If the strings are big, you might prefer to hash each one, cat the hashcodes and rehash the result. More CPU, less memory.
我脑海中浮现的另一种方式是,根据索引使用旋转散列进行链式异或:
编辑:阅读有效java中给出的解释,我认为杰夫的代码会更有效。
Another way that pops in my head, chain xors with rotated hashes based on index:
edit: reading the explanation given in effective java, I think geoff's code would be much more efficient.
基于 SQL 的解决方案可以基于 checksum 和 checksum_agg 函数。如果我没理解错的话,你会得到类似的结果:
给定项目 (MyTableId) 的各种字符串存储在 MyChildTable 中。要计算和存储反映这些(永远不会更改)字符串的校验和,应该像这样工作:
我相信这是与顺序无关的,因此字符串“The Quick Brown”将产生与“Brown The Quick”相同的校验和”。
A SQL-based solution could be based on the checksum and checksum_agg functions. If I'm following it right, you have something like:
with the various strings for a given item (MyTableId) stored in MyChildTable. To calculate and store a checksum reflecting these (never-to-be-changed) strings, something like this should work:
I believe this is order-independant, so strings "The quick brown" would produce the same checksum as "brown The quick".
我希望这是不必要的,但是由于您没有提到任何听起来像是您只是使用哈希码进行第一次检查,然后再验证字符串实际上是否相等的内容,所以我觉得有必要警告您:
哈希码相等!=值相等
会有很多组字符串产生相同的哈希码,但并不总是相等。
I hope this is unnecessary, but since you don't mention anything which sounds like you're only using the hashcodes for a first check and then later verifying that the strings are actually equal, I feel the need to warn you:
Hashcode equality != value equality
There will be lots of sets of strings which yield the identical hashcode, but won't always be equal.
所以我明白,您实际上有一些需要通过哈希码来识别的字符串集,并且您需要在其中识别的那组字符串永远不会改变?
如果是这种情况,这并不特别重要,只要您使用的方案为不同的字符串/字符串组合提供唯一的数字即可。我首先连接字符串并计算 String.hashCode() 并查看最终是否得到唯一的数字。如果不这样做,那么您可以尝试:
64 位哈希码的可能方案如下:
因此,基于数值配方中建议的值的实现将是:
上面正在初始化我们的随机数数组。我们使用 XORShift 生成器,但我们实际上可以使用任何质量相当好的随机数生成器(使用特定种子创建 SecureRandom(),然后调用 nextLong() 就可以了)。然后,生成哈希码:
要考虑的指南是,给定 n 位哈希码,平均而言,您需要生成 2^(n/2) 个字符串的哈希值,然后才能获得碰撞。或者换句话说,对于 64 位哈希,您预计在大约 40 亿个字符串后会发生冲突(因此,如果您要处理多达几百万个字符串,则发生冲突的可能性可以忽略不计) )。
另一种选择是 MD5,它是一个非常强大的哈希值(实际上安全),但它是 128 位哈希值,因此您有一个必须处理 128 位值的轻微缺点。我想说 MD5 对于这些目的来说有点过分了——正如我所说,使用 64 位散列,您可以相当安全地处理几百万个字符串。
(抱歉,我应该澄清一下——MD5 被设计为一种安全散列,只是后来发现它不安全。“安全”散列是指给定特定散列时,故意构造会导致以下情况的输入是不可行的:在某些情况下——但不像我所理解的——你可能需要这个属性,另一方面,如果你正在处理用户输入的数据——即a。恶意用户可能会故意尝试混淆您的系统。您可能还会对我过去写的以下内容感兴趣:
So I understand, you effectively have some set of strings that you need to identify by hash code, and that set of strings that you need to identify among will never change?
If that's the case, it doesn't particularly matter, so long as the scheme you use gives you unique numbers for the different strings/combinations of strings. I would start by just concatenating the strings and calculating the String.hashCode() and seeing if you end up with unique numbers. If you don't, then you could try:
A possible scheme for a 64-bit hash code is as follows:
So an implementation based on values suggested in Numerical Recipes would be:
The above is initialising our array of random numbers. We use an XORShift generator, but we could really use any fairly good-quality random number generator (creating a SecureRandom() with a particular seed then calling nextLong() would be fine). Then, to generate a hash code:
A guide to consider is that given a hash code of n bits, on average you'd expect to have to generate hashes of in the order of 2^(n/2) strings before you get a collision. Or put another way, with a 64-bit hash, you'd expect a collision after around 4 billion strings (so if you're dealing with up to, say, a couple of million strings, the chances of a collision are pretty negligible).
Another option would be MD5, which is a very strong hash (practically secure), but it is a 128-bit hash, so you have the slight disadvantage of having to deal with 128-bit values. I would say MD5 is overkill for these purposes-- as I say, with a 64-bit hash, you can deal fairly safely with in the order of a few million strings.
(Sorry, I should clarify -- MD5 was designed as a secure hash, it's just that it's since found not to be secure. A "secure" hash is one where given a particular hash it's not feasible to deliberately construct input that would lead to that hash. In some circumstances-- but not as I understand in yours-- you would need this property. You might need it, on the other hand, if the strings you're dealing with a user-input data-- i.e. a malicious user could deliberately try to confuse your system. You might also be interetsed in the following I've written in the past:
使用
GetHashCode()
并不适合组合多个值。问题是对于字符串来说,哈希码只是一个校验和。对于相似的值来说,这留下了很少的熵。例如,为(“abc”,“bbc”)添加哈希码将与(“abd”,“abc”)相同,从而导致冲突。如果您需要绝对确定,您可以使用真正的哈希算法,例如 SHA1、MD5 等。唯一的问题是它们是块函数,很难快速比较哈希值是否相等。相反,请尝试 CRC 或 FNV1 哈希。 FNV1 32 位非常简单:
Using the
GetHashCode()
is not ideal for combining multiple values. The problem is that for strings, the hashcode is just a checksum. This leaves little entropy for similar values. e.g. adding hashcodes for ("abc", "bbc") will be the same as ("abd", "abc"), causing a collision.In cases where you need to be absolutely sure, you'd use a real hash algorithm, like SHA1, MD5, etc. The only problem is that they are block functions, which is difficult to quickly compare hashes for equality. Instead, try a CRC or FNV1 hash. FNV1 32-bit is super simple:
您可以使用以下方法来聚合哈希码:
http ://docs.oracle.com/javase/7/docs/api/java/util/Objects.html#hash(java.lang.Object...)
You can use the following method to aggregate hash codes:
http://docs.oracle.com/javase/7/docs/api/java/util/Objects.html#hash(java.lang.Object...)
如果您碰巧使用 Java,则可以创建一个字符串数组(或将集合转换为数组),然后使用
Arrays.hashCode()
,如文档 此处。If you happen to use Java, you can create an array of strings (or convert a collection to an array), and then use
Arrays.hashCode()
as documented here.让我们解决您的根本问题。
不要使用哈希码。只需为每个字符串添加一个整数主键
Let's solve your root problem.
Don't use a hashcode. Just add a integer primary key for each String