良好的字符串哈希函数
我正在尝试为字符串想出一个好的哈希函数。我认为总结字符串中前五个字符的 unicode 值可能是个好主意(假设它有五个,否则在结束处停止)。这是一个好主意,还是一个坏主意?
我正在用 Java 做这件事,但我认为这不会产生太大的影响。
I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?
I am doing this in Java, but I wouldn't imagine that would make much of a difference.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
通常哈希值不会求和,否则
stop
和pots
将具有相同的哈希值。并且您不会将其限制为前 n 个字符,因为否则 house 和 house 将具有相同的哈希值。
一般来说,散列采用值并将其乘以素数(使其更有可能生成唯一的散列),因此您可以执行以下操作:
Usually hashes wouldn't do sums, otherwise
stop
andpots
will have the same hash.and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.
Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:
如果是安全问题,您可以使用 Java 加密:
If it's a security thing, you could use Java crypto:
您可能应该使用 String.hashCode() 。
如果你真的想自己实现 hashCode:
仅使用前五个字符是一个坏主意。考虑一下分层名称,例如 URL:它们都将具有相同的哈希代码(因为它们都以“http://”开头,这意味着它们存储在哈希映射中的同一个存储桶下,表现出糟糕的性能。
这里是一个战争故事,根据“Effective Java”中的字符串 hashCode 进行解释:
You should probably use String.hashCode().
If you really want to implement hashCode yourself:
Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.
Here's a war story paraphrased on the String hashCode from "Effective Java":
如果您在 Java 中执行此操作,那么您为什么要这样做?只需对字符串调用
.hashCode()
If you are doing this in Java then why are you doing it? Just call
.hashCode()
on the stringGuava 的
HashFunction
(javadoc)提供了不错的非加密-强哈希。Guava's
HashFunction
(javadoc) provides decent non-crypto-strong hashing.Nick提供的这个函数很好,但是如果使用new String(byte[] bytes)来转换为String,则失败。您可以使用此功能来做到这一点。
也许这可以帮助某人
This function provided by Nick is good but if you use new String(byte[] bytes) to make the transformation to String, it failed. You can use this function to do that.
May be this can help somebody
据传 FNV-1 是一个很好的字符串哈希函数。
对于长字符串(例如,超过 200 个字符),您可以通过 获得良好的性能MD4 哈希函数。作为一种加密功能,它大约在 15 年前就被破解了,但对于非加密目的来说,它仍然非常好,而且速度快得惊人。在 Java 环境中,您必须将 16 位 char 值转换为 32 位字,例如将这些值分组为对。 MD4 在 Java 中的快速实现可以在 sphlib。在课堂作业中可能有点矫枉过正,但在其他方面值得一试。
FNV-1 is rumoured to be a good hash function for strings.
For long strings (longer than, say, about 200 characters), you can get good performance out of the MD4 hash function. As a cryptographic function, it was broken about 15 years ago, but for non cryptographic purposes, it is still very good, and surprisingly fast. In the context of Java, you would have to convert the 16-bit
char
values into 32-bit words, e.g. by grouping such values into pairs. A fast implementation of MD4 in Java can be found in sphlib. Probably overkill in the context of a classroom assignment, but otherwise worth a try.来源
djb2 哈希函数背后的逻辑 - SO
source
Logic behind djb2 hash function - SO
如果您想查看行业标准实现,我会查看 java.security.MessageDigest。
“消息摘要是一种安全的单向哈希函数,它采用任意大小的数据并输出固定长度的哈希值。”
If you want to see the industry standard implementations, I'd look at java.security.MessageDigest.
"Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value."
sdbm:该算法是为 sdbm(ndbm 的公共域重新实现)数据库库创建的
sdbm:this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library
这将避免任何碰撞,并且在我们在计算中使用移位之前它会很快。
This will avoid any collision and it will be fast until we use the shifting in calculations.
在尝试为字符串开发良好的哈希函数时,最好使用奇数。这个函数接受一个字符串并返回一个索引值,到目前为止它的工作效果很好。并且碰撞较少。索引范围从 0 - 300 甚至更多,但到目前为止,即使使用像“机电工程”这样的长单词,我也没有得到任何更高的值,
您可以做的另一件事是将每个字符 int parse 乘以索引,因为它会增加,例如“熊”字
(0*b) + (1*e) + (2*a) + (3*r) 这会给你一个 int 值来使用。上面的第一个哈希函数在“这里”和“听到”发生冲突,但仍然擅长提供一些良好的独特值。下面的字符不会与“这里”和“听到”冲突,因为我将每个字符与索引相乘,随着索引的增加。
Its a good idea to work with odd number when trying to develop a good hast function for string. this function takes a string and return a index value, so far its work pretty good. and has less collision. the index ranges from 0 - 300 maybe even more than that, but i haven't gotten any higher so far even with long words like "electromechanical engineering"
another thing you can do is multiplying each character int parse by the index as it increase like the word "bear"
(0*b) + (1*e) + (2*a) + (3*r) which will give you an int value to play with. the first hash function above collide at "here" and "hear" but still great at give some good unique values. the one below doesn't collide with "here" and "hear" because i multiply each character with the index as it increases.
这是一个简单的哈希函数,我将其用于构建的哈希表。它基本上用于获取文本文件并将每个单词存储在代表字母顺序的索引中。
这基本上是根据单词的第一个字母对单词进行哈希处理。因此,以“a”开头的单词的哈希键为 0,“b”为 1,依此类推,“z”为 25。数字和符号的哈希键为 26。这提供了一个优点;您可以轻松快速地计算给定单词在哈希表中的索引位置,因为它全部按字母顺序排列,如下所示:
代码可以在这里找到:https://github.com/abhijitcpatil/general
这将是输出:
Here's a simple hash function that I use for a hash table I built. Its basically for taking a text file and stores every word in an index which represents the alphabetical order.
What this basically does is words are hashed according to their first letter. So, word starting with 'a' would get a hash key of 0, 'b' would get 1 and so on and 'z' would be 25. Numbers and symbols would have a hash key of 26. THere is an advantage this provides; You can calculate easily and quickly where a given word would be indexed in the hash table since its all in an alphabetical order, something like this:
Code can be found here: https://github.com/abhijitcpatil/general
This would be the output: