良好的字符串哈希函数

发布于 2024-08-29 03:04:23 字数 132 浏览 3 评论 0原文

我正在尝试为字符串想出一个好的哈希函数。我认为总结字符串中前五个字符的 unicode 值可能是个好主意(假设它有五个,否则在结束处停止)。这是一个好主意,还是一个坏主意?

我正在用 Java 做这件事,但我认为这不会产生太大的影响。

I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?

I am doing this in Java, but I wouldn't imagine that would make much of a difference.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(14

世态炎凉 2024-09-05 03:04:23

通常哈希值不会求和,否则 stoppots 将具有相同的哈希值。

并且您不会将其限制为前 n 个字符,因为否则 house 和 house 将具有相同的哈希值。

一般来说,散列采用值并将其乘以素数(使其更有可能生成唯一的散列),因此您可以执行以下操作:

int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash*31 + charAt(i);
}

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash*31 + charAt(i);
}
夏雨凉 2024-09-05 03:04:23

如果是安全问题,您可以使用 Java 加密:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToHash.getBytes());
String stringHash = new String(messageDigest.digest());

If it's a security thing, you could use Java crypto:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToHash.getBytes());
String stringHash = new String(messageDigest.digest());
乄_柒ぐ汐 2024-09-05 03:04:23

您可能应该使用 String.hashCode()

如果你真的想自己实现 hashCode:

不要试图排除
物体的重要部分来自
哈希码计算改进
性能——Joshua Bloch,《Effective Java》

仅使用前五个字符是一个坏主意。考虑一下分层名称,例如 URL:它们都将具有相同的哈希代码(因为它们都以“http://”开头,这意味着它们存储在哈希映射中的同一个存储桶下,表现出糟糕的性能。

这里是一个战争故事,根据“Effective Java”中的字符串 hashCode 进行解释:

实现的字符串哈希函数
在 1.2 之前检查的所有版本中
最多十六个字符,均匀
在整个字符串中间隔,开始
与第一个字符。对于大型
分层名称的集合,
比如 URL,这个哈希函数
表现出可怕的行为。

You should probably use String.hashCode().

If you really want to implement hashCode yourself:

Do not be tempted to exclude
significant parts of an object from
the hash code computation to improve
performance -- Joshua Bloch, Effective Java

Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.

Here's a war story paraphrased on the String hashCode from "Effective Java":

The String hash function implemented
in all releases prior to 1.2 examined
at most sixteen characters, evenly
spaced throughout the string, starting
with the first character. For large
collections of hierarchical names,
such as URLs, this hash function
displayed terrible behavior.

睫毛上残留的泪 2024-09-05 03:04:23

如果您在 Java 中执行此操作,那么您为什么要这样做?只需对字符串调用 .hashCode()

If you are doing this in Java then why are you doing it? Just call .hashCode() on the string

似梦非梦 2024-09-05 03:04:23

Guava 的 HashFunction (javadoc)提供了不错的非加密-强哈希。

Guava's HashFunction (javadoc) provides decent non-crypto-strong hashing.

意犹 2024-09-05 03:04:23

Nick提供的这个函数很好,但是如果使用new String(byte[] bytes)来转换为String,则失败。您可以使用此功能来做到这一点。

private static final char[] hex = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public static String byteArray2Hex(byte[] bytes) {
    StringBuffer sb = new StringBuffer(bytes.length * 2);
    for(final byte b : bytes) {
        sb.append(hex[(b & 0xF0) >> 4]);
        sb.append(hex[b & 0x0F]);
    }
    return sb.toString();
}

public static String getStringFromSHA256(String stringToEncrypt) throws NoSuchAlgorithmException {
    MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
    messageDigest.update(stringToEncrypt.getBytes());
    return byteArray2Hex(messageDigest.digest());
}

也许这可以帮助某人

This function provided by Nick is good but if you use new String(byte[] bytes) to make the transformation to String, it failed. You can use this function to do that.

private static final char[] hex = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public static String byteArray2Hex(byte[] bytes) {
    StringBuffer sb = new StringBuffer(bytes.length * 2);
    for(final byte b : bytes) {
        sb.append(hex[(b & 0xF0) >> 4]);
        sb.append(hex[b & 0x0F]);
    }
    return sb.toString();
}

public static String getStringFromSHA256(String stringToEncrypt) throws NoSuchAlgorithmException {
    MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
    messageDigest.update(stringToEncrypt.getBytes());
    return byteArray2Hex(messageDigest.digest());
}

May be this can help somebody

如日中天 2024-09-05 03:04:23

据传 FNV-1 是一个很好的字符串哈希函数。

对于长字符串(例如,超过 200 个字符),您可以通过 获得良好的性能MD4 哈希函数。作为一种加密功能,它大约在 15 年前就被破解了,但对于非加密目的来说,它仍然非常好,而且速度快得惊人。在 Java 环境中,您必须将 16 位 char 值转换为 32 位字,例如将这些值分组为对。 MD4 在 Java 中的快速实现可以在 sphlib。在课堂作业中可能有点矫枉过正,但在其他方面值得一试。

FNV-1 is rumoured to be a good hash function for strings.

For long strings (longer than, say, about 200 characters), you can get good performance out of the MD4 hash function. As a cryptographic function, it was broken about 15 years ago, but for non cryptographic purposes, it is still very good, and surprisingly fast. In the context of Java, you would have to convert the 16-bit char values into 32-bit words, e.g. by grouping such values into pairs. A fast implementation of MD4 in Java can be found in sphlib. Probably overkill in the context of a classroom assignment, but otherwise worth a try.

苏别ゝ 2024-09-05 03:04:23
// djb2 hash function
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

来源
djb2 哈希函数背后的逻辑 - SO

// djb2 hash function
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

source
Logic behind djb2 hash function - SO

梅窗月明清似水 2024-09-05 03:04:23

如果您想查看行业标准实现,我会查看 java.security.MessageDigest

“消息摘要是一种安全的单向哈希函数,它采用任意大小的数据并输出固定长度的哈希值。”

If you want to see the industry standard implementations, I'd look at java.security.MessageDigest.

"Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value."

海拔太高太耀眼 2024-09-05 03:04:23

sdbm:该算法是为 sdbm(ndbm 的公共域重新实现)数据库库创建的

static unsigned long sdbm(unsigned char *str)
{   
    unsigned long hash = 0;
    int c;
    while (c = *str++)
            hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

sdbm:this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library

static unsigned long sdbm(unsigned char *str)
{   
    unsigned long hash = 0;
    int c;
    while (c = *str++)
            hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}
素罗衫 2024-09-05 03:04:23
         public String hashString(String s) throws NoSuchAlgorithmException {
    byte[] hash = null;
    try {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        hash = md.digest(s.getBytes());

    } catch (NoSuchAlgorithmException e) { e.printStackTrace(); }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.length; ++i) {
        String hex = Integer.toHexString(hash[i]);
        if (hex.length() == 1) {
            sb.append(0);
            sb.append(hex.charAt(hex.length() - 1));
        } else {
            sb.append(hex.substring(hex.length() - 2));
        }
    }
    return sb.toString();
}
         public String hashString(String s) throws NoSuchAlgorithmException {
    byte[] hash = null;
    try {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        hash = md.digest(s.getBytes());

    } catch (NoSuchAlgorithmException e) { e.printStackTrace(); }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.length; ++i) {
        String hex = Integer.toHexString(hash[i]);
        if (hex.length() == 1) {
            sb.append(0);
            sb.append(hex.charAt(hex.length() - 1));
        } else {
            sb.append(hex.substring(hex.length() - 2));
        }
    }
    return sb.toString();
}
冰魂雪魄 2024-09-05 03:04:23

这将避免任何碰撞,并且在我们在计算中使用移位之前它会很快。

 int k = key.length();
    int sum = 0;
    for(int i = 0 ; i < k-1 ; i++){
        sum += key.charAt(i)<<(5*i);
    }

This will avoid any collision and it will be fast until we use the shifting in calculations.

 int k = key.length();
    int sum = 0;
    for(int i = 0 ; i < k-1 ; i++){
        sum += key.charAt(i)<<(5*i);
    }
风柔一江水 2024-09-05 03:04:23

在尝试为字符串开发良好的哈希函数时,最好使用奇数。这个函数接受一个字符串并返回一个索引值,到目前为止它的工作效果很好。并且碰撞较少。索引范围从 0 - 300 甚至更多,但到目前为止,即使使用像“机电工程”这样的长单词,我也没有得到任何更高的值,

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += 7*n%31;
    }
    return u%139;
}

您可以做的另一件事是将每个字符 int parse 乘以索引,因为它会增加,例如“熊”字
(0*b) + (1*e) + (2*a) + (3*r) 这会给你一个 int 值来使用。上面的第一个哈希函数在“这里”和“听到”发生冲突,但仍然擅长提供一些良好的独特值。下面的字符不会与“这里”和“听到”冲突,因为我将每个字符与索引相乘,随着索引的增加。

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += i*n%31;
    }
    return u%139;
}

Its a good idea to work with odd number when trying to develop a good hast function for string. this function takes a string and return a index value, so far its work pretty good. and has less collision. the index ranges from 0 - 300 maybe even more than that, but i haven't gotten any higher so far even with long words like "electromechanical engineering"

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += 7*n%31;
    }
    return u%139;
}

another thing you can do is multiplying each character int parse by the index as it increase like the word "bear"
(0*b) + (1*e) + (2*a) + (3*r) which will give you an int value to play with. the first hash function above collide at "here" and "hear" but still great at give some good unique values. the one below doesn't collide with "here" and "hear" because i multiply each character with the index as it increases.

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += i*n%31;
    }
    return u%139;
}
静赏你的温柔 2024-09-05 03:04:23

这是一个简单的哈希函数,我将其用于构建的哈希表。它基本上用于获取文本文件并将每个单词存储在代表字母顺序的索引中。

int generatehashkey(const char *name)
{
        int x = tolower(name[0])- 97;
        if (x < 0 || x > 25)
           x = 26;
        return x;
}

这基本上是根据单词的第一个字母对单词进行哈希处理。因此,以“a”开头的单词的哈希键为 0,“b”为 1,依此类推,“z”为 25。数字和符号的哈希键为 26。这提供了一个优点;您可以轻松快速地计算给定单词在哈希表中的索引位置,因为它全部按字母顺序排列,如下所示:
代码可以在这里找到:https://github.com/abhijitcpatil/general

给出以下文本作为输入:有一天,阿蒂克斯对杰姆说,“我宁愿你在后院的锡罐上开枪,但我知道你会去
在鸟之后。如果你能击中蓝鸟的话,就可以射击所有你想要的蓝鸟,但是
请记住,杀死一只知更鸟是一种罪过。”那是我唯一一次
曾听阿迪克斯说做某事是一种罪过,我问小姐
莫迪关于这件事。 “你父亲是对的,”她说。 “知更鸟不
除了创作音乐供我们欣赏之外,做一件事。他们不吃饱
人们的花园,不要在玉米仓里筑巢,他们不做一件事
但为我们唱出他们的心声。这就是为什么杀死一个人是一种罪过
知更鸟。

这将是输出:

0 --> a a about asked and a Atticus a a all after at Atticus
1 --> but but blue birds. but backyard
2 --> cribs corn can cans
3 --> do don’t don’t don’t do don’t do day
4 --> eat enjoy. except ever
5 --> for for father’s
6 --> gardens go
7 --> hearts heard hit
8 --> it’s in it. I it I it’s if I in
9 --> jays Jem
10 --> kill kill know
11 --> 
12 --> mockingbird. music make Maudie Miss mockingbird.”
13 --> nest
14 --> out one one only one
15 --> people’s
16 --> 17 --> right remember rather
18 --> sin sing said. she something sin say sin Shoot shot said
19 --> to That’s their thing they They to thing to time the That to the the tin to
20 --> us. up us
21 --> 
22 --> why was was want
23 --> 
24 --> you you you’ll you
25 --> 
26 --> “Mockingbirds ” “Your ‘em “I’d

Here's a simple hash function that I use for a hash table I built. Its basically for taking a text file and stores every word in an index which represents the alphabetical order.

int generatehashkey(const char *name)
{
        int x = tolower(name[0])- 97;
        if (x < 0 || x > 25)
           x = 26;
        return x;
}

What this basically does is words are hashed according to their first letter. So, word starting with 'a' would get a hash key of 0, 'b' would get 1 and so on and 'z' would be 25. Numbers and symbols would have a hash key of 26. THere is an advantage this provides; You can calculate easily and quickly where a given word would be indexed in the hash table since its all in an alphabetical order, something like this:
Code can be found here: https://github.com/abhijitcpatil/general

Giving the following text as input: Atticus said to Jem one day, “I’d rather you shot at tin cans in the backyard, but I know you’ll go
after birds. Shoot all the blue jays you want, if you can hit ‘em, but
remember it’s a sin to kill a mockingbird.” That was the only time I
ever heard Atticus say it was a sin to do something, and I asked Miss
Maudie about it. “Your father’s right,” she said. “Mockingbirds don’t
do one thing except make music for us to enjoy. They don’t eat up
people’s gardens, don’t nest in corn cribs, they don’t do one thing
but sing their hearts out for us. That’s why it’s a sin to kill a
mockingbird.

This would be the output:

0 --> a a about asked and a Atticus a a all after at Atticus
1 --> but but blue birds. but backyard
2 --> cribs corn can cans
3 --> do don’t don’t don’t do don’t do day
4 --> eat enjoy. except ever
5 --> for for father’s
6 --> gardens go
7 --> hearts heard hit
8 --> it’s in it. I it I it’s if I in
9 --> jays Jem
10 --> kill kill know
11 --> 
12 --> mockingbird. music make Maudie Miss mockingbird.”
13 --> nest
14 --> out one one only one
15 --> people’s
16 --> 17 --> right remember rather
18 --> sin sing said. she something sin say sin Shoot shot said
19 --> to That’s their thing they They to thing to time the That to the the tin to
20 --> us. up us
21 --> 
22 --> why was was want
23 --> 
24 --> you you you’ll you
25 --> 
26 --> “Mockingbirds ” “Your ‘em “I’d
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文