Parallelizable hashing algorithm where size and order of sub-strings is irrelevant

Posted 2024-11-28 20:55:43

EDIT

Here is the problem I am trying to solve:

I have a string broken up into multiple parts. These parts are not of equal or predictable length. Each part will have a hash value. When I concatenate parts, I want to be able to use the hash values of the individual parts to quickly get the hash value of the concatenation. In addition, the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.

Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.

I am willing to accept an elevated collision rate, so long as it is not too elevated. I am also OK with a slightly slower algorithm, as it is hardly noticeable on small strings and is done in parallel for large strings.

I am familiar with a few hashing algorithms; however, I currently have a use case for a hash algorithm with the property that the sum of two hashes is equal to the hash of the two items concatenated.

Requirements/givens

This algorithm will be hashing byte strings with a length of at least 1 byte
hash("ab") = hash('a') + hash('b')
Collisions between strings with the same characters in a different order are OK
The generated hash should be an integer of native size (usually 32/64 bits)
A string may contain any byte value from 0-255 (length is known, not \0 terminated)
The ASCII alphanumeric characters will be by far the most used
A disproportionate number of strings will be 1-8 ASCII characters
A very tiny percentage of the strings will actually contain bytes with values at or above 127

If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.

I am thinking the simplest way to achieve this is:

Any byte's hash should be its value, normalized to <128 (if >=128 subtract 128)
To get the hash of a string you normalize each byte to <128 and add it to the key
Depending on key size I may need to limit how many characters are used to hash to avoid overflow
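
The three steps above could be sketched in C++ roughly as follows. This is only an illustration under assumptions: the names byte_hash and additive_hash are hypothetical, "normalized to <128" is read as clearing the top bit (the same as subtracting 128 from any byte of 128 or more), and a 64-bit key is assumed so that overflow is not a practical concern.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper: the hash of one byte is its value with the top bit
// cleared, so every byte contributes a value in the range 0..127.
constexpr std::uint64_t byte_hash(unsigned char b) {
    return b & 0x7Fu;
}

// Hypothetical helper: the hash of a string is the sum of its per-byte hashes.
// Because addition is commutative and associative, the result does not depend
// on how the string is split into chunks or on the order of those chunks.
constexpr std::uint64_t additive_hash(std::string_view s) {
    std::uint64_t key = 0;
    for (unsigned char b : s) {
        key += byte_hash(b);
    }
    return key;
}
```

With these definitions, additive_hash("ab") equals additive_hash("a") + additive_hash("b"), and "ab" and "ba" collide, which the requirements allow. Each byte adds at most 127, so a 32-bit key could only overflow after tens of millions of bytes.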

颜漓半夏 2024-12-05 20:55:43

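I don't see anything wrong with simply adding up each (unsigned) byte value to create a hash that is just the sum of all the characters. There is nothing wrong with overflow either: even if you reach the 32/64-bit limit (and it would have to be a very, very long string to do so), overflowing into negative numbers does not matter in two's-complement arithmetic. As this is a linear process, it doesn't matter how you split your string.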
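
A minimal C++ sketch of this idea, with hypothetical function names sum_hash and combine. It assumes an unsigned 32-bit key rather than a signed one: in C++, unsigned arithmetic is defined to wrap modulo 2^32, whereas signed overflow is undefined behavior, so the unsigned type gives the harmless wraparound described above.

```cpp
#include <cstdint>
#include <string_view>

// Sum of the unsigned byte values of s. std::uint32_t arithmetic wraps
// modulo 2^32, so an overflow simply discards the high bits.
inline std::uint32_t sum_hash(std::string_view s) {
    std::uint32_t h = 0;
    for (unsigned char b : s) {
        h += b;
    }
    return h;
}

// Combining two partial hashes is one wrapping addition, so chunks can be
// hashed independently and merged in any order.
inline std::uint32_t combine(std::uint32_t a, std::uint32_t b) {
    return a + b;
}
```

For example, hashing "hello world" as a whole gives the same value as combining the hashes of "world", "hello" and " " in that order.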


只是在用心讲痛 2024-12-05 20:55:43

This is a very old question, but kind of interesting, so adding an answer 13 years later....

hash("ab") = hash('a') + hash('b')

The most detailed information hash can propagate/return is an array of counts of how many times each byte-value has been seen. The size of this array will be bigger than the final 32/64-bit hash value that you want, but you'll get a significantly better overall hash if you preserve and tally the array information then do a single final higher-quality hash on it to produce the 32- or 64-bit hash.

To illustrate, say you hash "a" and "b" and "c" - a naive addition (as in Adrian's answer) would produce the same overall hash as hashing "b" 3 times. Very collision prone. But, if you're effectively hashing distinct arrays ['a']=1, ['b']=1, ['c']=1 (and other elements 0) vs ['b']=3, you can produce distinct hash values. How do you hash from a 256-counter array to a 32- or 64-bit value? To illustrate in C++:

#include <cstdint>
#include <functional>
#include <string_view>

using Counter = uint32_t; // or 16, 64 as needed...
Counter counters[256]{};
// ... "hash" substrings, accumulating counts in "counters" ...

// if your C++ implementation has a decent string_view hash,
// e.g. GCC/Clang using MURMUR32, this is easy...
std::string_view sv{reinterpret_cast<const char*>(counters), sizeof counters};
auto overall_hash = std::hash<std::string_view>{}(sv);
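
To show how the per-chunk tallies might be merged before the final hash, here is a slightly fuller sketch. The type ByteCounts and the function count_bytes are hypothetical names introduced for illustration, and the final step assumes, as above, that the standard library's std::hash<std::string_view> is of reasonable quality.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>

// Hypothetical type: occurrence counts per byte value for all data seen so far.
struct ByteCounts {
    std::array<std::uint32_t, 256> counts{};

    // Tally one chunk; chunks may arrive in any order.
    void add(std::string_view chunk) {
        for (unsigned char b : chunk) {
            ++counts[b];
        }
    }

    // Merge the tallies of a chunk that was counted separately.
    void merge(const ByteCounts& other) {
        for (std::size_t i = 0; i < counts.size(); ++i) {
            counts[i] += other.counts[i];
        }
    }

    // Final step: hash the raw bytes of the counter array once.
    std::size_t finalize() const {
        std::string_view raw{reinterpret_cast<const char*>(counts.data()),
                             counts.size() * sizeof(std::uint32_t)};
        return std::hash<std::string_view>{}(raw);
    }
};

// Convenience helper: count the bytes of a single string.
inline ByteCounts count_bytes(std::string_view s) {
    ByteCounts c;
    c.add(s);
    return c;
}
```

Counting "c", "a" and "b" separately and merging them yields the same counter array, and therefore the same final hash, as counting "abc" in one go, while "bbb" produces a different array. The trade-off is that the intermediate per-chunk value is a 1 KiB array rather than a native-size integer, and the final value depends on the standard library implementation, so it should not be assumed stable across platforms.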
