字符串的恒定时间哈希？

发布于 2024-08-13 13:43:13 字数 970 浏览 2 评论 0原文

关于 SO 的另一个问题提出了某些语言中对字符串进行哈希处理的功能，以便在表中快速查找它们。这方面的两个例子是dictionary<>和dictionary<>。 .NET 中的存储结构和 Python 中的 {} 存储结构。其他语言当然也支持这种机制。 C++ 有它的映射，LISP 有一个等效的映射，大多数其他现代语言也是如此。

在这个问题的答案中，一位拥有 25 年编程经验的 SO 成员声称，字符串上的哈希算法可以在恒定时间内进行，他声称任何东西都可以在恒定时间内进行哈希处理。我个人的观点是，这不是真的，除非您的特定应用程序对字符串长度设置了边界。这意味着某个常数 K 将决定字符串的最大长度。

我熟悉 Rabin-Karp 算法，该算法使用散列函数进行操作，但该算法没有规定要使用的特定散列函数，作者建议的散列函数是 O(m)，其中 m 是散列字符串。

我看到了其他一些页面，例如这个页面 (http://www.cse.yorku. ca/~oz/hash.html）显示了一些哈希算法，但似乎每个算法都会迭代字符串的整个长度以得出其值。

从我对这个主题相对有限的阅读来看，大多数字符串类型的关联数组实际上是使用哈希函数创建的，该函数在幕后与某种树一起操作。这可能是指向键/值对中值元素位置的 AVL 树或红/黑树。

即使有了这种树结构，如果我们要保持 theta(log(n)) 的顺序，其中 n 是树中的元素数量，我们就需要一个恒定时间哈希算法。否则，我们会因迭代字符串而受到额外的惩罚。尽管对于包含许多字符串的索引，theta(m) 会被 theta(log(n)) 黯然失色，但如果我们处于搜索文本非常大的域中，我们就不能忽略它。

我知道后缀树/数组和 Aho-Corasick 可以将搜索降低到 theta(m)，从而增加内存消耗，但我具体要问的是，对于任意长度的字符串是否存在恒定时间哈希方法由另一位 SO 成员声称。

谢谢。

原文

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.

It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.

I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.

I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.

From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.

Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.

I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.

Thanks.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

久隐师 2024-08-20 13:43:13

哈希函数不必（也不能）为每个字符串返回唯一的值。

您可以使用前 10 个字符来初始化随机数生成器，然后使用它从字符串中提取 100 个随机字符，并对其进行哈希处理。这将是常数时间。

您也可以只返回常量值 1。严格来说，这仍然是一个哈希函数，尽管不是一个非常有用的函数。

回复收藏 0 原文

记忆里有你的影子 2024-08-20 13:43:13

一般来说，我认为任何完整的字符串哈希都必须使用字符串的每个字符，因此需要以 O(n) 的速度增长 n 个字符。不过，我认为对于实际的字符串哈希，您可以使用可以轻松达到 O(1) 的近似哈希。

考虑一个始终使用 Min(n, 20) 个字符来计算标准哈希的字符串哈希。显然，随着字符串大小的增加，它的增长速度为 O(1)。它能可靠地工作吗？这取决于您的域...

回复收藏 0 原文

云柯 2024-08-20 13:43:13

如果不冒严重哈希冲突的风险，您就无法轻松实现字符串的通用恒定时间哈希算法。

由于它是常数时间，您将无法访问字符串中的每个字符。作为一个简单的例子，假设我们采用前 6 个字符。然后有人尝试对 URL 数组进行哈希处理。 has 函数将看到每个字符串的“http://”。

对于其他字符选择方案也可能发生类似的情况。您可以根据前一个字符的值伪随机地选择字符，但是如果字符串由于某种原因具有“错误”模式并且许多字符串最终具有相同的哈希值，那么您仍然面临着失败的风险。

回复收藏 0 原文

鸩远一方 2024-08-20 13:43:13

虽然我无法想象无限长度字符串的固定时间哈希函数，但确实没有必要。

使用哈希函数背后的想法是生成哈希值的分布，使得对于所考虑的域许多字符串不会发生冲突。该密钥将允许直接访问数据存储。这两者结合起来会产生平均时间恒定的查找。

如果发生此类冲突，查找算法就会采用更灵活的查找子策略。

回复收藏 0 原文

葵雨 2024-08-20 13:43:13

当然，这是可行的，只要您确保所有字符串在将它们传递给需要散列的东西之前都被“保留”。实习是将字符串插入字符串表的过程，这样所有具有相同值的实习字符串实际上都是同一个对象。然后，您可以简单地散列指向内部字符串的（固定长度）指针，而不是散列字符串本身。

回复收藏 0 原文

初见终念 2024-08-20 13:43:13

您可能对我去年得出的以下数学结果感兴趣。

考虑将无限数量的键（例如任意长度的所有字符串的集合）散列为 {1,2,…,b} 中的数字集合的问题。随机散列通过首先在 H 函数族中随机选取一个散列函数 h 来进行。

我将证明总是有无限多个键肯定会在所有 H 函数上发生冲突，也就是说，它们对于所有哈希函数总是具有相同的哈希值。

选择任意一个哈希函数 h：至少有一个哈希值 y 使得集合 A={s:h(s)=y} 是无限的，也就是说，有无限多个字符串发生碰撞。选择任何其他哈希函数 h' 并对集合 A 中的键进行哈希。至少有一个哈希值 y' 使得集合 A'={s 在 A 中： h'(s)=y'} 是无限的，也就是说，有无限多个字符串在两个哈希函数上发生冲突。您可以多次重复此论证。重复H次。然后你就有了无限的字符串集，其中所有字符串都与所有 H 哈希函数发生冲突。 CQFD。

进一步阅读：
对可变长度字符串进行合理的散列是不可能的
http:// lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/