字符串的恒定时间哈希?
关于 SO 的另一个问题提出了某些语言中对字符串进行哈希处理的功能,以便在表中快速查找它们。这方面的两个例子是dictionary<>和dictionary<>。 .NET 中的存储结构和 Python 中的 {} 存储结构。其他语言当然也支持这种机制。 C++ 有它的映射,LISP 有一个等效的映射,大多数其他现代语言也是如此。
在这个问题的答案中,一位拥有 25 年编程经验的 SO 成员声称,字符串上的哈希算法可以在恒定时间内进行,他声称任何东西都可以在恒定时间内进行哈希处理。我个人的观点是,这不是真的,除非您的特定应用程序对字符串长度设置了边界。这意味着某个常数 K 将决定字符串的最大长度。
我熟悉 Rabin-Karp 算法,该算法使用散列函数进行操作,但该算法没有规定要使用的特定散列函数,作者建议的散列函数是 O(m),其中 m 是散列字符串。
我看到了其他一些页面,例如这个页面 (http://www.cse.yorku. ca/~oz/hash.html)显示了一些哈希算法,但似乎每个算法都会迭代字符串的整个长度以得出其值。
从我对这个主题相对有限的阅读来看,大多数字符串类型的关联数组实际上是使用哈希函数创建的,该函数在幕后与某种树一起操作。这可能是指向键/值对中值元素位置的 AVL 树或红/黑树。
即使有了这种树结构,如果我们要保持 theta(log(n)) 的顺序,其中 n 是树中的元素数量,我们就需要一个恒定时间哈希算法。否则,我们会因迭代字符串而受到额外的惩罚。尽管对于包含许多字符串的索引,theta(m) 会被 theta(log(n)) 黯然失色,但如果我们处于搜索文本非常大的域中,我们就不能忽略它。
我知道后缀树/数组和 Aho-Corasick 可以将搜索降低到 theta(m),从而增加内存消耗,但我具体要问的是,对于任意长度的字符串是否存在恒定时间哈希方法由另一位 SO 成员声称。
谢谢。
Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
哈希函数不必(也不能)为每个字符串返回唯一的值。
您可以使用前 10 个字符来初始化随机数生成器,然后使用它从字符串中提取 100 个随机字符,并对其进行哈希处理。这将是常数时间。
您也可以只返回常量值 1。严格来说,这仍然是一个哈希函数,尽管不是一个非常有用的函数。
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
一般来说,我认为任何完整的字符串哈希都必须使用字符串的每个字符,因此需要以 O(n) 的速度增长 n 个字符。不过,我认为对于实际的字符串哈希,您可以使用可以轻松达到 O(1) 的近似哈希。
考虑一个始终使用 Min(n, 20) 个字符来计算标准哈希的字符串哈希。显然,随着字符串大小的增加,它的增长速度为 O(1)。它能可靠地工作吗?这取决于您的域...
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
如果不冒严重哈希冲突的风险,您就无法轻松实现字符串的通用恒定时间哈希算法。
由于它是常数时间,您将无法访问字符串中的每个字符。作为一个简单的例子,假设我们采用前 6 个字符。然后有人尝试对 URL 数组进行哈希处理。 has 函数将看到每个字符串的“http://”。
对于其他字符选择方案也可能发生类似的情况。您可以根据前一个字符的值伪随机地选择字符,但是如果字符串由于某种原因具有“错误”模式并且许多字符串最终具有相同的哈希值,那么您仍然面临着失败的风险。
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then comes someone and tries to hash an array of URLs. The has function will see "http:/" for every single string.
Similar scenarios may occur for other characters selections schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
虽然我无法想象无限长度字符串的固定时间哈希函数,但确实没有必要。
使用哈希函数背后的想法是生成哈希值的分布,使得对于所考虑的域许多字符串不会发生冲突。该密钥将允许直接访问数据存储。这两者结合起来会产生平均时间恒定的查找。
如果发生此类冲突,查找算法就会采用更灵活的查找子策略。
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
当然,这是可行的,只要您确保所有字符串在将它们传递给需要散列的东西之前都被“保留”。实习是将字符串插入字符串表的过程,这样所有具有相同值的实习字符串实际上都是同一个对象。然后,您可以简单地散列指向内部字符串的(固定长度)指针,而不是散列字符串本身。
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
您可能对我去年得出的以下数学结果感兴趣。
考虑将无限数量的键(例如任意长度的所有字符串的集合)散列为 {1,2,…,b} 中的数字集合的问题。随机散列通过首先在 H 函数族中随机选取一个散列函数 h 来进行。
我将证明总是有无限多个键肯定会在所有 H 函数上发生冲突,也就是说,它们对于所有哈希函数总是具有相同的哈希值。
选择任意一个哈希函数 h:至少有一个哈希值 y 使得集合 A={s:h(s)=y} 是无限的,也就是说,有无限多个字符串发生碰撞。选择任何其他哈希函数 h' 并对集合 A 中的键进行哈希。至少有一个哈希值 y' 使得集合 A'={s 在 A 中: h'(s)=y'} 是无限的,也就是说,有无限多个字符串在两个哈希函数上发生冲突。您可以多次重复此论证。重复H次。然后你就有了无限的字符串集,其中所有字符串都与所有 H 哈希函数发生冲突。 CQFD。
进一步阅读:
对可变长度字符串进行合理的散列是不可能的
http:// lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/
如果您使用绳索,您希望渐近小于线性散列时间 而不是字符串,并具有允许您跳过一些计算的共享。但显然哈希函数无法分离它未读取的输入,因此我不会太认真地对待“所有内容都可以在恒定时间内进行哈希处理”。
哈希函数的质量和所需的计算量之间的折衷是任何可能的,并且长字符串上的哈希函数无论如何都必须发生冲突。
您必须确定,如果哈希函数仅查看前缀,则算法中可能出现的字符串是否会经常发生冲突。
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.