Fastest possible lookup of string keys from a known key set
Consider a lookup function with the following signature, which needs to return an integer for a given string key:
int GetValue(string key) { ... }
Consider furthermore that the key-value mappings, numbering N, are known in advance when the source code for the function is being written, e.g.:
// N=3
{ "foo", 1 },
{ "bar", 42 },
{ "bazz", 314159 }
So a valid (but not perfect!) implementation for the function for the input above would be:
int GetValue(string key)
{
switch (key)
{
case "foo": return 1;
case "bar": return 42;
case "bazz": return 314159;
}
// Doesn't matter what we do here, control will never come to this point
throw new Exception();
}
It is also known in advance exactly how many times (C>=1) the function will be called at run-time for every given key. For example:
C["foo"] = 1;
C["bar"] = 1;
C["bazz"] = 2;
The order of such calls is not known, however. E.g. the above could describe the following sequence of calls at run-time:
GetValue("foo");
GetValue("bazz");
GetValue("bar");
GetValue("bazz");
or any other sequence, provided the call counts match.
There is also a restriction M, specified in whatever units are most convenient, defining the upper memory bound of any lookup tables and other helper structures that can be used by GetValue (the structures are initialized in advance; that initialization is not counted against the complexity of the function). For example, M=100 chars, or M=256 sizeof(object reference).
The question is, how to write the body of GetValue such that it is as fast as possible - in other words, such that the aggregate time of all GetValue calls (note that we know the total count, per everything above) is minimal, for given N, C and M?
The algorithm may require a reasonable minimal value for M, e.g. M >= char.MaxValue. It may also require that M be aligned to some reasonable boundary - for example, that it may only be a power of two. It may also require that M be a function of N of a certain kind (for example, it may allow valid M=N, or M=2N, ...; or valid M=N, or M=N^2, ...; etc).
The algorithm can be expressed in any suitable language or other form. For runtime performance constraints on the generated code, assume that the generated code for GetValue will be in C#, VB or Java (really, any language will do, so long as strings are treated as immutable arrays of characters - i.e. O(1) length and O(1) indexing, and no other data computed for them in advance). Also, to simplify this a bit, answers which assume that C=1 for all keys are considered valid, though those answers which cover the more general case are preferred.
Some musings on possible approaches
The obvious first answer to the above is using a perfect hash, but generic approaches to finding one seem to be imperfect. For example, one can easily generate a table for a minimal perfect hash using Pearson hashing for the sample data above, but then the input key would have to be hashed for every call to GetValue, and a Pearson hash necessarily scans the entire input string. But all sample keys actually differ in their third character, so only that character can be used as the input for the hash instead of the entire string. Furthermore, if M is required to be at least char.MaxValue, then the third character itself becomes a perfect hash.
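For example, a minimal sketch of that special case, assuming M >= char.MaxValue and that only the sample keys are ever passed in (the table is filled in advance, so initialization is not counted against GetValue):

// Hypothetical sketch: direct indexing by the third character.
// Valid only for the sample keys { "foo", "bar", "bazz" }.
static readonly int[] Table = new int[char.MaxValue + 1];

static void Init()
{
    Table['o'] = 1;       // "foo"
    Table['r'] = 42;      // "bar"
    Table['z'] = 314159;  // "bazz"
}

static int GetValue(string key)
{
    return Table[key[2]]; // one O(1) index, one table load
}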
For a different set of keys this may no longer be true, but it may still be possible to reduce the amount of characters considered before the precise answer can be given. Furthermore, in some cases where a minimal perfect hash would require inspecting the entire string, it may be possible to reduce the lookup to a subset, or otherwise make it faster (e.g. a less complex hashing function?) by making the hash non-minimal (i.e. M > N) - effectively sacrificing space for the sake of speed.
It may also be that traditional hashing is not such a good idea to begin with, and it's easier to structure the body of GetValue as a series of conditionals, arranged such that the first checks for the "most variable" character (the one that varies across most keys), with further nested checks as needed to determine the correct answer. Note that "variance" here can be influenced by the number of times each key is going to be looked up (C). Furthermore, it is not always readily obvious what the best structure of branches should be - it may be, for example, that the "most variable" character only lets you distinguish 10 keys out of 100, but for the remaining 90 that one extra check is unnecessary to distinguish between them, and on average (considering C) there are more checks per key than in a different solution which does not start with the "most variable" character. The goal then is to determine the perfect sequence of checks.
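As a hedged illustration of such a conditional structure (hand-built for the three sample keys, where a single check happens to suffice; a real generator would have to derive the branch order):

static int GetValue(string key)
{
    // The third character alone distinguishes the three sample keys.
    switch (key[2])
    {
        case 'o': return 1;       // "foo"
        case 'r': return 42;      // "bar"
        default:  return 314159;  // "bazz"
    }
}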
Comments (8)
You could use a Boyer–Moore search, but I think that a trie would be a much more efficient method. You can modify the trie to collapse words as a key's hit count reaches zero, thus reducing the number of searches you have to do the farther down the line you get. The biggest benefit you would get is that you are doing array lookups on the indexes, which is much faster than a comparison.
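A hedged sketch of the hit-count bookkeeping behind that idea, with a dictionary standing in for the trie (actual branch collapsing is omitted; the counts are the C values from the question):

using System.Collections.Generic;

static Dictionary<string, (int Value, int Remaining)> map =
    new Dictionary<string, (int, int)>
    {
        ["foo"]  = (1, 1),
        ["bar"]  = (42, 1),
        ["bazz"] = (314159, 2),
    };

static int GetValue(string key)
{
    var (value, remaining) = map[key];
    if (remaining == 1)
        map.Remove(key);                    // key exhausted - prune it
    else
        map[key] = (value, remaining - 1);  // one fewer call to come
    return value;
}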
You've talked about a memory limitation when it comes to precomputation - is there also a time limitation?
I would consider a trie, but one where you didn't necessarily start with the first character. Instead, find the index which will cut down the search space most, and consider that first. So in your sample case ("foo", "bar", "bazz") you'd take the third character, which would immediately tell you which string it was. (If we know we'll always be given one of the input words, we can return as soon as we've found a unique potential match.)
Now assuming that there isn't a single index which will get you down to a unique string, you need to determine the character to look at after that. In theory you precompute the trie to work out, for each branch, what the optimal character to look at next is (e.g. "if the third character was 'a', we need to look at the second character next; if it was 'o' we need to look at the first character next"), but that potentially takes a lot more time and space. On the other hand, it could save a lot of time - because having gone down one character, each of the branches may have an index to pick which will uniquely identify the final string, but be a different index each time. The amount of space required by this approach would depend on how similar the strings were, and might be hard to predict in advance. It would be nice to be able to dynamically do this for all the trie nodes you can, but then when you find you're running out of construction space, determine a single order for "everything under this node". (So you don't end up storing a "next character index" on each node underneath that node, just the single sequence.) Let me know if this isn't clear, and I can try to elaborate...
How you represent the trie will depend on the range of input characters. If they're all in the range 'a'-'z' then a simple array would be incredibly fast to navigate, and reasonably efficient for trie nodes where there are possibilities for most of the available options. Later on, when there are only two or three possible branches, that becomes wasteful in memory. I would suggest a polymorphic Trie node class, such that you can build the most appropriate type of node depending on how many sub-branches there are.
None of this performs any culling - it's not clear how much can be achieved by culling quickly. One situation where I can see it helping is when the number of branches from one trie node drops to 1 (because of the removal of a branch which is exhausted), that branch can be eliminated completely. Over time this could make a big difference, and shouldn't be too hard to compute. Basically as you build the trie you can predict how many times each branch will be taken, and as you navigate the trie you can subtract one from that count per branch when you navigate it.
That's all I've come up with so far, and it's not exactly a full implementation - but I hope it helps...
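A hedged sketch of the central structure - a trie node that records which character position to inspect next, rather than assuming left-to-right order (types and names are illustrative; construction and the culling described above are omitted):

using System.Collections.Generic;

abstract class Node { }

sealed class Leaf : Node
{
    public int Value;
}

sealed class Branch : Node
{
    public int CharIndex;   // which character position to inspect at this node
    public Dictionary<char, Node> Children = new Dictionary<char, Node>();
}

static int GetValue(Node root, string key)
{
    Node node = root;
    while (node is Branch b)        // descend until a leaf is reached
        node = b.Children[key[b.CharIndex]];
    return ((Leaf)node).Value;
}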
Is a binary search of the table really so awful? I would take the list of potential strings and "minimize" them, then sort them, and finally do a binary search upon the block of them.
By minimize I mean reducing them to the minimum they need to be, kind of a custom stemming.
For example if you had the strings: "alfred", "bob", "bill", "joe", I'd knock them down to "a", "bi", "bo", "j".
Then put those into a contiguous block of memory (a sketch of the idea follows below). Ideally the compiler would do all this for you automatically, but I can't say if that's true or not.
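A hedged reconstruction of the idea in C#, using a sorted array of minimized prefixes in place of the flat memory block described (the values here are hypothetical):

static readonly string[] Prefixes = { "a", "bi", "bo", "j" };  // minimized, sorted
static readonly int[] Values = { 11, 22, 33, 44 };             // hypothetical values

static int GetValue(string key)
{
    int lo = 0, hi = Prefixes.Length - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        string p = Prefixes[mid];
        int cmp = 0;
        // Compare only as far as the stored prefix reaches;
        // exhausting the prefix counts as a match.
        for (int i = 0; i < p.Length && cmp == 0; i++)
            cmp = key[i] - p[i];
        if (cmp == 0) return Values[mid];
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    throw new Exception();  // unreachable for known keys
}

Note the comparison loop never reads past the stored prefix, which is what keeps the keys "as short as possible".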
Now you can bsearch that table, and the keys are as short as possible. If you hit the end of the key, you match. If not, then follow the standard bsearch algorithm.
The goal is to get all of the data close together and keep the code itty bitty so that it all fits in to the CPU cache. You can process the key from the program directly, no pre-processing or adding anything up.
For a reasonably large number of keys that are reasonably distributed, I think this would be quite fast. It really depends on the number of strings involved. For smaller numbers, the overhead of computing hash values etc. is more than searching something like this. For larger values, it's worth it. Just what those numbers are depends on the algorithms involved, etc.
This, however, is likely the smallest solution in terms of memory, if that's important.
This also has the benefit of simplicity.
Addenda:
You don't have any specifications on the inputs beyond 'strings'. There's also no discussion about how many strings you expect to use, their length, their commonality or their frequency of use. These can perhaps all be derived from the "source", but not planned upon by the algorithm designer. You're asking for an algorithm that creates something suitable for a small program that happens to use only one key all the time, all the way up to something that creates a perfect hash algorithm for millions of strings. That's a pretty tall order.
Any design going after "squeezing every single bit of performance possible" needs to know more about the inputs than "any and all strings". That problem space is simply too large if you want it the fastest possible for any condition.
An algorithm that handles strings with extremely long identical prefixes might be quite different than one that works on completely random strings. The algorithm could say "if the key starts with "a", skip the next 100 chars, since they're all a's".
But if these strings are sourced by human beings, and they're using long strings of the same letters, and not going insane trying to maintain that data, then when they complain that the algorithm is performing badly, you reply that "you're doing silly things, don't do that". But we don't know the source of these strings either.
So, you need to pick a problem space to target the algorithm. We have all sorts of algorithms that ostensibly do the same thing because they address different constraints and work better in different situations.
Hashing is expensive, and laying out hashmaps is expensive. If there's not enough data involved, there are better techniques than hashing. If you have a large memory budget, you could make an enormous state machine, based upon N states per node (N being your character set size - which you don't specify - BAUDOT? 7-bit ASCII? UTF-32?). That will run very quickly, unless the amount of memory consumed by the states smashes the CPU cache or squeezes out other things.
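A hedged sketch of such a state machine, assuming 7-bit ASCII (one of the character sets the answer mentions): a trie flattened into one transition array, built in advance. All names are illustrative.

using System.Linq;

const int AlphabetSize = 128;   // assuming 7-bit ASCII
static int[] next;              // next[state * AlphabetSize + c] = next state
static int[] value;             // answer for states reached by a full key

static void Build((string Key, int Val)[] map)
{
    int maxStates = 2 + map.Sum(p => p.Key.Length);
    next = new int[maxStates * AlphabetSize];
    value = new int[maxStates];
    int states = 2;             // state 0 = dead state, state 1 = start state
    foreach (var (key, val) in map)
    {
        int state = 1;
        foreach (char c in key)
        {
            int t = next[state * AlphabetSize + c];
            if (t == 0) next[state * AlphabetSize + c] = t = states++;
            state = t;
        }
        value[state] = val;
    }
}

static int GetValue(string key)
{
    int state = 1;
    for (int i = 0; i < key.Length; i++)
        state = next[state * AlphabetSize + key[i]];
    return value[state];
}

Each lookup then touches exactly key.Length array cells, at the cost of AlphabetSize ints per state - the cache-pressure tradeoff the answer warns about.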
You could possibly generate code for all of this, but you may run into code-size limits (you don't say what language either - Java has a 64K method bytecode limit, for example).
But you don't specify any of these constraints. So, it's kind of hard to get the most performant solution for your needs.
What you want is a look-up table of look-up tables.
If memory cost is not an issue you can go all out.
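A hedged sketch of what "going all out" might look like: one full-width table per character position, chained together. It assumes M is large enough for one table per level and that no key is a proper prefix of another; names are illustrative.

// Each level is an object[char.MaxValue + 1]; an entry is either
// another level (object[]) or a boxed int value.
static object[] root = new object[char.MaxValue + 1];

static void Add(string key, int value)
{
    object[] level = root;
    for (int i = 0; i < key.Length - 1; i++)
    {
        var child = level[key[i]] as object[];
        if (child == null)
            level[key[i]] = child = new object[char.MaxValue + 1];
        level = child;
    }
    level[key[key.Length - 1]] = value;  // boxed int at the last character
}

static int GetValue(string key)
{
    object[] level = root;
    for (int i = 0; i < key.Length - 1; i++)
        level = (object[])level[key[i]];
    return (int)level[key[key.Length - 1]];
}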
I reckon that it's all about finding the right hash function. As long as you know what the key-value relationship is in advance, you can do an analysis to try and find a hash function to meet your requirements. Taking the example you've provided, treat the input strings as binary integers:
The last column present in all of them is different in each. Analyse further:
Bit-shift to the right twice, discarding overflow, you get a distinct value for each of them:
Bit-shift to the right twice again, adding the overflow to a new buffer:
foo = 0010
bar = 0000
bazz = 0001
You can use those to query a static 3-entry lookup table. I reckon this highly personal hash function would take 9 very basic operations to get the nibble (2), bit-shift (2), bit-shift and add (4) and query (1), and a lot of these operations can be compressed further through clever assembly usage. This might well be faster than taking run-time information into account.
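A hedged C# sketch in the same spirit (the constants here are re-derived for the sample keys rather than taken from the analysis above): shifting the last character right twice and keeping the low two bits already yields a distinct index per key ('o' -> 3, 'r' -> 0, 'z' -> 2).

static readonly int[] Table = new int[4];

static void Init()
{
    Table[('o' >> 2) & 3] = 1;       // "foo"  -> index 3
    Table[('r' >> 2) & 3] = 42;      // "bar"  -> index 0
    Table[('z' >> 2) & 3] = 314159;  // "bazz" -> index 2
}

static int GetValue(string key)
{
    return Table[(key[key.Length - 1] >> 2) & 3];
}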
Have you looked at TCB? Perhaps the algorithm used there can be used to retrieve your values. It sounds a lot like the problem you are trying to solve. And from experience I can say TCB is one of the fastest key store lookups I have used. It is a constant lookup time, regardless of the number of keys stored.
Consider using the Knuth–Morris–Pratt algorithm.
Pre-process the given map into one large string (a sketch of the idea follows below). Per KMP, preprocessing for that string will take O(length). Searching with any word/key will then take O(w) complexity, where w is the length of the word/key. You will need to make 2 modifications to the KMP algorithm, so that the keys appear in the string in order. I hope this gives a good hint.
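A hedged sketch of the preprocessing half of this idea; the search below uses a plain scan as a stand-in for the modified KMP, and the sentinel character and names are assumptions:

using System;

// Keys concatenated in order, separated by a sentinel that occurs in no key;
// Values holds the answers in the same order.
static readonly string Haystack = "\u0000foo\u0000bar\u0000bazz\u0000";
static readonly int[] Values = { 1, 42, 314159 };

static int GetValue(string key)
{
    // Stand-in for the modified KMP search: find the key bounded by sentinels.
    int pos = Haystack.IndexOf("\u0000" + key + "\u0000", StringComparison.Ordinal);
    int ordinal = 0;
    for (int i = 0; i < pos; i++)        // count sentinels before the match
        if (Haystack[i] == '\u0000') ordinal++;
    return Values[ordinal];
}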
Here's a feasible approach to determine the smallest subset of chars to target for your hash routine:
let:
k be the amount of distinct chars across all your keywords
c be the max keyword length
n be the number of keywords
in your example (padded shorter keywords w/spaces):
k = 7 (f,o,b,a,r,z, ), c = 4, n = 3
We can use this to compute a lower bound for our search. We need at least log_k(n) chars to uniquely identify a keyword, if log_k(n) >= c then you'll need to use the whole keyword and there's no reason to proceed.
Next, eliminate one column at a time and check if there are still n distinct values remaining. Use the distinct chars in each column as a heuristic to optimize our search:
Eliminate columns with the fewest distinct chars first. If you have <= log_k(n) columns remaining you can stop. Optionally you could randomize a bit and eliminate the 2nd-lowest distinct column, or try to recover if the eliminated column results in fewer than n distinct words. This algorithm is roughly O(n!), depending on how much you try to recover. It's not guaranteed to find an optimal solution, but it's a good tradeoff.
Once you have your subset of chars, proceed with the usual routines for generating a perfect hash. The result should be an optimal perfect hash.
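A hedged sketch of the greedy variant of this elimination (no randomization or recovery; names are illustrative). For { "foo", "bar", "bazz" } it returns column 2 alone:

using System.Collections.Generic;
using System.Linq;

static List<int> DistinguishingColumns(string[] keys)
{
    int c = keys.Max(k => k.Length);
    string[] padded = keys.Select(k => k.PadRight(c)).ToArray();  // pad w/spaces
    var columns = Enumerable.Range(0, c).ToList();

    // Try to drop columns in order of ascending distinct-character count.
    foreach (int col in columns
        .OrderBy(i => padded.Select(k => k[i]).Distinct().Count()).ToList())
    {
        var remaining = columns.Where(i => i != col).ToList();
        // Drop the column only if all n projections stay distinct.
        bool stillDistinct = padded
            .Select(k => new string(remaining.Select(i => k[i]).ToArray()))
            .Distinct().Count() == keys.Length;
        if (stillDistinct) columns = remaining;
    }
    return columns;  // the character positions the hash still needs
}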