strstr 的优化版本(搜索具有恒定长度)
我的 C 程序有很多 strstr 函数调用。标准库 strstr 已经很快,但在我的例子中,搜索字符串的长度始终为 5 个字符。我用一个特殊的版本替换它以提高速度:
int strstr5(const char *cs, const char *ct) { while (cs[4]) { if (cs[0] == ct[0] && cs[1] == ct[1] && cs[2] == ct[2] && cs[3] == ct[3] && cs[4] == ct[4]) return 1; cs++; } return 0; }
该函数返回一个整数,因为它足以知道 ct 是否出现在 cs 中。在这种特殊情况下,我的函数比标准 strstr 更简单且更快,但我很想听听是否有人有一些可以应用的性能改进。即使是很小的改进也是受欢迎的。
摘要:
- cs 的长度 >=10,但其他方面可能会有所不同。长度之前已知(在我的函数中未使用)。 cs 的长度通常从 100 到 200。
- ct 的长度为 5
- 字符串的内容可以是任何内容
编辑:感谢您的所有回答和评论。我必须研究和测试想法,看看什么最有效。我将从 MAK 关于后缀 trie 的想法开始。
My C program had a lot of strstr function calls. The standard library strstr is already fast but in my case the search string has always length of 5 characters. I replaced it with a special version to gain some speed:
int strstr5(const char *cs, const char *ct) { while (cs[4]) { if (cs[0] == ct[0] && cs[1] == ct[1] && cs[2] == ct[2] && cs[3] == ct[3] && cs[4] == ct[4]) return 1; cs++; } return 0; }
The function returns an integer because it’s enough to know if ct occurs in cs. My function is simple and faster than standard strstr in this special case but I’m interested to hear if anybody has some performance improvements that could be applied. Even small improvements are welcome.
Summary:
- cs has length of >=10, but otherwise it can vary. Length is known before (not used in my function). Length of cs is usually from 100 to 200.
- ct has length of 5
- Content of strings can be anything
Edit: Thank you for all answers and comments. I have to study and test ideas to see what works best. I will start with MAK's idea about suffix trie.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
有几种快速的字符串搜索算法。尝试查看 Boyer-Moore (正如 Greg Hewgill 已经建议的那样), Rabin-Karp 和 KMP 算法。
如果您需要在同一大文本正文中搜索许多小模式,您还可以尝试实现 后缀树 或后缀数组。但恕我直言,这些有点难以理解和正确实施。
但要注意,这些技术非常快,但只有在涉及的字符串非常大时才会带来明显的加速。对于长度小于 1000 个字符的字符串,您可能看不到明显的加速。
编辑:
如果您一遍又一遍地搜索相同的文本(即
cs
的值在调用之间始终/经常相同),则通过使用后缀特里树(基本上),您将获得很大的加速后缀的trie)。由于您的文本只有 100 或 200 个字符,因此您可以使用更简单的 O(n^2) 方法来构建 trie,然后对其进行多次快速搜索。每次搜索只需要 5 次比较,而不是通常的 5*200 次。编辑2:
正如caf的评论所提到的,C的
strstr
算法是依赖于实现的。 glibc 使用线性时间算法,在实践中该算法应该或多或少与我提到的任何方法一样快。虽然OP的方法渐近较慢(O(N*m)而不是O(n)),但它更快可能是因为n和m(模式和文本的长度)都非常小并且它不必在 glibc 版本中进行任何长时间的预处理。There are several fast string search algorithms. Try looking at Boyer-Moore (as already suggested by Greg Hewgill), Rabin-Karp and KMP algorithms.
If you need to search for many small patterns in the same large body of text, you can also try implementing a suffix tree or a suffix array. But these are IMHO somewhat harder to understand and implement correctly.
But beware, these techniques are very fast, but only give you an appreciable speedup if the strings involved are very large. You might not see an appreciable speedup for strings less than say a 1000 characters long.
EDIT:
If you are searching on the same text over and over again (i.e. the value of
cs
is always/often the same across calls), you will get a big speedup by using a suffix trie (Basically a trie of suffixes). Since your text is as small as 100 or 200 characters, you can use the simpler O(n^2) method to build the trie and then do multiple fast searches on it. Each search would require only 5 comparisons instead of the usual 5*200.Edit 2:
As mentioned by caf's comment, C's
strstr
algorithm is implementations dependent. glibc uses a linear time algorithm which should be more or less as fast in practice as any of the methods I've mentioned. While the OP's method is asymptotically slower (O(N*m) instead of O(n) ), it is faster probably due to the fact that both n and m (the lengths of the pattern and the text) are very small and it does not have to do any of the long preprocessing in the glibc version.减少比较次数将提高搜索速度。保留字符串的运行 int 并将其与搜索项的固定 int 进行比较。如果匹配则比较最后一个字符。
添加对短 cs 的检查。
编辑:
添加了评论中的修复。谢谢。
这可以很容易地被采用以使用 64 位值。您可以将 cs[4] 和 ct[4] 存储在局部变量中,而不是假设编译器会为您执行此操作。您可以在循环之前将 4 添加到 cs 和 ct,并在循环中使用 cs[0] 和 ct[0]。
Reducing the number of comparisons will increase the speed of the search. Keep a running int of the string and compare it to a fixed int for the search term. If it matches compare the last character.
Add checks for a short cs.
Edit:
Added fixes from comments. Thanks.
This could easily be adopted to use 64 bit values. You could store cs[4] and ct[4] in local variables instead of assuming the compiler will do that for you. You could add 4 to cs and ct before the loop and use cs[0] and ct[0] in the loop.
strstr 的接口施加了一些可以打破的限制。它采用以 null 结尾的字符串,任何首先对其目标执行“strlen”的竞争对手都会失败。它不需要“状态”参数,因此设置成本不能在具有(例如)相同目标或模式的许多调用中摊销。它预计适用于广泛的输入,包括非常短的目标/模式和病理数据(考虑在“ABABABABAB...C”字符串中搜索“ABABAC”)。 libc 现在也依赖于平台。在 x86-64 世界中,SSE2 已有 7 年历史,并且使用 SSE2 的 libc 的 strlen 和 strchr 比朴素算法快 6-8 倍。在支持 SSE4.2 的 Intel 平台上,strstr 使用 PCMPESTRI 指令。但你也可以打败它。
Boyer-Moore 的(以及 Turbo BM 和 Backward Oracle Matching 等)的设置时间几乎使他们无法运行,甚至不计算空终止字符串问题。 Horspool 是一种受限的 BM,在实践中运行良好,但在边缘情况下表现不佳。我在该领域发现的最好的方法是 BNDM(“向后非确定性定向非循环词图匹配”),其实现比其名称要小:-)
以下是一些可能感兴趣的代码片段。
智能 SSE2 击败幼稚 SSE4.2,并处理空终止问题。
BNDM 实现展示了一种保持设置成本。如果您熟悉 Horspool,您会注意到相似之处,只不过 BNDM 使用位掩码而不是跳过偏移量。
我即将发布如何(有效地)解决 Horspool 和 BNDM 等后缀算法的空终止符问题。
所有好的解决方案的一个共同属性是针对不同的参数长度分成不同的算法。一个例子是 Sanmayce 的 "Railgun “函数。
strstr's interface impose some constraints that can be beaten. It takes null-terminated strings, and any competitor that first does a "strlen" of its target will lose. It takes no "state" argument, so set-up costs can't be amortized across many calls with (say) the same target or pattern. It is expected to work on a wide range of inputs, including very short targets/patterns, and pathological data (consider searching for "ABABAC" in a string of "ABABABABAB...C"). libc is also now platform-dependent. In the x86-64 world, SSE2 is seven years old, and libc's strlen and strchr using SSE2 are 6-8 time faster than naive algorithms. On Intel platforms that support SSE4.2, strstr uses the PCMPESTRI instruction. But you can beat that, too.
Boyer-Moore's (and Turbo B-M, and Backward Oracle Matching, et al) have set-up time that pretty much knock them out of the running, not even counting the null-terminated-string problem. Horspool is a restricted B-M that works well in practice, but doesn't do the edge cases well. Best I've found in that field is BNDM ("Backward Nondeterministic Directed-Acyclic-Word-Graph Matching"), whose implementation is smaller than its name :-)
Here are a couple of code snippets that might be of interest.
Intelligent SSE2 beats naive SSE4.2, and handles the null-termination problem.
A BNDM implementation shows one way of keeping set-up costs. If you're familiar with Horspool, you'll notice the similarity, except that BNDM uses bitmasks instead of skip-offsets.
I'm about to post how to solve the null-terminator problem (efficiently) for suffix algorithms like Horspool and BNDM.
A common attribute of all good solutions is splitting into different algorithms for different argument lengths. An example of is Sanmayce's "Railgun" function.
如果
cs
少于 4 个字符,您的代码可能会超出其分配范围来访问cs
。字符串搜索的常见优化是使用 Boyer-Moore 算法来开始查找在
cs
中,从ct
的 end 开始。有关算法的完整说明,请参阅链接页面。Your code may access
cs
beyond the bounds of its allocation ifcs
is shorter than 4 characters.A common optimisation for string search is to use the Boyer-Moore algorithm where you start looking in
cs
from the end of what would bect
. See the linked page for a full description of the algorithm.现代 x86 计算机上的良好实现是无与伦比的。
新的英特尔处理器有一条指令,它需要您正在检查的字符串的 16 个字节,最多 16 个字节的搜索字符串,并在一条指令中返回搜索字符串可能所在的第一个字节位置(或者如果没有) )。例如,如果您在字符串“abcdefghijklmnHexyz”中搜索“Hello”,第一条指令将告诉您字符串“Hello”可能从偏移量 14 开始(因为读取 16 个字节,处理器有字节H, e, 未知可能是“Hello”的位置。从偏移量 14 开始的下一条指令告诉我们该字符串不存在。是的,它知道尾随零字节
。 两条 指令来查找 19 个字符的字符串中是否存在 5 个字符的字符串,尝试使用任何特殊情况代码来击败它(显然,这是专门为 strstr、strcmp 和类似指令构建的)。
You won't beat a good implementation on a modern x86 computer.
New Intel processors have an instruction that takes 16 bytes of the string you are examining, up to 16 bytes of the search string, and in a single instruction returns which is the first byte position where the search string could be (or if there is none). For example if you search for "Hello" in the string "abcdefghijklmnHexyz" the first instruction will tell you that the string "Hello" might start at offset 14 (because reading 16 bytes, the processor has the bytes H, e, unknown which might be the location of "Hello". The next instruction starting at offset 14 then tells that the string isn't there. And yes, it knows about trailing zero bytes.
That's two instructions to find that a five character string is not present in a 19 character string. Try beating that with any special case code. (Obviously this is built specifically for strstr, strcmp and similar instructions).