博耶摩尔寻找小钥匙
首先,我对算法知之甚少,所以请耐心等待。
据我了解,Boyer Moore 算法在长密钥下速度最快。如果我有一个非常短的密钥(例如 10 个字符)和大量要搜索的文本(超过 10,000 个字符)怎么办? Boyer Moore 会是这种情况下的最佳搜索算法吗?
如果不是的话会是什么?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
根据字符串搜索算法,“Boyer–Moore 字符串搜索算法已成为实用字符串搜索文献。”它并不总是最快的,但总的来说这是可行的方法。
当您谈论像 10,000 个字符这样的小型文本缓冲区时,Knuth-Morris-Pratt 和 Boyer-Moore 的运行时间非常接近。当在现代计算机上运行时,即使是简单的字符串搜索在 10K 缓冲区上也会快得令人眼花缭乱。我怀疑您会发现 KMP 和 Boyer-Moore 两者在 10,000 个字符缓冲区中搜索 10 个字符的字符串之间的差异约为纳秒。
这种情况下最好的搜索算法是什么?这将取决于您需要调用它的频率。如果它每秒被调用几次(最多),我可能会编写一个简单的搜索并保留它。与程序的运行时间相比,Boyer-Moore 搜索和在那个小缓冲区上的朴素搜索之间的差异是微不足道的,并且您的优化工作最好花在其他地方。如果我必须每秒调用它数百或数千次,我会花时间编写优化的 Boyer-Moore 搜索。
According to String searching algorithm, "The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search literature." It's not always the fastest, but in general it's the way to go.
Knuth-Morris-Pratt and Boyer-Moore are very close in running time when you're talking about a small text buffer like 10,000 characters. Even a naive string search will be blindingly fast on a 10K buffer when run on a modern computer. I suspect you'll find that the difference between the KMP and Boyer-Moore two searching for a 10-character string in a 10,000 character buffer will be on the order of nanoseconds.
The best search algorithm in this scenario? That's going to depend on how often you need to call it. If it's something that gets called a few times a second (at most), I'd probably write a naive search and leave it at that. The difference between Boyer-Moore and naive search on that small buffer would be insignificant compared to the running time of your program, and your optimization effort would be better spent somewhere else. If I had to call it hundreds or thousands of times a second, I'd take the time to write an optimized Boyer-Moore search.
要回答您的问题,您只需访问一个链接: http ://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C
事实上最快的文本搜索器的主页是这样的:
http://www.sanmayce.com/Railgun/index.html
> ;如果我有一个非常短的密钥(例如 10 个字符)和大量要搜索的文本(超过 10,000 个字符)怎么办?
正是这个关键范围(10-11 个字符)在我的重型 (1TB) strstr 摊牌中进行了测试。在 300,000 个单词中搜索了 400,000 个单词,逐一搜索,这对于 strstr 函数来说是一个极好的负载!
>Boyer Moore 会是这种情况下的最佳搜索算法吗?
根据我的测试,最快的文本搜索器(使用 Microsoft V16 /Ox)是 Railgun_Quadruplet_7Gulliver,但当使用 /Ox 和 Intel 12.1 时,它是第二好的,您可以自己测试一下,见下文。
如果您有一台速度很快的机器,比如 i7 37*,我想看看您最新的控制台基准测试包的结果(测试 Microsoft v16 与 Intel 12.1 编译器):
http://www.sanmayce.com/Downloads/_KAZE_Benchmark_2013.zip
在我的 T7500 上测试2200Mhz笔记本:
使用32位和64位代码时有一些起伏也很重要微软与微软之间的利润率英特尔。
Gulliver的“引擎”是由我调校的2号BM-Horspool。
正如您在我简陋的笔记本电脑上看到的那样,Gulliver 以 1898MB/s 的速度搜索模式(平均模式长度:11),甚至是超级漂亮的 BNDM在这里弯曲膝盖。
To answer your question you need to visit one link only: http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C
In fact the homepage of fastest textual searcher is this:
http://www.sanmayce.com/Railgun/index.html
>So what if I have a very shor key (example 10 chars), and alot of text to search (over 10,000 chars).
Exactly this key range (10-11 chars) is tested in my heavy (1TB) strstr showdown. There 400,000 words are searched in 300,000 words, ONE-BY-ONE, an excellent load for a strstr function!
>Would Boyer Moore be the best search algorithm for this scenario?
According to my tests the fastest text searcher (using Microsoft V16 /Ox) is Railgun_Quadruplet_7Gulliver, yet it is second best when /Ox is used and Intel 12.1, you can test it yourself, see below.
If you have a fast machine, say i7 37*, I would like to see your results of my latest console benchmark package (testing Microsoft v16 vs Intel 12.1 compilers):
http://www.sanmayce.com/Downloads/_KAZE_Benchmark_2013.zip
Test on my T7500 2200Mhz notebook:
There are some up-and-downs when 32bit and 64bit code is used also significant margins between Microsoft & Intel.
Gulliver's "engine" is order 2 BM-Horspool tuned by me.
As you can see on my humble laptop Gulliver searches for patterns (Average Pattern Length: 11) at 1898MB/s, even the super beautiful BNDM bends a knee here.