Is Boyer-Moore practical in C#?
Boyer-Moore is probably the fastest non-indexed text-search algorithm known, so I'm implementing it in C# for my Black Belt Coder website.
I had it working, and it showed roughly the expected performance improvement over String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation, sometimes by a considerable amount.
I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether? Are some algorithms just not practical in C#/.NET?
Below is the key code.
// Base for search classes
abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    public SearchBase(string pattern) { _pattern = pattern; }
    public abstract int Search(string text, int startIndex);
    public int Search(string text) { return Search(text, 0); }
}

/// <summary>
/// A simplified Boyer-Moore implementation.
///
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
    private byte[] _skipArray;

    public BoyerMoore2(string pattern)
        : base(pattern)
    {
        // TODO: To be replaced with multi-stage table
        _skipArray = new byte[0x10000];

        for (int i = 0; i < _skipArray.Length; i++)
            _skipArray[i] = (byte)_pattern.Length;
        for (int i = 0; i < _pattern.Length - 1; i++)
            _skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
    }

    public override int Search(string text, int startIndex)
    {
        int i = startIndex;

        // Loop while there's still room for search term
        while (i <= (text.Length - _pattern.Length))
        {
            // Look if we have a match at this position
            int j = _pattern.Length - 1;
            while (j >= 0 && _pattern[j] == text[i + j])
                j--;

            if (j < 0)
            {
                // Match found
                return i;
            }

            // Advance to next comparison
            i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
        }
        // No match found
        return InvalidIndex;
    }
}
EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.
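For readers comparing the two IndexOf calls from the question, here is a minimal sketch of how they differ in behavior as well as speed. The class and method names (OrdinalVersusCulture, FindLinguistic, FindOrdinal) are made up for this illustration. The culture-sensitive call applies linguistic rules, so its result for text containing ignorable characters such as the soft hyphen can depend on the runtime and culture; the ordinal call only compares UTF-16 code units, which is the same kind of comparison the BoyerMoore2 class performs.

```csharp
using System;

// Hypothetical helper names, used only for this illustration.
static class OrdinalVersusCulture
{
    // Culture-sensitive search: applies the current culture's linguistic rules.
    public static int FindLinguistic(string text, string pattern)
    {
        return text.IndexOf(pattern, StringComparison.CurrentCulture);
    }

    // Ordinal search: compares raw UTF-16 code units, with no linguistic rules.
    public static int FindOrdinal(string text, string pattern)
    {
        return text.IndexOf(pattern, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string plain = "The quick brown fox";
        Console.WriteLine(FindLinguistic(plain, "brown"));
        Console.WriteLine(FindOrdinal(plain, "brown"));

        // U+00AD (soft hyphen) may be ignored by a culture-sensitive search,
        // so the first result can vary by runtime and culture.
        string hyphenated = "a\u00ADb";
        Console.WriteLine(FindLinguistic(hyphenated, "ab"));
        // The ordinal search sees three distinct code units and finds no "ab".
        Console.WriteLine(FindOrdinal(hyphenated, "ab"));
    }
}
```

A fair benchmark would therefore compare a hand-written searcher against the ordinal overload, since both do the same amount of work per character.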
Comments (3)
Based on my own tests and the comments made here, I've concluded that String.IndexOf() performs so well with StringComparison.Ordinal because the method calls into unmanaged code that likely employs hand-optimized assembly language. I have run a number of different tests, and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.
If anyone's interested, I've written up everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.
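The kind of test described above can be reproduced with a small timing harness. What follows is only a sketch, not the author's actual test code: SearchBenchmark, NaiveSearch, Time, and BuildText are hypothetical names, and NaiveSearch is a stand-in for whatever managed searcher (such as a Boyer-Moore class) is being measured. Absolute numbers will vary with the runtime, the text, and the pattern.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical timing harness; all names are made up for this sketch.
static class SearchBenchmark
{
    // Stand-in for any managed searcher, for example a Boyer-Moore class.
    public static int NaiveSearch(string text, string pattern)
    {
        for (int i = 0; i <= text.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && text[i + j] == pattern[j])
                j++;
            if (j == pattern.Length)
                return i;
        }
        return -1;
    }

    // Runs the search repeatedly and returns the elapsed milliseconds.
    public static long Time(Func<string, string, int> search,
                            string text, string pattern, int iterations)
    {
        search(text, pattern); // warm-up call so JIT compilation is not timed
        Stopwatch watch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            search(text, pattern);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // Builds filler text from the letters a..t and appends the pattern,
    // so the only match sits at the very end of the text.
    public static string BuildText(int fillerLength, string pattern)
    {
        var builder = new StringBuilder(fillerLength + pattern.Length);
        for (int i = 0; i < fillerLength; i++)
            builder.Append((char)('a' + i % 20));
        return builder.Append(pattern).ToString();
    }

    public static void Main()
    {
        string pattern = "xyzzy";
        string text = BuildText(1000000, pattern);
        const int iterations = 50;

        Console.WriteLine("IndexOf (culture): {0} ms",
            Time((t, p) => t.IndexOf(p, StringComparison.CurrentCulture), text, pattern, iterations));
        Console.WriteLine("IndexOf (ordinal): {0} ms",
            Time((t, p) => t.IndexOf(p, StringComparison.Ordinal), text, pattern, iterations));
        Console.WriteLine("Managed search:    {0} ms",
            Time(NaiveSearch, text, pattern, iterations));
    }
}
```

Placing the match at the end forces every searcher to scan the whole text, which keeps the comparison even; a release build run outside the debugger gives more representative timings.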
My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself, and that its implementation is better than yours.
Without that flag, it has to be careful about using Boyer-Moore (and probably doesn't) because of potential issues around Unicode. In particular, supporting Unicode can cause the transition tables that Boyer-Moore uses to blow up.
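On the table-size point: the code in the question sidesteps it with a 64K-entry array, one slot per UTF-16 code unit. Another option, sketched below under my own assumptions, is a sparse table that stores entries only for characters that occur in the pattern. SparseSkipSearch is a hypothetical class, not the one from the question and not what String.IndexOf does; it keeps the question's bad-character shift logic but uses int shifts in a Dictionary, so patterns longer than 255 characters keep their full shift distance.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant for illustration; not the class from the question.
sealed class SparseSkipSearch
{
    private readonly string _pattern;
    private readonly Dictionary<char, int> _skip = new Dictionary<char, int>();

    public SparseSkipSearch(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("Pattern must not be empty.", "pattern");
        _pattern = pattern;

        // For every character except the last, store the distance from its
        // rightmost occurrence to the end of the pattern.
        for (int i = 0; i < pattern.Length - 1; i++)
            _skip[pattern[i]] = pattern.Length - i - 1;
    }

    public int Search(string text, int startIndex = 0)
    {
        int i = startIndex;
        while (i <= text.Length - _pattern.Length)
        {
            // Compare right to left at the current alignment.
            int j = _pattern.Length - 1;
            while (j >= 0 && _pattern[j] == text[i + j])
                j--;
            if (j < 0)
                return i; // match found

            // Characters missing from the table get the full pattern length.
            int skip;
            if (!_skip.TryGetValue(text[i + j], out skip))
                skip = _pattern.Length;
            i += Math.Max(skip - _pattern.Length + 1 + j, 1);
        }
        return -1; // no match found
    }
}

static class SparseSkipDemo
{
    public static void Main()
    {
        var search = new SparseSkipSearch("needle");
        Console.WriteLine(search.Search("haystack with a needle inside"));
    }
}
```

The trade-off is that a dictionary lookup is likely slower than indexing into an array, so this saves memory at the cost of time per mismatch; whether that is worthwhile would need measuring.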
I made some small changes to your code, wrote a different implementation of the Boyer-Moore algorithm, and got better results.
I got the idea for this implementation from here.
But to be honest, I expected it to be faster relative to IndexOf.
Changed code from Form1: