Regex performance across languages or libraries
I couldn't find anything on this subject, so I wonder whether anyone has compared the speed of regex matching across different languages. I would like to know which language evaluates regexes fastest, because my current project needs to evaluate an enormous number of regular expressions constantly. The choice of language will be based mainly on this performance.
My guess is that C/C++ will naturally be faster, but I want to avoid it if possible, and I'm not sure I'm right. For example, a C# library may call native code through P/Invoke, in which case the speed difference may be negligible. But I don't know which library to choose, or whether I need to write a wrapper around a C++ library (and if so, which one?).
Comments (4)
What kind of regexes? Will they use features like lookaheads, lookbehinds, backreferences, reluctant quantifiers, atomic groups, or possessive quantifiers?
Other responders have linked to the regex-dna benchmark, but it only uses the most basic features shared by all regex flavors, like the Kleene star (*) and alternation (|). So, while the GNU C/C++ implementations seem to be the clear winners, they won't do you any good if you need any of the features I listed above.
Another consideration is Unicode support. If you're dealing with actual text (and not data represented as text, as in the regex-dna benchmark), you should use a regex flavor with good Unicode support.
I suggest you look into C#. The .NET regex flavor does not have a reputation for being slow (which is the only sensible thing you can say about regex speeds, IMO), and for performance-critical applications it provides the option of compiling directly to byte code for a substantial performance boost.
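One way to check whether a candidate engine accepts the features listed above is to try compiling a small probe pattern for each one. The following is a minimal sketch in Python, used here only for illustration; the probe patterns and the names `FEATURE_PROBES`, `supports`, and `probe_features` are my own assumptions, not part of any library. The same idea carries over to any other engine you are considering.

```python
import re

# Hypothetical probe patterns, one per feature named in the answer.
FEATURE_PROBES = {
    "lookahead": r"foo(?=bar)",
    "lookbehind": r"(?<=foo)bar",
    "backreference": r"(\w+) \1",
    "reluctant quantifier": r"<.+?>",
    "atomic group": r"(?>a+)b",
    "possessive quantifier": r"a++b",
}


def supports(pattern: str) -> bool:
    """Return True if this engine compiles the pattern without error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def probe_features() -> dict:
    """Map each feature name to whether its probe pattern compiles."""
    return {name: supports(p) for name, p in FEATURE_PROBES.items()}


if __name__ == "__main__":
    for name, ok in probe_features().items():
        print(f"{name}: {'supported' if ok else 'not supported'}")
```

The result for atomic groups and possessive quantifiers depends on the Python version (the standard `re` module only accepts them in newer releases), which is exactly the kind of difference this probe is meant to surface.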
There is a regex benchmark here: http://shootout.alioth.debian.org/u64q/benchmark.php?test=regexdna&lang=all&box=1
But the kinds of regex you are going to use could matter a lot more than your choice of engine. Some engines handle certain kinds better than others, and some kinds of regex are slow no matter what the engine (for example, certain regexes can require exponential time).
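To illustrate the exponential-time case, here is a small sketch using Python's `re` module, which is a backtracking engine. The pattern `(a+)+$` and the helper names `time_match` and `compare` are my own choices for illustration. The nested quantifier gives the engine many ways to split a run of `a` characters, and a trailing `b` forces it to try them all before failing; `a+$` accepts the same inputs without the nesting.

```python
import re
import time

# Nested quantifiers: a classic shape that can backtrack exponentially.
SLOW = re.compile(r"(a+)+$")
# Accepts the same inputs without the nesting.
FAST = re.compile(r"a+$")


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


def compare(lengths=(10, 14, 18)):
    """Time both patterns on failing inputs of increasing length."""
    rows = []
    for n in lengths:
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        slow_result, slow_s = time_match(SLOW, text)
        fast_result, fast_s = time_match(FAST, text)
        rows.append((n, slow_result, fast_result, slow_s, fast_s))
    return rows


if __name__ == "__main__":
    for n, _, _, slow_s, fast_s in compare():
        print(f"n={n}: nested {slow_s * 1000:.3f} ms, flat {fast_s * 1000:.3f} ms")
```

Keep the lengths small when experimenting: in a backtracking engine, each extra character roughly doubles the work for the nested pattern.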
I suggest evaluating your complex regular expressions in RegexBuddy.
Try them in the languages you want to test. It shows the speed in ms. Believe me, it's a great tool.
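If you would rather measure inside the target language itself, a rough timing harness is easy to write. The sketch below uses Python's standard `timeit` module; the function name `benchmark_ms`, the sample input, and the patterns are hypothetical placeholders, so substitute your own regexes and representative data.

```python
import re
import timeit


def benchmark_ms(pattern: str, text: str, repeats: int = 1000) -> float:
    """Average time in milliseconds for one search of `text`."""
    compiled = re.compile(pattern)  # compile once, outside the timed loop
    total_s = timeit.timeit(lambda: compiled.search(text), number=repeats)
    return total_s / repeats * 1000.0


if __name__ == "__main__":
    sample = "GET /index.html HTTP/1.1 " * 200  # hypothetical sample input
    for pat in (r"HTTP/\d\.\d", r"(\w+) \1", r"/[a-z]+\.html(?= HTTP)"):
        print(f"{pat!r}: {benchmark_ms(pat, sample):.4f} ms per search")
```

Numbers from a harness like this are only comparable on the same machine and input, so treat them as a relative guide, not an absolute ranking of engines.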
Then your choice may come down to the choice of regex engine.
Will your program run on single-core or multi-core machines, and on x86 or x64?