"Regular expressions" vs. "string comparison operators/functions"
This question is framed around performance in PHP, but feel free to broaden it to any language.
After many years of using PHP and comparing strings, I've learned that preferring string comparison operators and functions over regular expressions is beneficial for performance.
I fully understand that some operations have to be done with regular expressions because of their complexity. My question is about operations that can be solved with either a regex or a string function.
Take this example:
PHP
preg_match('/^[a-z]*$/', 'thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch("thisisallalpha");
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha("thisisallalpha");
There are many other examples, but you should get the point I'm trying to make.
Which kind of string comparison should you lean towards, and why?
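One thing worth checking before comparing speed: the two PHP calls in the example are not strictly equivalent. The pattern ^[a-z]*$ accepts an empty string and rejects uppercase letters, while ctype_alpha() rejects an empty string and accepts uppercase letters. The sketch below prints the results side by side; the helper function names are made up for this illustration, and ctype_lower() is included only as a closer (still not identical) counterpart.

```php
<?php
// Sketch: compare the question's regex with ctype checks on a few inputs.
// The helper names below are hypothetical, chosen only for this illustration.

function matches_lower_regex(string $s): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[a-z]*$/', $s) === 1;
}

function matches_ctype_alpha(string $s): bool
{
    // True only for a non-empty string made up entirely of letters.
    return ctype_alpha($s);
}

function matches_ctype_lower(string $s): bool
{
    // True only for a non-empty string made up entirely of lowercase letters.
    return ctype_lower($s);
}

foreach (['thisisallalpha', 'MixedCase', '', 'abc123'] as $input) {
    printf(
        "%-18s regex=%-5s ctype_alpha=%-5s ctype_lower=%s\n",
        var_export($input, true),
        var_export(matches_lower_regex($input), true),
        var_export(matches_ctype_alpha($input), true),
        var_export(matches_ctype_lower($input), true)
    );
}
```

So "can easily be done with" holds only once you have decided how empty strings and uppercase input should be treated.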
Comments (5)
Looks like this question arose from our small argument here, so I feel somewhat obliged to respond.
PHP developers are being actively brainwashed about "performance", which gives rise to many rumors and myths, including sheer stupid things like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see the infamous comment on the preg_match page). The truth is that in most cases you don't need to care. Unless your code is repeated 10,000 times, you won't even notice a difference between a string function and a regular expression. And if your code does repeat 10,000 times, you must be doing something wrong in any case, and you will gain performance by optimizing your logic, not by stripping out regular expressions.
As for readability, regexps are admittedly hard to read; however, the code that uses them is in most cases shorter, cleaner and simpler (compare your answer and mine at the above link).
Another important concern is flexibility, especially in PHP, whose string library doesn't support Unicode out of the box. In your concrete example, what happens when you decide to migrate your site to UTF-8? With ctype_alpha you're kinda out of luck; preg_match would require another pattern, but will keep working.
So regexes are not slower, and the code that uses them is more readable and more flexible. Why on earth should we avoid them?
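To make the UTF-8 point concrete, here is a minimal sketch. It assumes PHP 7 or later (for the \u{...} string escape) and a PCRE build with Unicode support. The locale is pinned to "C" so the ctype result does not depend on the environment, and the sample word is just an illustration.

```php
<?php
// Sketch: byte-oriented ctype check vs. a Unicode-aware regex on a UTF-8 string.
// Assumes PHP 7+ and PCRE compiled with Unicode support.

// Pin the ctype locale so the result does not depend on the environment.
setlocale(LC_CTYPE, 'C');

// "héllo", written with an escape so the file encoding does not matter.
$word = "h\u{00E9}llo";

// ctype_alpha() works byte by byte; the two bytes of "é" are not ASCII letters.
$ctypeResult = ctype_alpha($word);

// The original ASCII-only pattern rejects the string as well.
$asciiRegex = preg_match('/^[a-z]*$/', $word) === 1;

// "Another pattern": \p{L} matches any Unicode letter, and the u modifier
// makes PCRE treat both pattern and subject as UTF-8.
$unicodeRegex = preg_match('/^\p{L}*$/u', $word) === 1;

var_dump($ctypeResult, $asciiRegex, $unicodeRegex);
```

Only the pattern changes; the calling code around preg_match() stays as it was.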
Regular expressions actually lead to a performance gain (not that such micro-optimizations are in any way sensible) when they can replace multiple atomic string comparisons. So at around five strpos() checks it typically becomes advisable to use a regular expression instead, all the more so for readability.
And here's another thought to round things off: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are created equal, though. If the complexity gets too high, regex recursion can kill its performance advantage. Therefore it's often worth considering a mix of regex matching and regular PHP string functions. Right tool for the job and all.
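As a rough illustration of "several strpos() checks vs. one regex", the sketch below implements the same any-of-these-keywords test both ways. The function names and the keyword list are hypothetical, and the sketch makes no claim about which version is faster on a given setup.

```php
<?php
// Sketch: several strpos() checks vs. a single alternation regex.
// Function names and the keyword list are hypothetical.

function contains_any_strpos(string $haystack, array $needles): bool
{
    foreach ($needles as $needle) {
        // strpos() returns false when not found; 0 is a valid position.
        if (strpos($haystack, $needle) !== false) {
            return true;
        }
    }
    return false;
}

function contains_any_regex(string $haystack, array $needles): bool
{
    if ($needles === []) {
        return false; // an empty alternation would match everything
    }
    // Quote each needle so regex metacharacters are matched literally.
    $quoted = array_map(
        function (string $needle): string {
            return preg_quote($needle, '/');
        },
        $needles
    );
    $pattern = '/' . implode('|', $quoted) . '/';
    return preg_match($pattern, $haystack) === 1;
}

$keywords = ['error', 'warning', 'notice', 'fatal', 'deprecated'];

var_dump(contains_any_strpos('a fatal error occurred', $keywords));
var_dump(contains_any_regex('a fatal error occurred', $keywords));
var_dump(contains_any_regex('all good', $keywords));
```

The regex version makes one call into PCRE instead of up to five function calls from PHP code, which is the trade-off described above.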
PHP itself recommends using string functions over regex functions when the match is straightforward; see, for example, the preg_match and str_replace manual pages.
However, I find that people try to use the string functions to solve problems that would be better solved by regex. For instance, when trying to create a full-word string matcher, I have encountered people trying to use strpos($string, " $word ") (note the spaces) for the sake of "performance", without stopping to think that spaces aren't the only way to delimit a word (think about how many string function calls would be needed to fully replace preg_match('/\bword\b/', $string)).
My personal stance is to use string functions for matching static strings (i.e. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.
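A small sketch of the whole-word case, showing where the space-padded strpos() check misses. The function names and sample strings are invented for the example, and preg_quote() is added so that an arbitrary $word is matched literally.

```php
<?php
// Sketch: whole-word matching with space-padded strpos() vs. \b in a regex.
// Function names and sample strings are hypothetical.

function has_word_strpos(string $text, string $word): bool
{
    // Only finds the word when it has a space on both sides.
    return strpos($text, " $word ") !== false;
}

function has_word_regex(string $text, string $word): bool
{
    // \b matches between a word character and a non-word character,
    // or at either end of the string next to a word character.
    return preg_match('/\b' . preg_quote($word, '/') . '\b/', $text) === 1;
}

$samples = [
    'a word here',      // surrounded by spaces
    'word at start',    // no leading space
    'ends with word.',  // followed by punctuation
    'swordfish',        // contains "word", but not as a whole word
];

foreach ($samples as $sample) {
    printf(
        "%-18s strpos=%-5s regex=%s\n",
        $sample,
        var_export(has_word_strpos($sample, 'word'), true),
        var_export(has_word_regex($sample, 'word'), true)
    );
}
```

Covering the start of the string, the end of the string, and punctuation with string functions alone would take several extra checks.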
Agreed that PHP people tend to over-emphasise the performance of one function over another. That doesn't mean the performance differences don't exist (they definitely do), but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string comparison. To find out where your bottlenecks are, use xdebug's profiler. Fix the issues it comes up with before worrying about fine-tuning individual lines of code.
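If you just want a feel for how small the per-call difference is, a crude timing loop is enough. To be clear, this is not xdebug's profiler and says nothing about where a real application spends its time. It is a rough micro-benchmark sketch, assuming PHP 7.3+ for hrtime(), with time_calls() as a hypothetical helper.

```php
<?php
// Sketch: rough wall-clock timing of two similar checks.
// A crude micro-benchmark, not a substitute for a profiler.
// time_calls() is a hypothetical helper; hrtime() needs PHP 7.3+.

function time_calls(callable $fn, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9; // elapsed seconds
}

$subject = 'thisisallalpha';
$iterations = 100000;

$regexSeconds = time_calls(function () use ($subject) {
    return preg_match('/^[a-z]+$/', $subject);
}, $iterations);

$ctypeSeconds = time_calls(function () use ($subject) {
    return ctype_lower($subject);
}, $iterations);

printf("preg_match:  %.4f s for %d calls\n", $regexSeconds, $iterations);
printf("ctype_lower: %.4f s for %d calls\n", $ctypeSeconds, $iterations);
```

The absolute numbers will vary by machine and PHP version. What matters is how they compare with the time the rest of a request takes.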
They're both part of the language for a reason. IsAlpha is more expressive. For example, when the thing you're checking is inherently "alphabetic or not", and that has domain meaning, then use it.
But if it is, say, an input validation that could later be changed to include underscores, dashes, etc., or if it sits alongside other logic that requires regex, then I would use regex. This tends to be the majority of the time for me.
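A quick sketch of that situation in PHP: a rule that starts as "letters only" and later has to allow underscores and dashes. The function names and the username scenario are invented for illustration.

```php
<?php
// Sketch: a validation rule that starts as "letters only" and later grows.
// is_valid_username_v1/v2 are hypothetical names for illustration.

function is_valid_username_v1(string $name): bool
{
    // Letters only: a ctype call expresses this directly.
    return ctype_alpha($name);
}

function is_valid_username_v2(string $name): bool
{
    // Requirements changed: also allow underscores and dashes.
    // \z anchors at the very end, so a trailing newline is rejected.
    return preg_match('/^[A-Za-z_-]+\z/', $name) === 1;
}

var_dump(is_valid_username_v1('john_doe')); // rejected by the letters-only rule
var_dump(is_valid_username_v2('john_doe')); // accepted by the extended rule
```

With the regex, the new requirement is a change to the character class. With ctype_alpha() the check would have to be replaced or wrapped in extra logic.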