Which literal characters should be escaped in a regex?
I just wrote a regex for use with the PHP function preg_match that contains the following part:
[\w-.]
It is meant to match any word character, as well as a minus sign and a dot. While it seems to work in preg_match, when I put it into a utility called Reggy, it complained about "Empty range in char class". Trial and error taught me that escaping the minus sign solves the problem, turning the regex into
[\w\-.]
Since the original appears to work in PHP, I am wondering why I should or should not escape the minus sign and, since the dot is also a character with a special meaning in PHP regexes, why I do not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect, or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?
Comments (5)
In many regex implementations, the following rules apply:
Meta characters inside a character class are:
^ (negation)
- (range)
] (end of the class)
\ (escape char)
So these should all be escaped. There are some corner cases though:
A - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a short-hand character class ([\w-abc]). This is what you observed.
A ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which equals [\^a].
A ] needs no escaping if it's the only character in the class: []] matches the char ].
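These corner cases can be checked against a concrete engine. The sketch below uses Python's re module purely as an illustration; it is not the PHP engine from the question, and other dialects may treat some of these classes differently. The helper name accepts is hypothetical.

```python
import re

def accepts(pattern, chars):
    """Return the characters from `chars` that the character class matches."""
    return "".join(c for c in chars if re.fullmatch(pattern, c))

# Hyphen at the end or the start of the class is a literal hyphen.
print(accepts(r"[abc-]", "abc-d"))
print(accepts(r"[-abc]", "abc-d"))

# Hyphen directly after a completed range (a-c) is a literal in Python's re.
print(accepts(r"[a-c-e]", "abcde-"))

# Caret is only special as the first character of the class.
print(accepts(r"[a^]", "a^b"))   # literal caret
print(accepts(r"[^a]", "a^b"))   # negation: everything except "a"

# A closing bracket placed first in the class is a literal.
print(accepts(r"[]]", "]a"))
```

Note that Python does not accept a hyphen directly after a short-hand class such as \w, so that particular corner case depends on the dialect.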
A . usually means any character, but between [] it has no special meaning.
A - between [] indicates a range, unless it is escaped or is the first or last character between [].
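The difference between the dot outside and inside brackets is easy to see in practice. A minimal sketch, using Python's re module as an assumed stand-in for the dialect (the variable names are illustrative only):

```python
import re

wildcard = re.compile(r".")       # outside a class: any character except a newline
literal_dot = re.compile(r"[.]")  # inside a class: only a literal full stop

print(bool(wildcard.fullmatch("x")))      # the wildcard accepts "x"
print(bool(literal_dot.fullmatch("x")))   # the class does not
print(bool(literal_dot.fullmatch(".")))   # the class accepts "."
```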
While there are indeed some characters that should be escaped in a regex, you're asking not about regexes in general but about character classes, where the dash is the special one.
Instead of escaping it, you could put it at the end of the class:
[\w.-]
The full stop loses its meta meaning in the character class.
The - has special meaning in the character class. If it isn't placed at the start or at the end of the square brackets, it must be escaped. Otherwise it denotes a character range (A-Z).
You triggered another special case, however. [\w-.] works because \w does not denote a single character, so PCRE cannot possibly create a character range. \w is a possibly non-contiguous class of symbols, so there is no end character that could be used to create a range such as "Z to .". Also, the full stop . would precede the first ASCII character that \w could match (0). No range can be constructed, which is why - worked without escaping for you.
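Not every engine is as forgiving as PCRE here. As an illustration only (Python's re module, not PHP and not the Reggy utility from the question), the sketch below shows a dialect that rejects the unescaped form outright, and three spellings that avoid the ambiguity. The helper compiles and the sample string are hypothetical.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Python refuses to build a range whose start is a short-hand class.
print(compiles(r"[\w-.]"))

# Escaping the hyphen, or moving it to the end or the start, is accepted.
SAFE_SPELLINGS = [r"[\w\-.]", r"[\w.-]", r"[-\w.]"]
sample = "a_9-.!, "
for pattern in SAFE_SPELLINGS:
    matched = [c for c in sample if re.fullmatch(pattern, c)]
    print(pattern, matched)
```

Writing the class in one of the unambiguous forms keeps the pattern portable between dialects, whichever one a given tool happens to use.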
If you are using PHP and you need to escape special regex characters, just use preg_quote (see the example on php.net).
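For comparison, Python's re.escape plays a role similar to preg_quote. The sketch below is not the php.net example and says nothing about preg_quote's exact output; it only illustrates, in Python, the idea of letting the library do the escaping. The variable names are illustrative.

```python
import re

# Characters that are special inside a character class.
raw = "-.^]\\"

# Let the library escape them, then build a class from the result.
char_class = "[" + re.escape(raw) + "]"

# Every raw character is matched literally; an unrelated one is not.
print(all(re.fullmatch(char_class, c) for c in raw))
print(re.fullmatch(char_class, "a") is None)

# The same helper protects literals used outside a class:
# the escaped dot no longer acts as a wildcard.
print(re.search(re.escape("a.b"), "axb") is None)
print(re.search(re.escape("a.b"), "a.b") is not None)
```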