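Will [a-z] ever match accented characters in PREG/PCRE?
I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the system's locale, but what about [a-z]?
I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):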
// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);
Is this true, or did someone simply get [a-z] confused with \w?
Long story short: maybe. It depends on the system the app is deployed to and on how PHP was compiled. Welcome to the CF of localization and internationalization.
The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by a-z: the semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.
However, the way PHP blindly handles strings as collections of bytes rather than collections of UTF code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
I'd also conjecture that this regular expression exists because someone filed a bug report about German umlauts not being filtered.
Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. The anonymous Drupal developer's comment is incorrect, but it wasn't a matter of "getting [a-z] confused with \w". Rather, a Drupal developer was unsure of how PCRE handled [a-z] and chose the more specific form abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.
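To see whether the explicit alphabet and the range behave differently on a given server, the two forms can be run side by side. This is only a sketch: the first pattern is the one quoted in the question, while the function names and the sample input are hypothetical and chosen for illustration.

```php
<?php
// Hypothetical helper names; only the first pattern comes from the Drupal snippet in the question.
function clean_class_explicit(string $class): string
{
    // Explicit alphabet, as quoted from includes/theme.inc.
    return preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);
}

function clean_class_range(string $class): string
{
    // The same character class written with a range instead.
    return preg_replace('![^a-z0-9-_]+!s', '', $class);
}

// "grüße-page_1" spelled out as UTF-8 bytes so the file's own encoding does not matter.
$input = "gr\xC3\xBC\xC3\x9Fe-page_1";

var_dump(clean_class_explicit($input));
var_dump(clean_class_range($input));
// true means both forms removed exactly the same bytes on this system.
var_dump(clean_class_explicit($input) === clean_class_range($input));
```

If the last line ever prints false on a particular build, that would be the situation the Drupal comment was guarding against.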
The comment in Drupal's code is WRONG.
It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].
If, e.g., you have the German locale available, you can check this yourself; the result will not change if you replace de_DE with de_DE.UTF-8.
The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both encodings that PCRE understands: ASCII-derived single-byte encodings and UTF-8 (which is ASCII-derived too). In both of these encodings [a-z] is the same as [\x61-\x7A].
Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).
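The answer's original snippet did not survive, so what follows is a stand-in rather than the author's code: a minimal sketch of one way to run such a check. It assumes a German locale named de_DE.UTF-8 or de_DE may be installed (setlocale() returns false if neither is), and the sample labels are made up for illustration.

```php
<?php
// Assumed locale names; setlocale() tries them in order and returns false if none is available.
$locale = setlocale(LC_ALL, 'de_DE.UTF-8', 'de_DE');
echo 'Locale in effect: ', var_export($locale, true), "\n";

// German letters as raw bytes, so the result does not depend on this file's encoding.
$samples = [
    'a-umlaut (ISO-8859-1)' => "\xE4",
    'o-umlaut (ISO-8859-1)' => "\xF6",
    'u-umlaut (ISO-8859-1)' => "\xFC",
    'a-umlaut (UTF-8)'      => "\xC3\xA4",
    'o-umlaut (UTF-8)'      => "\xC3\xB6",
    'sharp s (UTF-8)'       => "\xC3\x9F",
    'plain z'               => 'z',
];

$hits = [];
foreach ($samples as $label => $bytes) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $hits[$label] = preg_match('/[a-z]/', $bytes);
    printf("%-24s %d\n", $label, $hits[$label]);
}
```

Only the plain ASCII letter should report a match; a 1 next to any of the umlaut rows would contradict the claim made in this answer.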
PCRE 库的文档始终声明“范围按字符值的整理顺序进行操作”。这有点模糊,但非常精确。
它指的是按 PCRE 内部字符表中的字符索引进行排序,可以使用
pcre_maketables
。该函数按照 char 值的顺序构建表 (tolower(i)
/toupper(i)
)换句话说,它不会按实际的文化排序顺序进行排序 (区域设置整理信息)。例如,虽然德语在字典排序规则中将 ö 视为与 o 相同,但 ö 的值使其出现在用于德语的所有常见字符编码(ISO-8859-x、unicode 编码等)中的 az 范围之外。在这种情况下,PCRE 将根据该代码值来确定 ö 是否在
[az]
范围内,而不是任何实际的区域设置定义的排序顺序。PHP 大部分在 PCRE 文档 /manual/de/regexp.reference.character-classes.php" rel="nofollow noreferrer">他们的文档。然而,他们实际上煞费苦心地将上述语句更改为“范围按 ASCII 整理顺序进行操作”。该声明至少自 2004 年以来就已出现在文档中。
尽管如此,我不太确定它是否属实。
嗯,至少不是在所有情况下。
PHP 对
pcre_maketables
进行的一次调用...来自 PHP 源:换句话说,如果编译 PHP 的环境有
setlocale
并且 (LC_CTYPE) 语言环境不是'对于 POSIX/C 语言环境,使用运行时环境的 POSIX/C 语言环境的字符顺序。否则,将使用默认的 PCRE 表 - 这些表是在编译 PCRE 时生成的(由pcre_maketables
) - 基于编译器的区域设置:虽然德语对于任何常见字符编码中的
[az]
都没有什么不同,但如果我们处理 EBCDIC,例如,[az]
将包括 ± 和 ~。诚然,EBCDIC 是我能想到的一种不会将 az 和 AZ 放在不间断序列中的字符编码。除非 PCRE 在使用 EBCDIC 时发挥了一些作用(而且可能),但除了最晦涩的 PHP 构建或运行时环境(使用您自己的、非常特殊的、定制的语言环境定义)之外,您不太可能在任何内容中包含变音符号,在 EBCDIC 的情况下,您可能包含其他非预期字符。对于其他范围,“按 ASCII 序列整理”似乎并不完全准确。
预计到达时间:我本可以通过查找 Philip Hazel 对类似问题的回复来节省一些研究:
Just an addition to both the already excellent, if contradicting, answers.
The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values", which is somewhat vague and yet very precise.
It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of char value (tolower(i)/toupper(i)).
In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that places it outside the a-z range in all the common character encodings used for German (ISO-8859-x, Unicode encodings etc.). In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than on any actual locale-defined sort order.
PHP has mostly copied PCRE's documentation verbatim in its docs. However, they've actually gone to pains to change the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.
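The "ranges follow code values" reading can be probed directly by sweeping all 256 byte values through [a-z]. This is a rough sketch and assumes an ASCII-derived platform without the /u modifier; on an EBCDIC build the accepted set could look different.

```php
<?php
// Collect every single byte that [a-z] accepts in byte mode (no /u modifier).
$matched = '';
for ($i = 0; $i <= 255; $i++) {
    // \A and \z anchor the match to the whole one-byte subject.
    if (preg_match('/\A[a-z]\z/', chr($i)) === 1) {
        $matched .= chr($i);
    }
}

echo bin2hex($matched), "\n";   // code values of the accepted bytes
echo strlen($matched), "\n";    // how many bytes were accepted

// Code values of o-umlaut for comparison: one byte in ISO-8859-1, two bytes in UTF-8.
echo bin2hex("\xF6"), ' ', bin2hex("\xC3\xB6"), "\n";
```

Comparing the first line of output with the o-umlaut bytes shows whether any of them falls between 0x61 and 0x7A, which is all the range test looks at.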
In spite of the above, I'm not quite sure it's true, however. Well, not in all cases, at least.
PHP makes one call to pcre_maketables. Going by the PHP source, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, the character order of the runtime environment's locale is used. Otherwise, the default PCRE tables are used, which are generated (by pcre_maketables) when PCRE is compiled, based on the compiler's locale.
While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
Unless PCRE does some magic when using EBCDIC (and it might), you might, in the case of EBCDIC, include other unintended characters, even though it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition). And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.
ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern: