Does [a-z] match accented characters in PREG/PCRE?

Posted 2024-08-15 13:55:19

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with \w?

第七度阳光i 2024-08-22 13:55:19

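Long story short: maybe. It depends on the system the application is deployed to and on how PHP was compiled. Welcome to the CF of localization and internationalization.

The underlying PCRE engine takes the locale into account when determining what "a-z" means. In a Spanish locale, ñ would be caught by a-z. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.

~~However, the way PHP blindly handles strings as collections of bytes rather than collections of UTF code points means that a-z might match an accented character.~~ Given the variety of systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z.

I would also conjecture that this regular expression exists as the result of a bug report about German umlauts not being filtered.

Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing to non-PCRE-core-developers" documentation) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. While the anonymous Drupal developer's comment is incorrect, it wasn't a matter of "getting [a-z] confused with \w". Rather, a Drupal developer was unclear or unsure about how PCRE handles [a-z] and chose the more specific form abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.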
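To make the comparison concrete, here is a minimal sketch that runs the explicit alphabet and the a-z range side by side. The helper name strip_disallowed() and the sample class name are made up for illustration and are not taken from Drupal; the input is written with byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper (not part of Drupal): remove every byte that is not
// covered by the given character-class body, an underscore, or a hyphen.
function strip_disallowed(string $allowed, string $class): string
{
    // '!' is the delimiter, as in the Drupal snippet. The hyphen is placed
    // last in the class so that it is always read as a literal hyphen.
    return (string) preg_replace('![^' . $allowed . '_-]+!s', '', $class);
}

// "Grün-Größe_1" written out as UTF-8 bytes.
$class = "Gr\xC3\xBCn-Gr\xC3\xB6\xC3\x9Fe_1";

$explicit = strip_disallowed('abcdefghijklmnopqrstuvwxyz0-9', $class);
$ranged   = strip_disallowed('a-z0-9', $class);

echo $explicit, "\n";
echo $ranged, "\n";
echo $explicit === $ranged ? "same\n" : "different\n";
```

On a typical build both calls should strip the umlaut bytes (and the uppercase G) and leave rn-re_1, which is consistent with the 2014 update above.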

难如初 2024-08-22 13:55:19

The comment in Drupal's code is WRONG.

It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].

If, e.g., you have the German locale available, you can check it like this:

setlocale(LC_ALL, 'de_DE'); // German locale (not needed, but you never know...)
echo preg_match('/^[a-z]+$/', 'abc') ? "yes\n" : "no\n";
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // äbc in ISO-8859-1
echo preg_match('/^[a-z]+$/',  "\xC3\xA4bc") ? "yes\n" : "no\n"; // äbc in UTF-8
echo preg_match('/^[a-z]+$/u', "\xC3\xA4bc") ? "yes\n" : "no\n"; // w/ PCRE_UTF8

Output (will not change if you replace de_DE with de_DE.UTF-8):

yes
no
no
no

The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both encodings the PCRE understands: ASCII-derived monobyte and UTF-8 (which is ASCII-derived too). In both of these encodings [a-z] is the same as [\x61-\x7A].

Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).
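The claim that [a-z] is the same class as [\x61-\x7A] can be checked exhaustively for single bytes. The following is a small sketch, with bytes_matching() being a helper name invented here, that tries all 256 byte values against each pattern and compares the sets.

```php
<?php
// Collect every single-byte value (0..255) that the given pattern accepts.
function bytes_matching(string $pattern): string
{
    $hits = '';
    for ($i = 0; $i < 256; $i++) {
        if (preg_match($pattern, chr($i)) === 1) {
            $hits .= chr($i);
        }
    }
    return $hits;
}

// The D modifier keeps "$" from matching before a trailing newline.
$range = bytes_matching('/^[a-z]$/D');
$hex   = bytes_matching('/^[\x61-\x7A]$/D');

echo $range, "\n";                                   // the accepted bytes
echo strlen($range), "\n";                           // how many there are
echo $range === $hex ? "identical\n" : "different\n";
```

If the answer above holds, the first line printed is the plain 26-letter alphabet and the two sets are identical, whichever locale was set beforehand.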

踏月而来 2024-08-22 13:55:19

Just an addition to both the already excellent, if contradicting, answers.

The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values". Which is somewhat vague, and yet very precise.

It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of char value (tolower(i)/toupper(i)).

In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that makes it appear outside the a-z range in all the common character encodings used for German (ISO-8859-x, unicode encodings etc.) In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.
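The code-value argument can be worked through by hand. As a rough illustration (in_a_to_z() is a name invented for this sketch, and it only mimics a comparison by byte value rather than calling PCRE), the snippet below prints the byte values of o and of ö in two encodings and checks each against 0x61..0x7A.

```php
<?php
// Report whether a single byte lies in the code-value range 'a'..'z'.
function in_a_to_z(string $byte): bool
{
    $code = ord($byte);
    return $code >= 0x61 && $code <= 0x7A;
}

$samples = [
    'o'                    => 'o',
    'o-umlaut, ISO-8859-1' => "\xF6",
    'o-umlaut, UTF-8'      => "\xC3\xB6", // two bytes
];

foreach ($samples as $label => $bytes) {
    // Print one line per byte: its hex value and where it falls.
    foreach (str_split($bytes) as $byte) {
        printf("%-22s 0x%02X %s\n", $label, ord($byte),
            in_a_to_z($byte) ? 'inside a-z' : 'outside a-z');
    }
}
```

Only the plain o (0x6F) falls inside the range; 0xF6, 0xC3 and 0xB6 all lie above 0x7A, however a German dictionary would sort ö.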

PHP has mostly copied PCRE's documentation verbatim in their docs. However, they've actually gone to pains changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.

In spite of the above, I'm not quite sure it's true, however.

Well, not in all cases, at least.

The one call PHP makes to pcre_maketables... From the PHP source:

#if HAVE_SETLOCALE
    if (strcmp(locale, "C"))
        tables = pcre_maketables();
#endif

In other words, if the environment for which PHP is compiled has setlocale and the current (LC_CTYPE) locale isn't the POSIX/C locale, the tables are built from the runtime environment's current locale. Otherwise, the default PCRE tables are used, which are generated (by pcre_maketables) when PCRE is compiled, based on the compiler's locale:

This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.
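One way to see what the locale-built tables do and do not affect is to probe a single byte under different LC_CTYPE settings. This is only a sketch: probe_locale() is a hypothetical helper, the locale names are assumptions that may not be installed on a given machine, and no particular result is promised for \w outside the C locale.

```php
<?php
// Hypothetical probe: switch LC_CTYPE to $locale, test the byte 0xE4
// (a-umlaut in ISO-8859-1) against [a-z] and \w, then restore the old
// setting. Returns null when the locale is not available on this system.
function probe_locale(string $locale): ?array
{
    $previous = setlocale(LC_CTYPE, '0'); // "0" only queries the current setting
    if (setlocale(LC_CTYPE, $locale) === false) {
        return null;
    }
    $result = [
        'range' => preg_match('/^[a-z]$/D', "\xE4") === 1,
        'word'  => preg_match('/^\w$/D', "\xE4") === 1,
    ];
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    return $result;
}

// Locale names are assumptions; availability varies from system to system.
foreach (['C', 'de_DE.ISO-8859-1', 'de_DE.UTF-8'] as $name) {
    $r = probe_locale($name);
    echo $name, ': ';
    echo $r === null
        ? 'not installed'
        : sprintf('[a-z]=%s \w=%s', $r['range'] ? 'yes' : 'no', $r['word'] ? 'yes' : 'no');
    echo "\n";
}
```

The point of separating the two checks is that the tables feed classification such as \w, while a range like [a-z] is compared by code value.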

While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
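The EBCDIC gap can be counted without an EBCDIC machine. The sketch below assumes the usual EBCDIC layout of lowercase letters in three runs (0x81..0x89, 0x91..0x99, 0xA2..0xA9); ebcdic_gap_codes() is a name invented here, and the code only counts positions rather than mapping them to glyphs, since that mapping depends on the code page.

```php
<?php
// Return the code points that lie between EBCDIC 'a' and 'z' but are not
// lowercase letters, assuming the standard three-run letter layout.
function ebcdic_gap_codes(): array
{
    $runs = [
        [0x81, 0x89], // a..i
        [0x91, 0x99], // j..r
        [0xA2, 0xA9], // s..z
    ];

    $letters = [];
    foreach ($runs as [$first, $last]) {
        $letters = array_merge($letters, range($first, $last));
    }

    // A range taken purely by code value spans everything from 'a' to 'z'.
    $span = range(0x81, 0xA9);

    return array_values(array_diff($span, $letters));
}

$gaps = ebcdic_gap_codes();
printf("non-letters inside the span: %d\n", count($gaps));
echo implode(' ', array_map(fn ($c) => sprintf('0x%02X', $c), $gaps)), "\n";
```

The span from 0x81 to 0xA9 holds 41 code points, 26 of which are letters, so a naive code-value range would admit 15 extra characters (0x8A..0x90 and 0x9A..0xA1).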

Unless PCRE does some magic when using EBCDIC (and it might), [a-z] could include other unintended characters in that case. Umlauts, on the other hand, are highly unlikely to be included in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition). And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.

ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:

Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.

They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.