Perl regular expression chokes on multiple instances of a character class

Posted on 2024-10-04 00:48:26

I started out with some crazy failures using preg_replace in PHP and boiled them down to one problem case: having more than one character class that contains both the Turkish dotted "i" and the undotted "ı". Here is a simple test case in PHP:

<?php
    echo 'match single normal i: ';
    $str = 'mi';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match single undotted ı: ';
    $str = 'mı';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match double normal i: ';
    $str = 'misir';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";

    echo 'match double undotted ı: ';
    $str = 'mısır';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
?>

And the same test case again in Perl:

#!/usr/bin/perl

$str = 'mi';
$str =~ m/m[ıi]/ && print "match single normal i\n";

$str = 'mı';
$str =~ m/m[ıi]/ && print "match single undotted ı\n";

$str = 'misir';
$str =~ m/m[ıi]s[ıi]r/ && print "match double normal i\n";

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

The first three tests work fine. The last one does not match.

Why does this work fine with one character class but fail when the same class appears a second time in the expression? How do I write an expression that matches a word like this no matter which combination of these letters it is written with?

Edit: Background on the language problem I'm trying to program for.

Edit 2: Adding a use utf8; directive does fix the Perl version. Since my original problem was with a PHP program, and I only switched to Perl to see whether this was a bug in PHP, that doesn't help me much. Does anybody know the directive to make PHP not choke on this?

Comments (2)

新一帅帅 2024-10-11 00:48:26

You may need to tell Perl that your source file contains UTF-8 characters. Try:

#!/usr/bin/perl

use utf8;   # **** Add this line

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

That doesn't help you with PHP, but there may be a similar directive in PHP. Otherwise, try using some form of escape sequence to avoid putting the literal character in your source code. I know nothing about PHP, so I can't help with that.
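
For what it's worth, the closest PHP counterpart appears to be the PCRE `u` pattern modifier, which makes preg_* treat both the pattern and the subject as UTF-8. The sketch below is only an illustration, not something taken from the question: the variable names are made up, and ı is spelled as its UTF-8 bytes so the encoding of the source file does not affect the result.

```php
<?php
// Dotless ı (U+0131) written as its two UTF-8 bytes.
$dotless = "\xc4\xb1";
$subject = "m{$dotless}s{$dotless}r";   // the bytes of "mısır"

// Without the u modifier the class is a set of single bytes,
// so the two-byte ı cannot be consumed by one class.
$bytePattern = "!m[{$dotless}i]s[{$dotless}i]r!";

// With the u modifier the pattern and subject are decoded as UTF-8
// and ı counts as one character inside the class.
$utf8Pattern = "!m[{$dotless}i]s[{$dotless}i]r!u";

echo 'without u: ', preg_match($bytePattern, $subject) ? "ok\n" : "fail\n";   // fail
echo 'with u: ', preg_match($utf8Pattern, $subject) ? "ok\n" : "fail\n";      // ok
```

Note that with `u` in effect, preg_match returns false rather than 0 when the subject is not valid UTF-8, so this only helps if the input really is UTF-8.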

Edit
I'm reading that PHP has no native Unicode support in its strings. The Unicode input you pass it is therefore likely treated as the string of bytes it was encoded as.

If you can be assured that your input arrives as UTF-8, you can match the UTF-8 sequence for ı, which is the two bytes \xc4 \xb1, as in:

$str = 'mısır';  # Make sure this source-file is encoded as utf-8 or this match will fail
echo (preg_match('!m(i|\xc4\xb1)s(i|\xc4\xb1)r!', $str)) ? "ok\n" : "fail\n";

Does that work?

Edit again:
I can explain why your first three tests pass. Let's pretend that in your encoding ı is encoded as ABCDE. Then PHP sees the following:

echo 'match single normal i: ';
$str = 'mi';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match single undotted ABCDE: ';
$str = 'mABCDE';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match double normal i: ';
$str = 'misir';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

echo 'match double undotted ABCDE: ';
$str = 'mABCDEsABCDEr';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

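which makes it obvious why the first three tests pass and the last one fails: the class consumes only the A, which is enough for the unanchored second test, but in the last test the B that follows is not the s the pattern expects. If you use start/end anchors ^...$, I think you'll find that only the two plain-i tests (the first and third) pass.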
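
The ABCDE analogy can be checked against the real bytes. This is a rough sketch under the assumption that the input is UTF-8, where ı is the two bytes C4 B1; anchored_match is a hypothetical helper written for this illustration, and no u modifier is used, so everything is matched byte by byte.

```php
<?php
// ı (U+0131) spelled as its two UTF-8 bytes, so the source file's encoding is irrelevant.
$dotless = "\xc4\xb1";
echo bin2hex($dotless), ' (', strlen($dotless), " bytes)\n";   // c4b1 (2 bytes)

// Hypothetical helper: wrap a pattern body in ^...$ and report the result.
function anchored_match(string $body, string $subject): string
{
    return preg_match('!^' . $body . '$!', $subject) ? 'ok' : 'fail';
}

// Without the u modifier this class is a set of three single bytes: C4, B1 and "i".
$class = '[' . $dotless . 'i]';
$word  = 'm' . $class . 's' . $class . 'r';

echo 'single dotted: ', anchored_match('m' . $class, 'mi'), "\n";                    // ok
echo 'single dotless: ', anchored_match('m' . $class, 'm' . $dotless), "\n";         // fail
echo 'double dotted: ', anchored_match($word, 'misir'), "\n";                        // ok
echo 'double dotless: ', anchored_match($word, "m{$dotless}s{$dotless}r"), "\n";     // fail
```

With the anchors in place, the second test no longer slips through, because the leftover B1 byte has nothing in the pattern to match it.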

无人接听 2024-10-11 00:48:26

Multibyte sequences won’t do what you want in bracketed character classes if the UTF-8 is being misinterpreted as a sequence of 8-bit bytes. Think about it. If [nñm] is misconstrued not as three logical characters but as four physical bytes, it matches only a single byte whose value is 6E, C3, B1, or 6D.

For some purposes, you might get away with rewriting [nñm] as (?:n|ñ|m). It just depends on what you’re doing. Case-insensitive matching won’t work, though.
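
As a small illustration of that rewrite in PHP, assuming UTF-8 input and no u modifier: the variable names and the sample word "año" are invented for this sketch, and ñ is written as its UTF-8 bytes C3 B1.

```php
<?php
// ñ (U+00F1) spelled as its two UTF-8 bytes.
$enye    = "\xc3\xb1";
$subject = "a{$enye}o";   // the bytes of "año"

// Byte-wise class: matches exactly one of the bytes 6E, C3, B1, 6D.
$classPattern = "!a[n{$enye}m]o!";

// Alternation: the middle branch is the two-byte sequence C3 B1 as a whole.
$altPattern = "!a(?:n|{$enye}|m)o!";

echo 'class: ', preg_match($classPattern, $subject) ? "ok\n" : "fail\n";       // fail
echo 'alternation: ', preg_match($altPattern, $subject) ? "ok\n" : "fail\n";   // ok
```

The alternation works because each branch is matched as a byte sequence rather than as a single slot in a class, so the engine never needs to know that C3 B1 is one character.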

Also, Unicode has special casing rules for a Turkish dotless i.

Sounds like PHP just isn’t modern enough. Sigh.
