Perl regular expression chokes on multiple instances of a character class
I started out with some crazy failures using preg_replace in PHP and boiled it down to the problem case of having more than one character class that combines the Turkish dotted "i" and undotted "ı". Here is a simple test case in PHP:
<?php
echo 'match single normal i: ';
$str = 'mi';
echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";
echo 'match single undotted ı: ';
$str = 'mı';
echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";
echo 'match double normal i: ';
$str = 'misir';
echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
echo 'match double undotted ı: ';
$str = 'mısır';
echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
?>
And the same test case again in Perl:
#!/usr/bin/perl
$str = 'mi';
$str =~ m/m[ıi]/ && print "match single normal i\n";
$str = 'mı';
$str =~ m/m[ıi]/ && print "match single undotted ı\n";
$str = 'misir';
$str =~ m/m[ıi]s[ıi]r/ && print "match double normal i\n";
$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";
The first three tests work fine. The last one does not match.
Why does this work fine as a character class once but not the second time in the same expression? How do I write an expression to match for a word like this that needs to match no matter what combinations of letters it is written with?
Edit: Background on the language problem I'm trying to program for.
Edit 2: Adding a use utf8; directive does fix the Perl version. Since my original problem was with a PHP program and I only switched to Perl to see if it was a bug in PHP, that doesn't help me a whole lot. Does anybody know the directive to make PHP not choke on this?
Comments (2)
You may need to tell Perl that your source file contains utf8 characters. Try:
use utf8;
Which doesn't help you with PHP but there may be a similar directive in PHP. Otherwise, try using some form of escape-sequence to avoid putting the literal character in your source-code. I know nothing about PHP so I can't help with that.
Edit
I'm reading that PHP has no Unicode support. Therefore, the unicode input you pass it is likely treated as the string of bytes that the unicode was encoded as.
If you can be assured that your input is coming in as UTF-8, then you can match the UTF-8 byte sequence for ı, which is \xc4 \xb1. Does that work?
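A minimal sketch of that byte-sequence idea in PHP, assuming the subject strings really are UTF-8. The helper name matches_misir_bytes is made up for illustration. Because a byte-mode character class only ever matches a single byte, the two-byte sequence goes into an alternation instead of inside the brackets:

```php
<?php
// Assumption: subject strings are UTF-8, where ı (U+0131) is the bytes C4 B1.
// matches_misir_bytes is a hypothetical helper, not from the original post.
function matches_misir_bytes(string $str): bool
{
    $dotless = "\xc4\xb1"; // raw UTF-8 bytes of ı
    // Alternation instead of [...]: a byte-mode class matches one byte only.
    $pattern = "!m(?:{$dotless}|i)s(?:{$dotless}|i)r!";
    return preg_match($pattern, $str) === 1;
}

// Try the plain, dotless, mixed, and unrelated spellings.
foreach (["misir", "m\xc4\xb1s\xc4\xb1r", "m\xc4\xb1sir", "masar"] as $str) {
    echo $str, ': ', matches_misir_bytes($str) ? "ok\n" : "fail\n";
}
```

Spelling the bytes out as escapes also keeps the pattern independent of the encoding the source file is saved in.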
Edit again:
I can explain why your first three tests pass. Let's pretend that in your encoding ı is encoded as ABCDE. PHP then sees the class [ıi] as [ABCDEi], a set of six single characters, and the string mısır as mABCDEsABCDEr. That makes it obvious why the first three tests pass and the last one fails: m[ABCDEi] happily matches the mA at the start of mABCDE, but in the longer pattern the character after that single class match has to be s, and it is B instead. If you use start/end anchors, ^...$, I think you'll find that only the tests written with a plain i (the first and third) pass.
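The byte-level view can be checked directly in PHP. This is only a sketch, assuming the file is saved as UTF-8 (so ı is really two bytes rather than the pretend ABCDE); byte_mode_match is a hypothetical helper name. It prints the byte length and hex dump of ı, then reruns the four tests with anchors added:

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal ı is the bytes C4 B1.
// byte_mode_match is a hypothetical helper, not from the original post.
function byte_mode_match(string $pattern, string $str): bool
{
    // No u modifier: PCRE compares bytes, not characters.
    return preg_match($pattern, $str) === 1;
}

echo strlen('ı'), ' ', bin2hex('ı'), "\n"; // prints "2 c4b1"

// The question's four tests, now anchored at both ends.
$cases = [
    ['!^m[ıi]$!', 'mi'],
    ['!^m[ıi]$!', 'mı'],
    ['!^m[ıi]s[ıi]r$!', 'misir'],
    ['!^m[ıi]s[ıi]r$!', 'mısır'],
];
foreach ($cases as [$pattern, $str]) {
    echo $str, ': ', byte_mode_match($pattern, $str) ? "ok\n" : "fail\n";
}
```

The anchors remove the partial matches that let the unanchored single-class test on mı succeed by accident.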
Multibyte sequences won't do what you want in bracketed char classes if the UTF-8 is being misinterpreted as a sequence of 8-bit bytes. Think about it. If [nñm] is misconstrued not as three logical characters but as four physical bytes, you would only match a single byte whose value is 6E or C3 or B1 or 6D.
For some purposes, you might get away with rewriting [nñm] as (?:n|ñ|m). It just depends what you're doing. Casing stuff won't work.
Also, Unicode has special casing rules for a Turkish dotless i.
Sounds like PHP just isn’t modern enough. Sigh.
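For what it's worth, PHP's preg functions do accept a u pattern modifier that makes PCRE treat both the pattern and the subject as UTF-8, which appears to be the directive the question asks for. A minimal sketch, assuming the source file and the input are valid UTF-8 (matches_misir_utf8 is a hypothetical helper name):

```php
<?php
// Assumption: this file and the subject strings are valid UTF-8.
// matches_misir_utf8 is a hypothetical helper, not from the original post.
function matches_misir_utf8(string $str): bool
{
    // The trailing u makes PCRE read pattern and subject as UTF-8,
    // so [ıi] is a class of two characters rather than three bytes.
    return preg_match('!^m[ıi]s[ıi]r$!u', $str) === 1;
}

// Every spelling of the word should match; the unrelated word should not.
foreach (['misir', 'mısır', 'mısir', 'misır', 'masar'] as $str) {
    echo $str, ': ', matches_misir_utf8($str) ? "ok\n" : "fail\n";
}
```

With the u modifier, preg_match returns false rather than 0 when the subject is not valid UTF-8, so the strict comparison against 1 treats malformed input as a non-match.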