Getting the multibyte character count before a preg_match() match (PREG_OFFSET_CAPTURE only gives byte counts)
I'm trying to search a UTF8-encoded string using preg_match.
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];
This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.
I have the following settings in my php.ini, and other UTF8 functions are working:
mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off
Any ideas?
Answers (10)
Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
You can use mb_strlen on the part of the subject before the match to get its length in UTF-8 characters rather than bytes.
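A minimal sketch of that conversion, assuming the mbstring extension is loaded and that `mbstring.func_overload` is not active, so that `substr` really counts bytes. The variable names are only illustrative.

```php
<?php
$subject = "\xC2\xA1Hola!"; // "¡Hola!" encoded as UTF-8; "¡" occupies two bytes

if (preg_match('/H/u', $subject, $matches, PREG_OFFSET_CAPTURE)) {
    $byteOffset = $matches[0][1];                   // offset in bytes, as reported by PCRE
    $prefix     = substr($subject, 0, $byteOffset); // the bytes that precede the match
    $charOffset = mb_strlen($prefix, 'UTF-8');      // the same prefix measured in characters

    echo $byteOffset, ' ', $charOffset, "\n";       // prints "2 1"
}
```

With function overloading switched on, as in the question's php.ini, `substr` would presumably behave like `mb_substr` and cut by characters instead, so the prefix would have to be taken with a byte-based function such as `mb_strcut`.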
Try adding this (*UTF8) before the regex.
Magic, thanks to a comment in https://www.php.net/manual/function.preg-match.php#95828
Looks like this is a "feature"; see http://bugs.php.net/bug.php?id=37391
The 'u' switch only makes sense for PCRE; PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences, and returning a byte offset seems logical (I don't say "correct").
Excuse me for necroposting, but maybe somebody will find this useful: a wrapper can serve as a replacement for both preg_match and preg_match_all and return the correct matches with the correct offsets for UTF8-encoded strings.
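As a rough sketch of what such a replacement might look like (this is not the answer's original code; the function name `utf8_preg_match_all` is hypothetical, and it assumes mbstring is loaded and `substr` is not overloaded):

```php
<?php
/**
 * Hypothetical helper: runs preg_match_all() and rewrites every captured
 * byte offset as a UTF-8 character offset. Results are in PREG_SET_ORDER.
 */
function utf8_preg_match_all($pattern, $subject, &$matches = null)
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if (!$count) {
        return $count; // 0 (no match) or false (error)
    }
    foreach ($matches as &$set) {
        foreach ($set as &$group) {
            // $group is [matched text, byte offset]; unmatched groups report offset -1
            if ($group[1] >= 0) {
                $group[1] = mb_strlen(substr($subject, 0, $group[1]), 'UTF-8');
            }
        }
        unset($group);
    }
    unset($set);

    return $count;
}

// "¡Hola, señor!" in UTF-8: both "¡" and "ñ" take two bytes
$n = utf8_preg_match_all('/[aeiou]/u', "\xC2\xA1Hola, se\xC3\xB1or!", $found);
foreach ($found as $set) {
    echo $set[0][0], '@', $set[0][1], ' ';
}
// prints "o@2 a@4 e@8 o@10 "
```

Each offset is converted by measuring the prefix from the start of the subject, which keeps the sketch short but costs more on long subjects with many matches.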
You can calculate the real UTF-8 offset by cutting the string at the offset returned by preg_match with the byte-counting substr and then measuring this prefix with the correct-counting mb_strlen.
If all you want to do is find the multibyte-safe position of H, try mb_strpos().
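A short illustration of the difference, assuming mbstring is loaded and `strpos` is not overloaded; the variable names are illustrative:

```php
<?php
$haystack = "\xC2\xA1Hola!"; // "¡Hola!" in UTF-8

$charPos = mb_strpos($haystack, 'H', 0, 'UTF-8'); // position counted in characters
$bytePos = strpos($haystack, 'H');                // position counted in bytes

var_dump($charPos); // int(1)
var_dump($bytePos); // int(2)

// mb_strpos() returns false when the needle is absent, so compare with ===
var_dump(mb_strpos($haystack, 'x', 0, 'UTF-8') === false); // bool(true)
```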
I wrote a small class to convert the offsets returned by preg_match to proper UTF-8 offsets.
You can see it in use here: https://3v4l.org/8Y32J
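For readers who cannot open the link, here is one way such a converter could be written. It is a sketch under assumptions, not the class from this answer: the name `Utf8OffsetConverter` and its method are hypothetical, and it assumes PHP 7.1 or later with mbstring and no function overloading.

```php
<?php
/**
 * Hypothetical converter (a sketch, not the class from the answer above).
 * It remembers the last converted position, so a run of ascending byte
 * offsets is converted without re-measuring the subject from the start.
 */
final class Utf8OffsetConverter
{
    private $subject;
    private $lastByte = 0;
    private $lastChar = 0;

    public function __construct(string $subject)
    {
        $this->subject = $subject;
    }

    public function toCharOffset(int $byteOffset): int
    {
        if ($byteOffset < $this->lastByte) {
            // Offsets went backwards: start measuring from the beginning again.
            $this->lastByte = 0;
            $this->lastChar = 0;
        }
        $chunk = substr($this->subject, $this->lastByte, $byteOffset - $this->lastByte);
        $this->lastChar += mb_strlen($chunk, 'UTF-8');
        $this->lastByte = $byteOffset;

        return $this->lastChar;
    }
}

$subject   = "\xC2\xA1Hola, se\xC3\xB1or!"; // "¡Hola, señor!" in UTF-8
$converter = new Utf8OffsetConverter($subject);

preg_match_all('/o/u', $subject, $m, PREG_OFFSET_CAPTURE);
$charOffsets = [];
foreach ($m[0] as [$text, $byteOffset]) {
    $charOffsets[] = $converter->toCharOffset($byteOffset);
}
echo implode(',', $charOffsets), "\n"; // prints "2,10"
```

The byte offsets passed in are expected to fall on character boundaries, which holds for offsets reported by a pattern that uses the u modifier.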
You might want to look at the T-Regx library.
Its $match->offset() is a UTF-8-safe offset.
For me, the problem was solved simply by using plain substr instead of the expected mb_substr (PHP 7.4).
Using mb_substr together with preg_match_all / PREG_OFFSET_CAPTURE (with or without the /u modifier) produced an incorrect position when the text contained the euro sign (€).
iconv and utf8_encode did not help either, and I was not able to use htmlentities.
Just reverting to plain substr helped, and it worked correctly with € and other characters.
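This fits the earlier point that captured offsets are byte counts: a byte offset lines up with the byte-based substr, but not with the character-based mb_substr. A small illustration with made-up sample text, assuming mbstring is loaded and no function overloading:

```php
<?php
$text = "Total: 5\xE2\x82\xAC incl. tax"; // "Total: 5€ incl. tax"; "€" is three bytes in UTF-8

preg_match('/incl/u', $text, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

$withSubstr   = substr($text, $byteOffset, 4);             // byte offset + byte-based function
$withMbSubstr = mb_substr($text, $byteOffset, 4, 'UTF-8'); // byte offset + character-based function

echo $byteOffset, "\n";   // prints 12
echo $withSubstr, "\n";   // prints "incl"
echo $withMbSubstr, "\n"; // prints "cl. " (two characters too far to the right)
```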
I think working with PREG_OFFSET_CAPTURE in this case only creates more work.
If the pattern only contains literal characters, then preg_ is overkill; just use mb_strpos() and bear in mind that the returned value will be false if the needle is not found in the haystack.
If you know that the needle will exist in the haystack, you can use preg_match_all() with the marvellous \G (continue) metacharacter and the \X (any character, multibyte included) metacharacter. If you don't know whether the needle will exist in the haystack, just check whether the returned count is equal to the multibyte length of the input string.
But if you are going to call an extra mb_ function anyhow, then make just one match, check whether a match was made, and measure its multibyte length if so.
All this said, I've never seen the need to count the multibyte position of something unless the greater task was to isolate or replace a substring. If this is the case, avoid this step entirely and just use preg_match() or preg_replace() to serve your needs more directly.
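The three approaches described in this answer might look roughly like the following. The original scripts are not shown here, so the patterns are a reconstruction under assumptions, with illustrative variable names; mbstring is assumed to be loaded.

```php
<?php
$haystack = "\xC2\xA1Hola!"; // "¡Hola!" in UTF-8
$needle   = 'H';

// 1. Literal needle: mb_strpos() is enough (it returns false if the needle is absent).
$viaStrpos = mb_strpos($haystack, $needle, 0, 'UTF-8');

// 2. Count the characters in front of the needle. \G makes each match start
//    where the previous one ended, and (?!H) stops the run at the needle.
$viaCount = preg_match_all('/\G(?!H)\X/u', $haystack);

// 3. Make one match for everything before the needle, then measure it.
$viaPrefix = preg_match('/^.*?(?=H)/su', $haystack, $m)
    ? mb_strlen($m[0], 'UTF-8')
    : false;

var_dump($viaStrpos, $viaCount, $viaPrefix); // int(1) int(1) int(1)
```

One caveat on the second approach: \X matches an extended grapheme cluster, so for text with combining marks the count can be lower than the code point count that mb_strlen reports.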