UTF-8 characters in preg_match_all (PHP)
I have preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);
If $in = 'hëllo', $out is:
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "ë"
      [1]=>
      int(1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(1) "o"
      [1]=>
      int(5)
    }
  }
}
The position of o should be 4. I've read about this problem online (the ë gets counted as 2). Is there a solution for this? I've seen mb_substr and similar, but is there something like this for preg_match_all?
Kind of related: is there an equivalent of preg_match_all in Python? (Returning an array of matches with their position in the string.)
Comments (4)
This is not a bug; PREG_OFFSET_CAPTURE refers to the byte offset of the character in the string. mb_ereg_search_pos behaves the same way. One possibility is to convert the string to UTF-32 first and then divide the position by 4 (because every Unicode code point is represented as a 4-byte sequence in UTF-32).
You could also convert the byte positions into code point positions. For UTF-8, even a suboptimal implementation of this is straightforward.
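A minimal sketch of such a conversion, not the original poster's code. It assumes valid UTF-8 input, and the helper name utf8_byte_to_char_offset() is hypothetical. It counts every byte before the given offset that is not a continuation byte (bit pattern 10xxxxxx), since each of those starts a new character:

```php
<?php
// Hypothetical helper: convert a byte offset (as reported by
// PREG_OFFSET_CAPTURE) into a character offset. Assumes valid UTF-8.
function utf8_byte_to_char_offset($string, $byteOffset)
{
    $chars = 0;
    for ($i = 0; $i < $byteOffset; $i++) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // is the first byte of a character.
        if ((ord($string[$i]) & 0xC0) !== 0x80) {
            $chars++;
        }
    }
    return $chars;
}

// "h\xC3\xABllo" is 'hëllo' spelled out byte by byte.
echo utf8_byte_to_char_offset("h\xC3\xABllo", 5), "\n"; // prints 4
```

The loop is linear in the offset, so calling it once per match on a long string is where the "suboptimal" part shows.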
There is a simple workaround that can be applied after preg_match() has returned its matches. You need to iterate over every match result and reassign its position value.
Tested on PHP 5.4 under Windows; it depends only on the Multibyte PHP extension.
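The reassignment step might look like the sketch below. It is an illustration under assumptions rather than the poster's exact code: it assumes the source file and $in are UTF-8, reuses the pattern from the question, and measures the prefix before each match with mb_strlen() to get a character position.

```php
<?php
$in = 'hëllo';
preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);

foreach ($out as &$group) {
    foreach ($group as &$match) {
        // $match is array(matched text, byte offset).
        // Offsets of 0 need no change; -1 marks an unmatched group.
        if ($match[1] > 0) {
            // Count the characters in the bytes that precede the match.
            $match[1] = mb_strlen(substr($in, 0, $match[1]), 'UTF-8');
        }
    }
    unset($match); // break the reference before the next iteration
}
unset($group);

echo $out[0][1][0], ' at ', $out[0][1][1], "\n"; // prints "o at 4"
```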
PHP doesn't support Unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.
I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.
About the Python question: a Python regex match object contains the match position by default, via mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions
Another way to split a UTF-8 $string by a regular expression is to use the function preg_split(). My working solution ran on PHP 5.3.17.
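One way the preg_split() idea could be sketched, offered as an assumption about the approach rather than the poster's actual solution. The function name match_positions_by_split() is hypothetical, the pattern is assumed to wrap the whole expression in exactly one capturing group and never match the empty string, and the source file is assumed to be UTF-8. With PREG_SPLIT_DELIM_CAPTURE the pieces alternate between plain text and matched text, so a running mb_strlen() total gives each match's character position:

```php
<?php
// Hypothetical helper: returns array(array(text, character position), ...).
// $pattern must contain exactly one capturing group around the whole
// expression and must not match the empty string.
function match_positions_by_split($pattern, $string)
{
    $pieces = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    $matches = array();
    $position = 0;
    foreach ($pieces as $index => $piece) {
        // Even indexes hold the text between matches, odd indexes the matches.
        if ($index % 2 === 1) {
            $matches[] = array($piece, $position);
        }
        $position += mb_strlen($piece, 'UTF-8');
    }
    return $matches;
}

$result = match_positions_by_split('/([aäeëioöuáéíóú])/u', 'hëllo');
echo $result[1][0], ' at ', $result[1][1], "\n"; // prints "o at 4"
```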