Parsing multi-byte strings in PHP
I would like to write an (HTML) parser based on a state machine, but I have doubts about how to actually read and use the input. I decided to load the whole input into one string, then work with it as with an array and keep an index as the current parsing position.
There would be no problem with a single-byte encoding, but in a multi-byte encoding each value represents not a character but a single byte of a character.
Example:
$mb_string = 'žščř'; // 4 multi-byte characters in UTF-8 (8 bytes)
for ($i = 0; $i < 4; $i++)
{
    echo $mb_string[$i], PHP_EOL; // prints one byte, not one character
}
Outputs:
Ĺ
ž
Ĺ
Ą
This means I cannot iterate through the string in a loop to check single characters, because I never know whether I am in the middle of a character or not.
So the questions are:
- How do I read a single character from a string in a multi-byte safe and performance-friendly way?
- Is it a good idea to work with the string as if it were an array in this case?
- How would you read the input?
Comments (3)
http://php.net/mb_string is the thing you're looking for
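As a rough illustration of what that extension offers for the example from the question, here is a minimal sketch. It assumes valid UTF-8 input and that ext/mbstring is loaded; char_at() and split_chars() are hypothetical helper names, not part of PHP.

```php
<?php
// Sketch: read characters (not bytes) with the mbstring extension.
// Assumes valid UTF-8 input and that ext/mbstring is loaded.

// Hypothetical helper: returns the character at character index $i.
function char_at(string $s, int $i): string
{
    return mb_substr($s, $i, 1, 'UTF-8');
}

// Hypothetical helper: splits the string into an array of characters once,
// so later lookups are plain array accesses. Returns [] if the input is
// not valid UTF-8 (preg_split() returns false in that case).
function split_chars(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

$mb_string = 'žščř';
$length = mb_strlen($mb_string, 'UTF-8'); // counts characters, not bytes

for ($i = 0; $i < $length; $i++) {
    echo char_at($mb_string, $i), PHP_EOL;
}
```

Calling mb_substr() in a loop may get slow on long inputs, because a character index generally has to be located by scanning the string. Splitting once with split_chars() and indexing the resulting array is likely the better fit for a parser that moves back and forth over the input.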
Without the mb_* functions, you can still read multi-byte encoded strings with the standard substring functions, as long as you read as many bytes as the encoding uses for each character.
For example, the characters in your UTF-8 string take 2 bytes each (UTF-8 in general is variable-width, 1 to 4 bytes per character). So if you need the first character of that string,
you have to get both the $string[0] and the $string[1] values, that is, the 2-byte substring starting at index 0: substr($string, 0, 2).
Note that $string[0] or $string[N] references a single byte of the multi-byte string (the byte at offset 0 or N), not a character.
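The byte-level approach can be sketched for UTF-8 in general by reading the sequence length from the first byte of each character. This is only an illustration: utf8_char_length() and utf8_read_char() are hypothetical helper names, and the code assumes the input is valid UTF-8 and that the offset points at the first byte of a character.

```php
<?php
// Sketch: read one UTF-8 character starting at byte offset $pos,
// using only byte-oriented string functions (no mbstring).

// Hypothetical helper: number of bytes in the character starting at $pos.
function utf8_char_length(string $s, int $pos): int
{
    $byte = ord($s[$pos]);
    if ($byte < 0x80) { return 1; }             // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) { return 2; }  // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; }  // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; }  // 11110xxx
    // 10xxxxxx is a continuation byte: $pos is in the middle of a character.
    throw new InvalidArgumentException("Offset $pos is not the start of a UTF-8 character");
}

// Hypothetical helper: the whole character starting at byte offset $pos.
function utf8_read_char(string $s, int $pos): string
{
    return substr($s, $pos, utf8_char_length($s, $pos));
}

$mb_string = 'žščř';
$pos = 0;
$end = strlen($mb_string); // length in bytes
while ($pos < $end) {
    $char = utf8_read_char($mb_string, $pos);
    echo $char, PHP_EOL;
    $pos += strlen($char); // advance by the number of bytes consumed
}
```

For an HTML parser specifically, it may help that bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so scanning byte by byte for ASCII markup characters such as < or > cannot land in the middle of a character.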
regards,