How do I detect malformed UTF-8 strings in PHP?
The iconv function sometimes gives me an error:
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...]
Is there a way to detect that there are illegal characters in a UTF-8 string before sending data to iconv()?
First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. It will return 0 (with no additional information) if an invalid string is given.
Another possibility is mb_check_encoding [PHP Manual].
Another function you can use is mb_detect_encoding [PHP Manual]. It's important to set the strict parameter to true.
Additionally, iconv [PHP Manual] allows you to change/drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notice; this behavior cannot be changed.) You can use @ and check the length of the returned string.
Check the examples on the iconv manual page as well.
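As a rough sketch of the first two checks: the helper names is_valid_utf8_pcre and is_valid_utf8_mb are made up for illustration, only preg_match and mb_check_encoding are real PHP functions, and the second one assumes the mbstring extension is loaded. Current PHP versions report the preg_match failure as false rather than 0, so the sketch simply compares the result against 1.

```php
<?php
// Hypothetical helper names, for illustration only.

// Validity check via PCRE: the /u modifier makes preg_match() treat the
// subject as UTF-8 and fail on malformed input. The empty pattern matches
// any valid subject, so a result of 1 means "valid".
function is_valid_utf8_pcre(string $s): bool
{
    return @preg_match('//u', $s) === 1;
}

// Validity check via mbstring (assumes the mbstring extension is loaded).
function is_valid_utf8_mb(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}
```

For example, "caf\xC3\xA9" (the UTF-8 bytes for "café") should pass both checks, while "caf\xE9" (the same word in ISO-8859-1) should fail them, since a lone 0xE9 byte is an incomplete multibyte sequence.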
If you are using json_encode, try json_last_error (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45).
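A minimal sketch of that idea, assuming the goal is only to find out whether a string is valid UTF-8. The function name is_valid_utf8_json is hypothetical; json_encode, json_last_error and the JSON_ERROR_UTF8 constant are part of PHP's json extension.

```php
<?php
// Hypothetical helper name, for illustration only.
function is_valid_utf8_json(string $s): bool
{
    // json_encode() refuses malformed UTF-8; the return value is ignored
    // here because only the error code matters.
    @json_encode($s);

    // JSON_ERROR_UTF8 signals "malformed UTF-8 characters" in the input.
    return json_last_error() !== JSON_ERROR_UTF8;
}
```

This encodes the whole string just to throw the result away, so it is more of a workaround than a dedicated validity check.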
You could try using mb_detect_encoding to detect whether you've got a different character set (than UTF-8), then mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than giving you invalid UTF-8.
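Something along these lines might work for the detect-then-convert approach. The function name to_utf8 and the default fallback list are assumptions for illustration, and the sketch checks UTF-8 validity first with mb_check_encoding before guessing at other encodings. All three mb_* functions come from the mbstring extension.

```php
<?php
// Hypothetical helper; the fallback list is an assumption and should be
// adapted to the encodings you actually expect to receive.
function to_utf8(string $s, array $fallbacks = ['ISO-8859-1']): ?string
{
    // Already valid UTF-8: nothing to do.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Strict mode (third argument true) only accepts encodings in which
    // the whole string is valid.
    $enc = mb_detect_encoding($s, $fallbacks, true);
    if ($enc === false) {
        return null; // none of the candidate encodings fits
    }

    return mb_convert_encoding($s, 'UTF-8', $enc);
}
```

Keep in mind that single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so detection is a guess; the result is only as good as the candidate list.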
The specification of which characters are invalid in UTF-8 is pretty clear. You probably want to strip those out before trying to parse it. They shouldn't be there, so if you could avoid them even before generating the XML, that would be even better.
See here for a reference:
http://www.w3.org/TR/xml/#charsets
That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.
However, iconv might have built-in support for this:
http://www.zeitoun.net/articles/clear-invalid-utf8/start
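For the XML side, a hedged sketch of stripping characters outside the Char production of the XML 1.0 specification linked above. The function name strip_invalid_xml_chars is hypothetical, and this is a different technique from the iconv approach in the second link: it assumes the input is already valid UTF-8 and uses preg_replace with the /u modifier.

```php
<?php
// Hypothetical helper. Keeps only code points allowed by XML 1.0:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Returns null if the input is not valid UTF-8, because preg_replace()
// with /u fails on malformed subjects.
function strip_invalid_xml_chars(string $s): ?string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($pattern, '', $s);
}
```

Control characters such as NUL or vertical tab are removed, while tab, line feed and carriage return are kept.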
Put an @ in front of iconv() to suppress the NOTICE, and an //IGNORE after UTF-8 in the source encoding id to ignore invalid characters.
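A sketch of how that could look, with some caveats. The helper names drop_invalid_utf8 and had_invalid_utf8 are made up for illustration. The PHP manual documents //IGNORE as a suffix of the output charset, so the sketch appends it there rather than to the source encoding. How //IGNORE behaves depends on the underlying iconv library: some builds still return false on invalid input, so the sketch treats false as a failure instead of assuming a cleaned string comes back.

```php
<?php
// Hypothetical helpers; iconv() itself is the real PHP function.

// Returns the string with unconvertible bytes dropped, or null if
// iconv() reported a failure.
function drop_invalid_utf8(string $s): ?string
{
    // @ suppresses the notice raised for invalid input.
    // //IGNORE asks iconv to discard what it cannot convert.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    return $clean === false ? null : $clean;
}

// Length comparison: a failure or a shorter result means the input
// contained something iconv did not accept.
function had_invalid_utf8(string $s): bool
{
    $clean = drop_invalid_utf8($s);
    return $clean === null || strlen($clean) !== strlen($s);
}
```

Because of the library differences, it is probably safer to use one of the explicit validity checks first and only fall back to iconv for cleanup.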