How to detect malformed UTF-8 strings in PHP?

Posted 2024-11-24 16:58:27, 207 characters, 0 views, 0 comments

The iconv function sometimes gives me an error:

Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before sending the data to iconv()?

度的依靠╰つ 2024-12-01 16:58:27

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

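You can make use of the UTF-8 validity check that has been available in preg_match (see the PHP manual) since PHP 4.3.5. With the empty pattern it returns 1 for a valid string and a falsy result for an invalid one (false on current PHP versions), with no further detail in the return value itself; preg_last_error() then reports PREG_BAD_UTF8_ERROR: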

$isUTF8 = preg_match('//u', $string);
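The check above can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name is_valid_utf8 is hypothetical, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (name is an assumption): true when $string is valid UTF-8.
function is_valid_utf8(string $string): bool
{
    // The empty pattern matches any valid subject, so a result of 1 means "valid".
    // An invalid subject makes preg_match() fail instead of matching.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // well-formed two-byte sequence
var_dump(is_valid_utf8("\xB1\x31"));    // stray continuation byte

// After a failed check, preg_last_error() says why the call failed.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Comparing strictly against 1 avoids having to distinguish 0 from false across PHP versions.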

Another possibility is mb_check_encoding (see the PHP manual):

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding (see the PHP manual):

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv (see the PHP manual) lets you transliterate or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.
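To avoid the notice altogether rather than suppress it, the input can be validated before iconv is called. The sketch below assumes the mbstring and iconv extensions are loaded; the helper name utf8_to_latin1 and the choice to return null for malformed input are illustrative assumptions, not something the answer prescribes.

```php
<?php
// Hypothetical helper: convert only when the input is valid UTF-8,
// so iconv() never sees an incomplete multibyte sequence.
function utf8_to_latin1(string $string): ?string
{
    if (!mb_check_encoding($string, 'UTF-8')) {
        return null; // malformed input: the caller decides what to do with it
    }

    $converted = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);

    // iconv() returns false on failure; map that to null as well.
    return $converted === false ? null : $converted;
}

var_dump(utf8_to_latin1("caf\xC3\xA9")); // valid input is converted
var_dump(utf8_to_latin1("\xB1\x31"));    // invalid input is rejected up front
```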

岁月苍老的讽刺 2024-12-01 16:58:27

If you are using json_encode, try json_last_error:

<?php
// An invalid UTF8 sequence
$text = "\xB1\x31";

$json  = json_encode($text);
$error = json_last_error();

var_dump($json, $error === JSON_ERROR_UTF8);

Output (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45):

string(4) "null"
bool(true)
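The output above is for the listed 5.3/5.4 releases. On later PHP versions the same json_last_error() check still applies, and PHP 7.2 added the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags for encoding a string despite bad bytes. A minimal sketch, assuming PHP 7.2 or later:

```php
<?php
// The same invalid UTF-8 sequence as in the example above
$text = "\xB1\x31";

// Detection: encode without flags and inspect the error code.
json_encode($text);
$isInvalidUtf8 = (json_last_error() === JSON_ERROR_UTF8);

// Repair: ask json_encode() to substitute invalid bytes instead of failing.
$json = json_encode($text, JSON_INVALID_UTF8_SUBSTITUTE);
$repairError = json_last_error();

var_dump($isInvalidUtf8, $json, $repairError === JSON_ERROR_NONE);
```

JSON_INVALID_UTF8_IGNORE works the same way but drops the offending bytes rather than replacing them.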

沧桑㈠ 2024-12-01 16:58:27

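You can try using mb_detect_encoding to detect whether the data is in a different character set (something other than UTF-8), and then use mb_convert_encoding to convert it to UTF-8 if needed. People are more likely to give you valid content in a different character set than to give you invalid UTF-8.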
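A minimal sketch of that detect-then-convert approach follows. The helper name to_utf8 and the fallback list are assumptions for illustration; the list should name the encodings you actually expect, and detection among single-byte encodings is inherently a guess.

```php
<?php
// Hypothetical helper: return $input as UTF-8, converting from a fallback
// encoding when the bytes are not already valid UTF-8.
function to_utf8(string $input, array $fallbacks = ['ISO-8859-1']): ?string
{
    // Already valid UTF-8: nothing to convert.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Strict detection among the fallback encodings only.
    $detected = mb_detect_encoding($input, $fallbacks, true);
    if ($detected === false) {
        return null; // no candidate fits; leave the decision to the caller
    }

    return mb_convert_encoding($input, 'UTF-8', $detected);
}

var_dump(to_utf8("caf\xE9")); // ISO-8859-1 bytes are converted to UTF-8
```

Checking for valid UTF-8 first keeps well-formed input untouched, since many UTF-8 byte sequences are also valid in single-byte encodings.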

幻想少年梦 2024-12-01 16:58:27

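The specification of which characters are invalid in UTF-8 is pretty clear. You probably want to strip them out before trying to parse the data. They should not be there, so it would be even better if you could avoid them before generating the XML.

See here for a reference:

http://www.w3.org/TR/xml/#charsets

This is not a complete list. Many parsers also disallow some low-numbered control characters, but I cannot find a comprehensive list right now.

However, iconv might have built-in support for this:

http://www.zeitoun.net/articles/clear-invalid-utf8/start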
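As one way to strip such characters, the sketch below removes every code point outside the Char production of the XML 1.0 specification linked above. The helper name strip_invalid_xml_chars is hypothetical, and the input is assumed to be valid UTF-8 already; with the /u modifier preg_replace returns null otherwise.

```php
<?php
// Hypothetical helper: remove code points that XML 1.0 does not allow.
// Allowed: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF.
function strip_invalid_xml_chars(string $utf8): ?string
{
    // Single quotes keep the \x{...} escapes intact for PCRE to interpret.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // preg_replace() returns null if $utf8 is not valid UTF-8.
    return preg_replace($pattern, '', $utf8);
}

var_dump(strip_invalid_xml_chars("a\x00b\x0Bc")); // NUL and vertical tab are removed
```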

氛圍 2024-12-01 16:58:27

Put an @ in front of iconv() to suppress the NOTICE and an //IGNORE after UTF-8 in the source encoding id to ignore invalid characters:

@iconv('UTF-8//IGNORE', $destinationEncoding, $yourString);
