How can I detect malformed UTF-8 strings in PHP?

Posted 2024-11-24 16:58:27, 207 characters, 0 views, 0 comments

The iconv function sometimes gives me an error:

Notice:
iconv() [function.iconv]:
Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before passing the data to iconv()?


Comments (5)

度的依靠╰つ 2024-12-01 16:58:27

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

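You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. An empty pattern with the u modifier matches any valid UTF-8 string, so the call returns 1 for valid input. For an invalid string it returns a non-1 result with no further detail (false in current PHP versions, where preg_last_error() then reports PREG_BAD_UTF8_ERROR), so compare strictly against 1:

$isUTF8 = preg_match('//u', $string) === 1;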

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.
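
The checks above can be wrapped in a single helper. This is only a minimal sketch: the function name is_valid_utf8 is made up for illustration, and it assumes PHP 7 or later (for the type declarations) with mbstring usually available, falling back to the preg_match check otherwise.

```php
<?php
// Hypothetical helper, not part of PHP itself.
// Returns true only if $s is a well-formed UTF-8 byte sequence.
function is_valid_utf8(string $s): bool
{
    if (function_exists('mb_check_encoding')) {
        // Preferred path when the mbstring extension is loaded.
        return mb_check_encoding($s, 'UTF-8');
    }
    // Fallback: an empty pattern with the u modifier matches every
    // valid UTF-8 string (result 1); invalid input does not yield 1.
    return preg_match('//u', $s) === 1;
}
```

A caller would typically test the string first and only pass it on to iconv() when the helper returns true.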

Additionally, iconv [PHP Manual] allows you to transliterate or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can suppress the notice with @ and compare the length of the returned string with the original; if the lengths are equal, nothing was dropped:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.
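
To address the original question directly (checking before the data reaches iconv), one option is to validate first and convert only valid input, which avoids both the notice and the @ operator. The sketch below is an illustration under assumptions: safe_iconv is a hypothetical name, both the mbstring and iconv extensions are assumed to be loaded, and the source encoding name is assumed to be one that mbstring recognises.

```php
<?php
// Hypothetical wrapper, not a PHP built-in.
// Returns the converted string, or null if the input is not valid
// in the source encoding or the conversion itself fails.
function safe_iconv(string $from, string $to, string $input): ?string
{
    // Reject malformed input up front so iconv() never sees it.
    if (!mb_check_encoding($input, $from)) {
        return null;
    }
    $converted = iconv($from, $to, $input);
    return $converted === false ? null : $converted;
}
```

Returning null leaves the decision (reject, log, or sanitise) to the caller instead of silently dropping bytes.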

岁月苍老的讽刺 2024-12-01 16:58:27

If you are using json_encode, try json_last_error:

<?php
// An invalid UTF8 sequence
$text = "\xB1\x31";

$json  = json_encode($text);
$error = json_last_error();

var_dump($json, $error === JSON_ERROR_UTF8);

Output (e.g. for PHP versions 5.3.3 - 5.3.13, 5.3.15 - 5.3.29, 5.4.0 - 5.4.45):

string(4) "null"
bool(true)
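
Because the value returned by json_encode for invalid input has differed between PHP versions, it may be more robust to rely only on json_last_error. A small sketch of that idea follows; the function name json_reports_invalid_utf8 is hypothetical, and the json extension is assumed to be available.

```php
<?php
// Hypothetical helper: uses json_encode purely as a UTF-8 probe.
function json_reports_invalid_utf8(string $text): bool
{
    // The encoded result is ignored on purpose; only the error code
    // recorded by the call is inspected.
    json_encode($text);
    return json_last_error() === JSON_ERROR_UTF8;
}
```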
沧桑㈠ 2024-12-01 16:58:27

You could try using mb_detect_encoding to detect whether you have a character set other than UTF-8, then use mb_convert_encoding to convert to UTF-8 if required. It's more likely that people are giving you valid content in a different character set than that they are giving you invalid UTF-8.
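
A rough sketch of that approach is shown below. The function name to_utf8 and the default candidate list are assumptions chosen for illustration; detection between single-byte encodings is inherently a guess, so the candidate list should reflect the encodings you actually expect to receive.

```php
<?php
// Hypothetical helper. Returns a UTF-8 string, or null if no
// candidate encoding fits the input.
function to_utf8(string $s, array $candidates = ['Windows-1252', 'ISO-8859-1']): ?string
{
    // Already valid UTF-8: leave it untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Strict detection among the assumed legacy encodings.
    $detected = mb_detect_encoding($s, $candidates, true);
    if ($detected === false) {
        return null;
    }
    return mb_convert_encoding($s, 'UTF-8', $detected);
}
```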

幻想少年梦 2024-12-01 16:58:27

The specification is pretty clear about which characters are invalid. You probably want to strip those out before trying to parse the data. They shouldn't be there, so if you can avoid them before even generating the XML, that would be better still.

See here for a reference:

http://www.w3.org/TR/xml/#charsets

That isn't a complete list. Many parsers also disallow some low-numbered control characters, but I can't find a comprehensive list right now.

However, iconv might have built-in support for this:

http://www.zeitoun.net/articles/clear-invalid-utf8/start
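
As an illustration of stripping such characters, the sketch below removes every code point outside the Char production of the XML 1.0 specification linked above (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The function name strip_invalid_xml_chars is hypothetical, and the input is assumed to be intended as UTF-8; if it is not well-formed UTF-8, preg_replace fails and the function returns null.

```php
<?php
// Hypothetical helper. Removes code points that XML 1.0 does not
// allow; returns null when $s is not well-formed UTF-8.
function strip_invalid_xml_chars(string $s): ?string
{
    // Negated character class built from the XML 1.0 Char production.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($pattern, '', $s);
}
```

Note that this handles characters that are valid UTF-8 but not permitted in XML; byte sequences that are not valid UTF-8 at all still need one of the validity checks discussed in the other answers.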

氛圍 2024-12-01 16:58:27

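Put an @ in front of iconv() to suppress the NOTICE and append //IGNORE to the encoding name to ignore invalid characters. The PHP manual documents //IGNORE for the destination encoding (the second argument), so append it there rather than to the source encoding:

@iconv('UTF-8', $destinationEncoding . '//IGNORE', $yourString);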