
Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

Posted on 2024-08-26 00:53:49

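I'm getting the error:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

when trying to process an XML response from a 3rd-party source using simplexml_load_string. The raw XML response does declare its encoding as UTF-8, yet the XML does not seem to really be UTF-8. The content is in Spanish and contains words like Dublín.

I'm unable to get the 3rd party to sort out their XML.

How can I pre-process the XML and fix the encoding incompatibilities?

Is there a way to detect the correct encoding of an XML file?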


掀纱窥君容 2024-09-02 00:53:49

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.

Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so that you can load your XML.)

Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.

Either way, notify your data provider that they're sending invalid data so that they can fix it.

Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.

function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}
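As a possible variation on the partial fix above, the sketch below matches well-formed multi-byte UTF-8 sequences first and leaves them alone, then converts any remaining high byte from ISO-8859-1. The function name is hypothetical, the sequence check is approximate (it does not reject overlong forms or surrogates), and it assumes every stray byte really is ISO-8859-1. It uses mb_convert_encoding() rather than utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper, not the answer's function: keeps well-formed UTF-8
// sequences as they are and converts every other high byte from ISO-8859-1.
function latin1_strays_to_utf8(string $bytes): string
{
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'      // 2-byte UTF-8 sequence
             . '|[\xE0-\xEF][\x80-\xBF]{2}'   // 3-byte UTF-8 sequence
             . '|[\xF0-\xF4][\x80-\xBF]{3}'   // 4-byte UTF-8 sequence
             . '|([\x80-\xFF])/';             // any other high byte (group 1)

    return preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is only set when a stray byte was matched.
        return isset($m[1])
            ? mb_convert_encoding($m[1], 'UTF-8', 'ISO-8859-1')
            : $m[0];
    }, $bytes);
}

// Mixed input: the first "í" is a raw ISO-8859-1 byte, the second is UTF-8.
$mixed = "Dubl\xEDn, Dubl\xC3\xADn";
$fixed = latin1_strays_to_utf8($mixed);
```

The alternation order matters here: the multi-byte alternatives are tried before the catch-all, so a valid two-byte character is never split and re-encoded.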


染墨丶若流云 2024-09-02 00:53:49

I solved this using

$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
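One caveat with converting unconditionally: input that is already valid UTF-8 gets encoded a second time, which is the mojibake mentioned in the first answer. A minimal sketch of a guarded version follows. The helper name ensure_utf8 is made up for illustration, and it assumes that anything which is not valid UTF-8 is ISO-8859-1. It uses mb_convert_encoding() because utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: convert only when the bytes are not already valid UTF-8.
function ensure_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "<city>Dubl\xEDn</city>";      // "Dublín" as ISO-8859-1 bytes
$utf8   = "<city>Dubl\xC3\xADn</city>";  // the same text, already UTF-8

// Unconditional conversion double-encodes valid input:
// the two bytes C3 AD become the four bytes C3 83 C2 AD.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Guarded conversion handles both inputs and yields parseable XML.
$xml = simplexml_load_string(ensure_utf8($latin1));
```

This still cannot repair a document that mixes both encodings, since the check is made on the string as a whole.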


花心好男孩 2024-09-02 00:53:49

If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use iconv() to strip them:

$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
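Note that this approach drops the offending bytes instead of repairing them, so a word like Dublín loses its accented letter. How iconv() treats //IGNORE depends on the system's iconv implementation, so here is a rough equivalent using mbstring instead (a swapped-in technique, not the call above). The function name is hypothetical.

```php
<?php
// Hypothetical helper: remove byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $bytes): string
{
    // By default mbstring replaces invalid bytes with "?".
    // Switch to "none" so they are dropped, then restore the old setting.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    return $clean;
}

// "Dublín" with a raw ISO-8859-1 byte (0xED) inside otherwise ASCII text.
$content = "<city>Dubl\xEDn, Irlanda</city>";
$clean   = strip_invalid_utf8($content);

// The cleaned string parses, but the accented letter is gone.
$xml = simplexml_load_string($clean);
```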


停滞 2024-09-02 00:53:49

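We recently ran into a similar issue and could not find any obvious cause. It turned out there was a control character in our string, but when we output the string to the browser the character was not visible unless we copied the text into an IDE.

Thanks to this post and the following, we managed to solve our problem: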

preg_replace('/[\x00-\x1F\x7F]/', '', $input);
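The pattern above also removes tab, line feed, and carriage return, which XML 1.0 does allow. If the whitespace should survive, a narrower character class can be used. This is a sketch with a hypothetical function name:

```php
<?php
// Hypothetical helper: strip C0 control characters and DEL,
// but keep tab (0x09), line feed (0x0A), and carriage return (0x0D).
function strip_control_chars_keep_whitespace(string $input): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

$sample = "line one\x00\n\tline\x1F two\r\n";
$clean  = strip_control_chars_keep_whitespace($sample);
```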


夏夜暖风 2024-09-02 00:53:49

Instead of using JavaScript, you can simply put this line of code after your mysql_connect statement:

mysql_set_charset('utf8',$connection);

Cheers.


深爱不及久伴 2024-09-02 00:53:49

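Can you open the 3rd-party XML source in Firefox and see what it auto-detects as the encoding? Maybe they are using plain old ISO-8859-1, UTF-16, or something else.

If they declare it to be UTF-8 but serve something else, though, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though it is sometimes unavoidable, I know).

If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().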
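For the simple two-candidate case, a detection call might look like the sketch below. The wrapper name is hypothetical. Strict mode is passed so that byte sequences invalid in UTF-8 rule that candidate out. Keep in mind that ISO-8859-1 accepts any byte sequence, so this really only answers "is it valid UTF-8 or not".

```php
<?php
// Hypothetical wrapper: choose between UTF-8 and ISO-8859-1 only.
function guess_encoding(string $bytes): string
{
    // Third argument enables strict mode.
    $guess = mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true);

    return $guess === false ? 'unknown' : $guess;
}

// The bytes from the error message, preceded by "Dubl".
$encoding = guess_encoding("Dubl\xED\x6E\x2C\x20");
```

The result can then be passed as the source encoding to mb_convert_encoding() before parsing.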


雄赳赳气昂昂 2024-09-02 00:53:49

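If you download the XML file and open it in Notepad++, you will see that the encoding is set to something other than UTF-8. I had the same problem with XML I made myself, and it was just the encoding in the editor :)

The encoding declaration in the XML does not set the document's encoding; it is only information for a validator or other consumer.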
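Since the parser trusts the declaration, one possible workaround is to leave the bytes alone and correct the label, so that libxml does the conversion itself. A minimal sketch follows. The function name is made up, and it assumes the payload really is ISO-8859-1 throughout and that the declaration sits at the very start of the string.

```php
<?php
// Hypothetical helper: rewrite a wrong "UTF-8" label in the XML declaration.
function relabel_as_latin1(string $xml): string
{
    $relabelled = preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])UTF-8(["\'])/i',
        '${1}ISO-8859-1${2}',
        $xml,
        1
    );

    return $relabelled ?? $xml; // fall back to the input if the regex fails
}

// Declared as UTF-8, but "í" is the single ISO-8859-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<city>Dubl\xEDn</city>";
$doc = simplexml_load_string(relabel_as_latin1($raw));

// SimpleXML hands strings back as UTF-8 regardless of the source encoding.
$city = (string) $doc;
```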


素罗衫 2024-09-02 00:53:49

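I just ran into this problem. It turned out the XML file itself (not the content) was encoded in ISO-8859-1 and not in UTF-8. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file encoding to UTF-8, and lxml then imported it without any problem.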


山川志 2024-09-02 00:53:49

After several tries, I found that the htmlentities() function works.

$value = htmlentities($value);


ゞ花落谁相伴 2024-09-02 00:53:49

What I was facing was solved by what Erik proposed:
https://stackoverflow.com/a/4575802/14934277
It is, in fact, the only way to know whether your data is okay to be printed.

Here is a piece of code that could be useful to anyone out there:

$product_desc = ..;
// Filter your $product_desc here: remove tags, strip, do everything you would do before printing XML
try {
    (new SimpleXMLElement('<sth><![CDATA[' . $product_desc . ']]></sth>'))->asXML();
} catch (Exception $exc) {
    $product_desc = ''; // Don't print trash
}

Note this part:

<![CDATA[]]>

When you build the XML for this check, make sure you pass it the final product that the browser will see, which means your field wrapped in CDATA.
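The same check can be wrapped in a reusable function. The sketch below is one way to do it, with a hypothetical function name. It additionally silences libxml warnings while testing and splits any "]]>" inside the text so the value cannot close the CDATA section early; both are additions, not part of the snippet above.

```php
<?php
// Hypothetical helper: true if $text can be embedded in XML inside CDATA.
function is_printable_in_xml(string $text): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Split "]]>" across two CDATA sections so it cannot end the section.
    $escaped = str_replace(']]>', ']]]]><![CDATA[>', $text);
    $wrapped = '<sth><![CDATA[' . $escaped . ']]></sth>';

    try {
        new SimpleXMLElement($wrapped); // throws if the XML cannot be parsed
        return true;
    } catch (Exception $e) {
        return false;
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

$good = is_printable_in_xml("Dubl\xC3\xADn"); // valid UTF-8
$bad  = is_printable_in_xml("Dubl\xEDn");     // raw ISO-8859-1 byte
```

A caller could then blank out or convert a field whenever the function returns false, as the original snippet does.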