Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

Posted on 2024-08-26 00:53:49

I'm getting the error:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

This happens when I try to process an XML response from a 3rd party source with simplexml_load_string. The raw XML response does declare its encoding:

<?xml version="1.0" encoding="UTF-8"?>

Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.

I'm unable to get the 3rd party to sort out their XML.

How can I pre-process the XML and fix the encoding incompatibilities?

Is there a way to detect the correct encoding of an XML file?

Comments (11)

掀纱窥君容 2024-09-02 00:53:49

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.

Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them is to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (They won't, but you can at least drop the invalid characters so that the XML loads.)
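
A minimal sketch of the first workaround, assuming the feed is either valid UTF-8 or plain ISO-8859-1 and never a mix of both, and that the mbstring and SimpleXML extensions are loaded. The helper name to_utf8_if_needed and the sample XML are made up for illustration. mb_convert_encoding() stands in for utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: assumes $raw is either valid UTF-8 or ISO-8859-1.
function to_utf8_if_needed(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave it untouched
    }
    // Same byte-to-code-point mapping as utf8_encode().
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Sample input: "Dublín" with í stored as the single ISO-8859-1 byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn, Irlanda</city>";
$xml = simplexml_load_string(to_utf8_if_needed($raw));
echo (string) $xml, "\n";
```

Checking first avoids double-encoding a response that is already correct, for example once the provider fixes the feed.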

Or you can take the long, long road and validate/fix the sequences yourself. That will take you a while, depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know of any.
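
For what the long road might look like, here is a rough sketch under one big assumption: every byte that is not part of a well-formed UTF-8 sequence is an ISO-8859-1 character. The function name repair_mixed_utf8 is hypothetical. It cannot tell apart Latin-1 text that happens to form a valid UTF-8 sequence, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: keeps well-formed UTF-8 sequences and re-encodes every
// other byte as if it were ISO-8859-1 (an assumption about the feed).
function repair_mixed_utf8(string $bytes): string
{
    // Well-formed UTF-8: ASCII, then 2-, 3- and 4-byte sequences,
    // excluding overlong forms and surrogates.
    $valid = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    $fixed = preg_replace_callback(
        '/(?:' . $valid . ')|(.)/s',
        function (array $m): string {
            // Group 1 is only set when a single stray byte was matched.
            return isset($m[1]) && $m[1] !== ''
                ? mb_convert_encoding($m[1], 'UTF-8', 'ISO-8859-1')
                : $m[0];
        },
        $bytes
    );

    return $fixed ?? $bytes; // on a preg error, return the input unchanged
}

// Mixed input: Latin-1 "í" (0xED) next to a correct UTF-8 "á" (0xC3 0xA1).
echo repair_mixed_utf8("Dubl\xEDn y M\xC3\xA1laga"), "\n";
```

The callback runs once per character, which is slow on large documents but keeps the pattern easy to check against the UTF-8 byte ranges.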

Either way, notify your data provider that they're sending invalid data so that they can fix it.


Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.

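// Heuristic: re-encode every byte in 0xA1-0xFF that is not followed by at least
// two UTF-8 continuation bytes, treating it as a stray ISO-8859-1 character.
// Caveat: bytes of valid multibyte UTF-8 sequences can match too (e.g. both
// bytes of "í", 0xC3 0xAD), so already-correct text may be double-encoded.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

// Note: utf8_encode() is deprecated as of PHP 8.2.
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}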
染墨丶若流云 2024-09-02 00:53:49

I solved this using

$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
花心好男孩 2024-09-02 00:53:49

If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use this call to drop them:

$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
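
Since the behaviour of //IGNORE depends on the underlying iconv implementation, here is a sketch of the same "drop what is invalid" idea using mbstring instead. The helper name drop_invalid_utf8 is made up, and it assumes the mbstring extension is available. Either way, the bad bytes are removed, not repaired, so the í in Dublín is lost.

```php
<?php
// Hypothetical helper: silently drops every byte sequence that is not valid UTF-8.
function drop_invalid_utf8(string $bytes): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character('none');       // drop invalid input instead of inserting "?"
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);    // restore the setting
    return $clean;
}

// 0xED is not valid UTF-8 here, so it is removed from the result.
$clean = drop_invalid_utf8("Dubl\xEDn, Irlanda");
var_dump(mb_check_encoding($clean, 'UTF-8'));
```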
停滞 2024-09-02 00:53:49

We recently ran into a similar issue and were unable to find any obvious cause. There turned out to be a control character in our string, but when we output that string to the browser, the character was not visible unless we copied the text into an IDE.

We managed to solve our problem thanks to this post and this:

preg_replace('/[\x00-\x1F\x7F]/', '', $input);
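
One thing to watch: the range \x00-\x1F also covers tab, line feed and carriage return, which are legal in XML. A slightly narrower variant, as a sketch (the function name strip_control_chars is hypothetical), keeps those three and removes the other control characters plus DEL:

```php
<?php
// Hypothetical helper: removes ASCII control characters except \t, \n and \r.
function strip_control_chars(string $input): string
{
    // \x09 (tab), \x0A (LF) and \x0D (CR) are deliberately left out of the class.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input) ?? $input;
}

echo strip_control_chars("a\x00b\tc\n\x1Fd"), "\n";
```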

夏夜暖风 2024-09-02 00:53:49

Instead of using JavaScript, you can simply put this line of code after your mysql_connect call:

mysql_set_charset('utf8',$connection);

Cheers.

深爱不及久伴 2024-09-02 00:53:49

Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.

If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).

If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
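
A sketch of what that could look like, assuming only UTF-8 and ISO-8859-1 are candidates and the mbstring extension is loaded. Strict mode (third argument) makes the invalid 0xED byte rule out UTF-8. Keep in mind that ISO-8859-1 accepts any byte sequence, so it is a fallback guess, not a positive identification.

```php
<?php
// Sample bytes as served by the feed: í is the single byte 0xED.
$latin1 = "Dubl\xEDn, Irlanda";

// Strict detection: UTF-8 is rejected because 0xED 0x6E is not a valid sequence.
$detected = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);
echo $detected, "\n";

// Convert from whatever was detected to UTF-8 before parsing.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', $detected);
echo $utf8, "\n";
```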

雄赳赳气昂昂 2024-09-02 00:53:49

If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)

The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it's only information for a validator or another consumer.
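
Because the parser trusts that declaration, one possible workaround is to rewrite it to match the real bytes before parsing. This is only a sketch: the helper name relabel_as_latin1 is made up, and it assumes the payload really is ISO-8859-1 throughout and starts with the XML declaration.

```php
<?php
// Hypothetical helper: swaps the declared encoding from UTF-8 to ISO-8859-1.
function relabel_as_latin1(string $raw): string
{
    $fixed = preg_replace(
        '/^(<\?xml[^>]*encoding=)(["\'])UTF-8\2/i',
        '${1}${2}ISO-8859-1${2}',
        $raw,
        1
    );
    return $fixed ?? $raw; // on a preg error, return the input unchanged
}

$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn, Irlanda</city>";
$xml = simplexml_load_string(relabel_as_latin1($raw));
echo (string) $xml, "\n"; // SimpleXML hands the text back as UTF-8
```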

素罗衫 2024-09-02 00:53:49

I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file encoding to utf-8, and lxml imported it with no issues.

山川志 2024-09-02 00:53:49

After several tries I found that the htmlentities function works.

$value = htmlentities($value);
ゞ花落谁相伴 2024-09-02 00:53:49

What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.

And here is a piece of code that could be useful to anyone out there:

$product_desc = ..;
// Filter $product_desc here: remove tags, strip, do everything you would do before printing XML
try {
    (new SimpleXMLElement('<sth><![CDATA[' . $product_desc . ']]></sth>'))->asXML();
} catch (Exception $exc) {
    $product_desc = ''; // Don't print trash
}

Note this part:

<![CDATA[]]>

When you build the test XML, be sure to pass it the final product a browser would see, meaning your field wrapped in CDATA.
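
If you would rather see why a string fails than just blank it, a variation on the same "try to parse it" idea is to collect libxml's errors. A minimal sketch, with a hypothetical function name xml_parse_errors and made-up sample documents:

```php
<?php
// Hypothetical helper: returns libxml's error messages for $xml (empty array if it parses).
function xml_parse_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();
    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting
    return $messages;
}

$bad  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn, Irlanda</city>";
$good = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xC3\xADn, Irlanda</city>";
print_r(xml_parse_errors($bad));
```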

淡忘如思 2024-09-02 00:53:49

When generating mapping files using Doctrine, I ran into the same issue. I fixed it by removing all the comments that some fields had in the database.
