Can't parse malformed XML

Posted 2024-08-06 10:07:37

I've been trying to parse this feed. If you click on that link, you'll notice that even the browser can't parse it correctly.

Anyway, my hosting service won't let me use simplexml_load_file, so I've been using cURL to get it then loading the string into the DOM, like this:

$dom = new DOMDocument;
// loadXML() returns false on failure; $dom itself is always truthy
if (!$dom->loadXML($rawXML)) {
 echo 'Error while parsing the document';
 exit;
}
$xml = simplexml_import_dom($dom);

But I get errors ("DOMDocument::loadXML() [domdocument.loadxml]: Entity 'nbsp' not defined in Entity"). I then tried SimpleXMLElement, without luck: it shows the same error ("parser error : Entity 'nbsp' not defined", etc.) because of the HTML in that one element.

$xml = new SimpleXMLElement($rawXML);

So my question is, how do I skip/ignore/remove that element so I can parse the rest of the data?


Edit: Thanks to mjv for the solution! I just did this (for others who have the same trouble):

$rawXML = str_replace('<description>','<description><![CDATA[',$rawXML);
$rawXML = str_replace('</description>',']]></description>',$rawXML);

Comments (2)

梦萦几度 2024-08-13 10:07:37

You're probably going to need to manipulate the source code with something like:

$xml = @file_get_contents('http://www.wow-europe.com/realmstatus/index.xml');
if ( $xml ) {
    // Escape the ampersand so the parser sees literal text, not an undefined entity
    $xml = preg_replace( '/&nbsp;/', '&amp;nbsp;', $xml );
    $xml = new SimpleXMLElement($xml);
    var_dump($xml);
}

That is, fix up the source before feeding it to an XML parser. I'd love to recommend some other way, but as far as I know this is the only one.
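If the feed uses other HTML named entities besides &nbsp;, the same pre-processing idea can be driven by a small lookup table that maps each entity to its numeric character reference, which XML accepts without a DTD. The sketch below is only an illustration: the function name replaceHtmlEntities, the constant HTML_ENTITY_MAP, and the sample string are made up for this example, and the table lists just a handful of entities.

```php
<?php
// Hypothetical lookup table: HTML named entities mapped to numeric references.
// Only a few common ones are listed; extend it to match the feed.
const HTML_ENTITY_MAP = [
    '&nbsp;'   => '&#160;',
    '&copy;'   => '&#169;',
    '&reg;'    => '&#174;',
    '&mdash;'  => '&#8212;',
    '&hellip;' => '&#8230;',
];

function replaceHtmlEntities(string $rawXML): string
{
    // strtr() with an array swaps every key for its value in a single pass.
    // The five entities XML predefines (&amp; &lt; &gt; &quot; &apos;) are
    // not in the table, so they are left untouched.
    return strtr($rawXML, HTML_ENTITY_MAP);
}

// Made-up sample data, not the real feed.
$sample = '<item><description>Realm&nbsp;up &amp; running &copy; 2009</description></item>';
$xml = new SimpleXMLElement(replaceHtmlEntities($sample));
echo $xml->description, "\n";
```

Unlike escaping the ampersand, this keeps the character itself in the parsed text (a real non-breaking space rather than the literal string "&nbsp;"), which may or may not be what the consumer of the data wants.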

Edit: I think you can actually replace <description> with <description><![CDATA[ and so forth:

<?php
$xml = @file_get_contents('http://www.wow-europe.com/realmstatus/index.xml');
$xml = preg_replace( '/<description>/', '<description><![CDATA[', $xml );
$xml = preg_replace( '/<\/description>/', ']]></description>', $xml );
$xml = new SimpleXMLElement($xml);
var_dump($xml);

You'd need to do this for each element which contains character data.
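Repeating the two replacements for every affected element can be folded into one helper. The following is a rough sketch rather than a tested solution for this particular feed: wrapInCdata and the sample markup are invented for illustration, and it assumes the listed elements are never nested inside themselves. It also guards two cases the plain replacement does not cover: content that is already a CDATA section, and content containing a literal "]]>".

```php
<?php
// Hypothetical helper: wrap the content of the named elements in CDATA sections.
function wrapInCdata(string $xml, array $tags): string
{
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '~');
        // Opening tag (optionally with attributes, but not self-closing),
        // then the shortest possible content, then the closing tag.
        $pattern = '~(<' . $name . '(?:\s[^>]*?)?(?<!/)>)(.*?)(</' . $name . '\s*>)~s';
        $result = preg_replace_callback($pattern, function (array $m): string {
            $content = $m[2];
            // Leave the element alone if it is already wrapped.
            if (strpos(ltrim($content), '<![CDATA[') === 0) {
                return $m[0];
            }
            // A literal "]]>" would end the section early, so split it in two.
            $content = str_replace(']]>', ']]]]><![CDATA[>', $content);
            return $m[1] . '<![CDATA[' . $content . ']]>' . $m[3];
        }, $xml);
        // preg_replace_callback() returns null on a regex error; keep the input then.
        if ($result !== null) {
            $xml = $result;
        }
    }
    return $xml;
}

// Made-up sample data, not the real feed.
$raw = '<item><description>blah <br> &nbsp; blah</description><title>t</title></item>';
$fixed = wrapInCdata($raw, ['description']);
$item = new SimpleXMLElement($fixed);
echo (string) $item->description, "\n";
```

Casting the element to a string returns the CDATA text. If var_dump() shows the element as empty, passing the LIBXML_NOCDATA option to the SimpleXMLElement constructor merges CDATA sections into ordinary text nodes.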

饭团 2024-08-13 10:07:37

You may need to introduce a pre-parsing step that adds

<![CDATA[

after each <description> tag, and adds

]]>

before each </description> tag.
Specifically (see meder's response for the corresponding PHP snippet),

<description>blah <br />  blah, blah...</description>
should become
<description><![CDATA[blah <br />  blah, blah...]]></description>

In this fashion, the complete content of the 'description' element is 'escaped', so that any HTML (or even XHTML) construct found in this element that is liable to trip up the XML parsing logic is treated as plain character data. This takes care of the &nbsp; problem you mention, as well as many other common issues.
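To check whether a pre-parsing step actually removes the parse errors, libxml's error buffer can be inspected instead of relying on warnings. This is a minimal sketch under stated assumptions: parseWithErrors is a hypothetical helper name, and both input strings are made-up samples rather than the real feed.

```php
<?php
// Hypothetical helper: parse a string and return [SimpleXMLElement|null, string[]].
function parseWithErrors(string $rawXML): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($rawXML);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$xml === false ? null : $xml, $messages];
}

// Raw content with an undefined entity fails to parse...
[$broken, $brokenErrors] = parseWithErrors('<rss><description>a&nbsp;b</description></rss>');
// ...while the CDATA-wrapped version parses cleanly.
[$ok, $okErrors] = parseWithErrors('<rss><description><![CDATA[a&nbsp;b]]></description></rss>');

foreach ($brokenErrors as $message) {
    echo $message, "\n";
}
```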
