Unable to parse malformed XML
I've been trying to parse this feed. If you click on that link, you'll notice that even the browser can't parse it correctly.
Anyway, my hosting service won't let me use simplexml_load_file, so I've been using cURL to fetch it and then loading the string into the DOM, like this:
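$dom = new DOMDocument;
// loadXML() returns false on failure; $dom itself is always an object,
// so the return value is what has to be checked.
if (!$dom->loadXML($rawXML)) {
    echo 'Error while parsing the document';
    exit;
}
$xml = simplexml_import_dom($dom);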
But I get errors ("DOMDocument::loadXML() [domdocument.loadxml]: Entity 'nbsp' not defined in Entity"). I then tried SimpleXMLElement without luck: it shows the same error ("parser error : Entity 'nbsp' not defined", etc.) because of the HTML in that one element.
$xml = new SimpleXMLElement($rawXML);
So my question is, how do I skip/ignore/remove that element so I can parse the rest of the data?
Edit: Thanks to mjv for the solution! I just did this (for others who have the same trouble):
$rawXML = str_replace('<description>','<description><![CDATA[',$rawXML);
$rawXML = str_replace('</description>',']]></description>',$rawXML);
Comments (2)
You're probably going to need to manipulate the source string before feeding it to an XML parser. I'd love to recommend some other way, but AFAIK this is the only way.
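One way such a manipulation might look, as a minimal sketch: it assumes the only offending entity is &nbsp; and swaps it for the numeric character reference &#160;, which XML parsers accept without a DTD. The sample feed string is hypothetical, since the real feed's structure is not shown in this thread.

```php
<?php
// Hypothetical sample input standing in for the cURL response.
$rawXML = '<rss><channel><item><description>Hello&nbsp;world</description></item></channel></rss>';

// &nbsp; is an HTML entity that XML does not predefine; &#160; is the
// equivalent numeric character reference for a no-break space.
$cleanXML = str_replace('&nbsp;', '&#160;', $rawXML);

// Collect parse errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($cleanXML);
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    exit(1);
}

// The description text now contains a real no-break space character.
echo $xml->channel->item->description, "\n";
```

This only covers the one entity named in the error message. A feed that embeds arbitrary HTML would likely trip the parser on other entities or unclosed tags, which is where the CDATA approach is more general.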
Edit: I think you can actually replace <description> with <description><![CDATA[ and </description> with ]]></description>, and so forth. You'd need to do this for each element which contains character data.
You may need to introduce a pre-parsing step which would add
<![CDATA[
after each <description> tag, and add
]]>
before each </description> tag.
(See meder's response for the corresponding PHP snippet.)
In this fashion, the complete content of the 'description' element would be 'escaped', so that any HTML (or even XHTML) construct found in this element that is liable to throw off the XML parsing logic would be ignored. This would take care of the problem you mention, and also of many other common issues.
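The pre-parsing step described above could be sketched as follows. This is an illustration under assumptions, not the exact code from the thread: the helper name wrapDescriptionsInCdata and the sample feed are hypothetical, and the pattern assumes <description> tags carry no attributes. Unlike a plain str_replace, it skips descriptions that are already CDATA sections and splits any literal "]]>" that would otherwise end the section early.

```php
<?php
// Hypothetical helper; not part of PHP or any library.
function wrapDescriptionsInCdata(string $rawXML): string
{
    $result = preg_replace_callback(
        '~<description>(.*?)</description>~s',
        function (array $m): string {
            $body = $m[1];
            // Leave the element alone if it is already a CDATA section.
            if (strpos(ltrim($body), '<![CDATA[') === 0) {
                return $m[0];
            }
            // A literal "]]>" would close the section early, so split it
            // across two adjacent CDATA sections.
            $body = str_replace(']]>', ']]]]><![CDATA[>', $body);
            return '<description><![CDATA[' . $body . ']]></description>';
        },
        $rawXML
    );
    // preg_replace_callback() returns null on a regex error; fall back
    // to the untouched input in that case.
    return $result ?? $rawXML;
}

// Hypothetical sample: HTML with an undefined entity and an unclosed tag.
$rawXML = '<rss><channel><item><title>T</title>'
        . '<description><p>Hello&nbsp;<b>world</b><br></p></description>'
        . '</item></channel></rss>';

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$xml = new SimpleXMLElement(wrapDescriptionsInCdata($rawXML), LIBXML_NOCDATA);
echo $xml->channel->item->description, "\n";
```

The description comes back as a raw HTML string, so it can be passed to an HTML parser or stripped separately if only the text is needed.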