防弹 SimpleXMLElement
每个人都知道我们应该始终使用 DOM 技术而不是正则表达式来从 HTML 中提取内容,但我感觉我永远不能相信 SimpleXML 扩展或类似的扩展。
我现在正在编写 OpenID 实现,并且尝试使用 SimpleXML 进行 HTML 发现 - 但我的第一次测试(使用 alixaxel.myopenid.com)产生了很多错误:
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: © 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID™ and the myOpenID™ website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID™ and the myOpenID™ website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
我记得有一种方法可以使 SimpleXML 始终如果文档包含错误或不包含错误,则独立解析文件 - 虽然我不记得具体的实现,但我认为它涉及使用 DOMDocument。确保 SimpleXML 始终解析任何给定文档的最佳方法是什么?
并且请不要建议使用 Tidy,我认为该扩展速度很慢并且在许多系统上不可用。
Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones.
I'm coding a OpenID implementation right now, and I tried using SimpleXML to do the HTML discovery - but my very first test (with alixaxel.myopenid.com) yielded a lot of errors:
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: © 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID™ and the myOpenID™ website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID™ and the myOpenID™ website are in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6
I recall there was a way to make SimpleXML always parse a file, independently if the document contains errors or not - I can't remember the specific implementation though, but I think it involved using DOMDocument. What is the best way to make sure SimpleXML always parses any given document?
And please don't suggest using Tidy, I think the extension is slow and it's not available on many systems.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以使用 DOM 的 loadHTML 加载 HTML,然后将结果导入到 SimpleXML。
IIRC,它仍然会被一些的东西噎住,但它会接受现实世界中损坏的网站中存在的几乎所有东西。
You can load the HTML with DOM's loadHTML then import the result to SimpleXML.
IIRC, it will still choke on some stuff but it will accept pretty much anything that exists in the real world of broken websites.
你总是可以尝试 SAX 解析器...对错误更稳健。
对于大型 XML 可能效率不高。
you could always try a SAX parser... A bit more robust to errors.
May not be as efficient on large XML.