Parsing broken RSS feeds with Perl
I would like to be able to parse RSS and Atom feeds that contain
non-valid XML. The errors I have encountered and would like to fix
include "simple" things such as an entity reference like &gt where the
closing ; is missing, missing closing tags and closing tags that appear
in the wrong order.
I would like to ignore the question whether in theory it makes any
sense to attempt parsing malformed XML documents at all. One
"technical" term that seems to come rather close to what I want to do
is "tag soup". What existing CPAN modules should I use to build such a
parser that is able to tolerate or correct simple errors like those
described above?
The recover flag to LibXML, if you really must, or XML-Liberal if you really want to go overboard in parsing any old rubbish.
I'm sure you would like to ignore the question of whether parsing non-well-formed documents makes any sense, but ignoring it won't make it go away. Most RSS tools will correctly reject any non-well-formed XML input completely; you should generally follow suit, unless your tool is something unusual like an RSS debugger.
“Tag soup” is a term specifically related to HTML parsing. One of the central ideas of XML (and hence RSS and Atom) is that there is no such thing.
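For illustration, here is a minimal sketch of the recover flag mentioned above, assuming the XML::LibXML module from CPAN is installed. The feed fragment and the XPath expression are made up for the example. How much of the tree survives depends on the kind of error and on the underlying libxml2 version, so treat the result as best-effort rather than as a repaired document.

```perl
use strict;
use warnings;
use XML::LibXML;

# A deliberately broken feed fragment (hypothetical): <title> is never closed.
my $broken = <<'XML';
<rss version="2.0">
  <channel>
    <title>Example feed
    <item><description>First post</description></item>
  </channel>
</rss>
XML

# recover => 1 asks libxml2 to keep parsing after well-formedness errors;
# recover => 2 does the same but suppresses the warnings it would print.
my $doc = eval {
    XML::LibXML->load_xml(string => $broken, recover => 2);
};

if ($doc) {
    # The recovered tree may be incomplete or nested differently from what
    # the feed author intended, so inspect it instead of trusting it.
    print $doc->toString, "\n";
    for my $node ($doc->findnodes('//item/description')) {
        print "description: ", $node->textContent, "\n";
    }
}
else {
    warn "Nothing could be recovered: $@";
}
```

If the recovered tree turns out to be unusable for a given feed, rejecting that feed is still the safer default, in line with the advice above.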