Parsing XHTML with a DTD using XDocument
I need to get plain text from XHTML documents.
I am sure I read somewhere on this site that XDocument on WP7 does not support DTDs, but I cannot find the post now. When I try to parse XHTML that has a DTD using XDocument, it throws NotSupportedException. The last call in the stack trace is System.Xml.XmlTextReaderImpl.ParseDoctypeDecl().
The result is exactly the same even if I supply a dummy XmlResolver: the resolver never actually gets called. (I was following an answer in this question.)
So I assume that WP7 really doesn't support it.
Still, I need to parse XHTML documents. So far I have come up with two (more or less workable) solutions:
1. Remove the DTD declaration, after which parsing works. However, the XHTML may contain character entities, and an exception is thrown for any entity that is not one of the predefined XML entities (&amp;, &lt;, &gt;, &quot;, &apos;). So this solution works only for some XHTML documents.
2. Use Regex. It is quite easy to strip all the HTML tags, but the 'entity problem' remains, and I don't think running a replace for every entity is a good solution.
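To make the first approach concrete, here is a minimal sketch that strips the DOCTYPE, rewrites a few named entities as numeric character references (which any XML parser accepts without a DTD), and then lets XDocument extract the text. The class name `XhtmlText`, the method `ToPlainText`, and the tiny entity table are all hypothetical; a real table would need to cover every entity declared in the XHTML DTDs. It is written against the desktop .NET API and has not been verified on WP7.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XhtmlText
{
    // Hypothetical, deliberately small entity table (name -> code point).
    // The five predefined XML entities are intentionally absent, so they
    // are left untouched for the XML parser to handle.
    private static readonly Dictionary<string, int> NamedEntities =
        new Dictionary<string, int>
        {
            { "nbsp", 160 }, { "copy", 169 }, { "eacute", 233 },
            { "ndash", 8211 }, { "mdash", 8212 }, { "hellip", 8230 }
        };

    public static string ToPlainText(string xhtml)
    {
        // 1. Drop the DOCTYPE so the reader never has to parse it.
        //    Assumes the declaration has no internal subset ([...]).
        string noDtd = Regex.Replace(
            xhtml, @"<!DOCTYPE[^>]*>", "", RegexOptions.IgnoreCase);

        // 2. Rewrite known named entities as numeric character references.
        string noEntities = Regex.Replace(
            noDtd, @"&([A-Za-z][A-Za-z0-9]*);", m =>
            {
                int codePoint;
                if (NamedEntities.TryGetValue(m.Groups[1].Value, out codePoint))
                    return "&#" + codePoint + ";";
                return m.Value; // unknown names stay as-is and will still throw
            });

        // 3. Parse, then concatenate all text under <body> (or the root
        //    element if there is no XHTML-namespaced <body>).
        XDocument doc = XDocument.Parse(noEntities);
        XNamespace ns = "http://www.w3.org/1999/xhtml";
        XElement body = doc.Root.Element(ns + "body") ?? doc.Root;
        return body.Value;
    }
}
```

Calling `XhtmlText.ToPlainText(xhtml)` returns the concatenated text nodes. Any entity missing from the table still causes XDocument.Parse to throw, which at least makes gaps in the table visible rather than silently dropping characters.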
Has anyone faced or solved this? Can you give me some advice, or correct me if I am wrong about something?
Thanks.
Comments (1)
HTML Agility Pack is a library for parsing HTML documents. According to this forum discussion, there is a version of it for WP7:
http://htmlagilitypack.codeplex.com/discussions/225113
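
As a rough sketch of how that might look, the following uses the desktop HTML Agility Pack API (`HtmlDocument.LoadHtml`, `HtmlNode.Descendants`, `HtmlNode.InnerText`, `HtmlEntity.DeEntitize`). Whether the WP7 build exposes exactly the same members is an assumption here, and the class name `HapText` is hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HapText
{
    public static string ToPlainText(string html)
    {
        // The parser is tolerant HTML, not XML, so the DOCTYPE is not
        // resolved and undeclared entities do not raise exceptions.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Prefer <body> so that <title> text is skipped; fall back to the
        // whole document if no <body> element is found.
        HtmlNode body = doc.DocumentNode.Descendants("body").FirstOrDefault()
                        ?? doc.DocumentNode;

        // InnerText leaves entities such as &eacute; encoded;
        // DeEntitize converts them to the characters they stand for.
        return HtmlEntity.DeEntitize(body.InnerText);
    }
}
```

This sidesteps both the DTD and the entity problem, at the cost of treating the input as loose HTML rather than validating it as XML.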