Is there an equivalent of XmlReader for HTML in .NET?
I've used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like the fact that it only offers a DOM model.
On large documents, or documents with deep nesting, it is possible to hit stack overflow or out-of-memory exceptions. In general, a DOM-based parser also uses significantly more memory than a streaming one, because it holds the whole tree in memory even when the process consuming the HTML only needs a few elements at a time.
Does anyone know of a decent HTML parser for .NET that lets you parse HTML the way the XmlReader class parses XML, i.e. in a forward-only, streaming manner?
Comments (2)
I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader
As others have said, HTML doesn't follow XML's well-formedness rules, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.
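For illustration, here is a minimal sketch of how SgmlReader is typically wired up. It assumes the SgmlReader package is referenced and that its `Sgml.SgmlReader` class exposes the `DocType`, `CaseFolding`, `WhitespaceHandling` and `InputStream` properties. The `HtmlStreamDemo` class, the `OpenHtml` helper and the sample markup are made up for this example. Because the reader derives from `XmlReader`, the consuming loop looks the same as it would for XML.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // third-party SgmlReader package, assumed to be referenced

public static class HtmlStreamDemo
{
    // Hypothetical helper: wraps HTML text in a forward-only reader.
    public static XmlReader OpenHtml(TextReader input)
    {
        return new SgmlReader
        {
            DocType = "HTML",                             // parse using HTML rules
            CaseFolding = CaseFolding.ToLower,            // report <P> and <p> both as "p"
            WhitespaceHandling = WhitespaceHandling.None, // skip whitespace-only nodes
            InputStream = input
        };
    }

    public static void Main()
    {
        // Deliberately sloppy markup: the <p> tags are never closed.
        const string html = "<p>First<p>Second <a href=\"/x\">link</a>";

        using (XmlReader reader = OpenHtml(new StringReader(html)))
        {
            // Forward-only loop: only the current node is available.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    Console.WriteLine(reader.Name);
            }
        }
    }
}
```

The reader has to guess where missing end tags belong, so the element structure it reports for badly broken pages may not match what a browser would build.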
The problem is that HTML can be malformed. You can't know which tag is missing its end tag (or which tags are nested in the wrong order) until you have parsed a larger part of the document.
If the documents that you'll be parsing are well formed, why not just use XmlReader?
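As a rough sketch of that suggestion, the code below streams a well-formed (XHTML-style) document through the framework's own `XmlReader` and collects link targets without building a DOM. The `LinkScanner` class, the `ReadLinks` method and the sample markup are invented for this example, and it only works when the input really is well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class LinkScanner
{
    // Walks the document once, front to back, and collects the href
    // value of every <a> element that has one.
    public static List<string> ReadLinks(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip a DOCTYPE instead of throwing
            IgnoreComments = true
        };

        var links = new List<string>();
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                {
                    string href = reader.GetAttribute("href");
                    if (href != null)
                        links.Add(href);
                }
            }
        }
        return links;
    }

    public static void Main()
    {
        const string xhtml =
            "<html><body><p><a href=\"/one\">One</a></p><a href=\"/two\">Two</a></body></html>";

        foreach (string link in ReadLinks(new StringReader(xhtml)))
            Console.WriteLine(link);
    }
}
```

Any markup that is not well-formed XML, such as an unclosed `<br>` or an HTML-only entity like `&nbsp;`, makes `Read()` throw an `XmlException`, which is the limitation described above.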