C# - Best approach to parsing a webpage?
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
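As a minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced: the class and method names below (HrefExtractor, GetHrefs) are made up for illustration, while HtmlDocument, LoadHtml, SelectNodes and GetAttributeValue come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

public static class HrefExtractor
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return hrefs;

        foreach (HtmlNode anchor in anchors)
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));

        return hrefs;
    }
}
```

Each value ends up as its own string in the returned list, so they can be stored or processed separately later.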
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
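For the narrow href-only case, a regex along these lines might do. This is only a sketch: the pattern and the names HrefRegex and GetHrefs are illustrative assumptions, not something taken from the question. It handles double-quoted, single-quoted and unquoted values, but it will still be fooled by a ">" inside an earlier attribute value, or by anchors sitting inside comments or scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegex
{
    // Matches an href attribute inside an <a ...> start tag.
    // The value may be double-quoted, single-quoted, or unquoted.
    // Requiring whitespace before "href" avoids matching attributes such as data-href.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: returns every href value found, in document order.
    public static List<string> GetHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
            hrefs.Add(match.Groups["url"].Value);
        return hrefs;
    }
}
```

The values are returned exactly as written in the markup, so relative URLs and HTML entities are left untouched.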
For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPaths against the nodes you want and get the results returned in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not context-free. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX-based tool, developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for every one of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())
Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
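Since the question is about C#, the SgmlReader option is the one that plugs most directly into the .NET XML classes. A rough sketch, assuming the SgmlReader library (namespace Sgml) is referenced and that it exposes the DocType, CaseFolding and InputStream properties as in the commonly distributed version; SgmlHrefExtractor and GetHrefs are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using Sgml; // third-party SgmlReader library, assumed to be referenced

public static class SgmlHrefExtractor
{
    // Hypothetical helper: reads HTML through SgmlReader, then queries it with XPath.
    public static List<string> GetHrefs(string html)
    {
        var hrefs = new List<string>();

        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";                     // use the built-in HTML support
            sgml.CaseFolding = CaseFolding.ToLower;    // so the XPath can use lowercase names
            sgml.InputStream = new StringReader(html);

            // SgmlReader is an XmlReader, so XmlDocument can load from it directly.
            var doc = new XmlDocument();
            doc.Load(sgml);

            foreach (XmlNode attribute in doc.SelectNodes("//a/@href"))
                hrefs.Add(attribute.Value);
        }

        return hrefs;
    }
}
```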
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.
This one from here on RegExLib should get you started.
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
(... href= could also be in the document and not belong to an anchor tag)?
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser