C# - What's the best way to parse a web page?

Posted on 2024-07-10 01:14:16

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

Comments (8)

作业与我同在 2024-07-17 01:14:16

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(yourHtmlString); // load your HTML
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;
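For the original goal here (pulling the href values out and keeping them as strings), a minimal sketch along these lines should work. It assumes the Html Agility Pack assembly is referenced; the variable names and sample input are just placeholders:

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class HrefExtractor
{
    static void Main()
    {
        // Stand-in for the page source you already have in a string.
        string html = "<html><body><a href=\"http://example.com\">x</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        List<string> hrefs = anchors == null
            ? new List<string>()
            : anchors.Select(a => a.GetAttributeValue("href", "")).ToList();

        foreach (string href in hrefs)
            Console.WriteLine(href);
    }
}
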
忆悲凉 2024-07-17 01:14:16

Regular expressions are one way to do it, but they can be problematic.

Most HTML pages can't be parsed using standard XML techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, then there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
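If all that's needed is the href values, a quick sketch along these lines is one way to get them into strings. It's deliberately simple (it assumes quoted attribute values and is not a full HTML parser), and the names and sample input are just placeholders:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class HrefRegexSketch
{
    static void Main()
    {
        // Stand-in for the saved page source.
        string html = "<a href='page.html'>one</a> <a href=\"http://example.com\">two</a>";

        // Capture whatever sits between the quotes that follow href=.
        var pattern = new Regex("href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

        var hrefs = new List<string>();
        foreach (Match m in pattern.Matches(html))
            hrefs.Add(m.Groups[1].Value);

        hrefs.ForEach(Console.WriteLine); // prints page.html and http://example.com
    }
}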

我最亲爱的 2024-07-17 01:14:16

For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack @ http://www.codeplex.com/htmlagilitypack. It lets you write XPaths against the nodes you want and get them returned in a collection.

月亮是我掰弯的 2024-07-17 01:14:16

You probably want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php

There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.

I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

如痴如狂 2024-07-17 01:14:16

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:

Here are three good tools:

  1. TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. This is
    a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
    SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result. (A short usage sketch appears below, after the quoted description.)
    Download the zip file including the standalone executable and the full source code: SgmlReader.zip

  3. An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for everyone of us.

From the description:

"d:htmlparse(string)
 d:htmlparse(string,namespace,html-mode)

  The one argument form is equivalent to)
  d:htmlparse(string,'http://ww.w3.org/1999/xhtml',true()))

  Parses the string as HTML and/or XML using some inbuilt heuristics to)
  control implied opening and closing of elements.

  It doesn't have full knowledge of HTML DTD but does have full list of
  empty elements and full list of entity definitions. HTML entities, and
  decimal and hex character references are all accepted. Note html-entities
  are recognised even if html-mode=false().

  Element names are lowercased (if html-mode is true()) and placed into the
  namespace specified by the namespace parameter (which may be "" to denote
  no-namespace unless the input has explict namespace declarations, in
  which case these will be honoured.

  Attribute names are lowercased if html-mode=true()"

Read a more detailed description here.
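For option 2 above (SgmlReader), a rough sketch of typical usage looks like the following. It assumes the SgmlReader assembly is referenced, and the variable names, sample input, and XPath are just placeholders; since SgmlReader behaves like any XmlReader, the result can be fed straight into an XmlDocument:

using System;
using System.IO;
using System.Xml;
using Sgml;

class SgmlReaderSketch
{
    static void Main()
    {
        // Stand-in for the saved page source.
        string html = "<html><body><a href=\"http://example.com\">x</a></body></html>";

        var sgmlReader = new SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.CaseFolding = CaseFolding.ToLower; // normalise tag names
        sgmlReader.InputStream = new StringReader(html);

        // SgmlReader is an XmlReader, so XmlDocument can load from it directly.
        var doc = new XmlDocument();
        doc.Load(sgmlReader);

        foreach (XmlNode anchor in doc.SelectNodes("//a[@href]"))
            Console.WriteLine(anchor.Attributes["href"].Value);
    }
}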

Hope this helped.

Cheers,

Dimitre Novatchev.

筑梦 2024-07-17 01:14:16

I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

This pattern from RegExLib should get you started.

我是有多爱你 2024-07-17 01:14:16

You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.

On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:

  • Do you know if the "href" text will always be lower case?
  • Do you know if it will always use double quotes, single quotes, or nothing around the url?
  • Is it always a valid URL, or do you need to account for things like '#', javascript statements, and the like?
  • Is it possible you're working with a document whose content describes html features (i.e., href= could also appear in the text and not belong to an anchor tag)?
  • What else can you tell us about the document?
长梦不多时 2024-07-17 01:14:16

I've linked some code here that will let you use "LINQ to HTML"...

Looking for C# HTML parser
