How do I get all the content within a <td> tag using HTML Agility Pack?

Posted on 2024-09-05 08:14:33


So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td> 
The data I want is in here <br /> 
and it's seperated by these annoying <br /> 's.

No id's, classes, or even a single <p> tag. </p> Just a bunch of <br />  tags.
</td> 
</tr> 
</table>

So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

Update: Here is how I'm loading my doc:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);


鸠魁 2024-09-12 08:14:33

Since you are already using the HTML Agility Pack, I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is XPath. In this case you could use something like this:

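using System.Linq;      // for Single()
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// The cell in the second row of the table with cellspacing="3".
// SelectNodes returns null when nothing matches, so Single() throws in that case.
HtmlNode node = doc.DocumentNode
                   .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
                   .Single();
// InnerText returns the cell's text without the <br /> tags.
string text = node.InnerText;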
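
If you need the pieces between the <br /> tags separately rather than as one string, here is a minimal sketch of one way to do it. It assumes the same table layout as in the question. The class name CellText and the method SplitCellLines are hypothetical names chosen for illustration and are not part of HTML Agility Pack. It walks the text nodes under the cell instead of reading InnerText.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CellText
{
    // Hypothetical helper: returns the non-empty text fragments found in
    // the second row's cell, one entry per run of text between tags.
    public static List<string> SplitCellLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        if (cell == null)
            return new List<string>();

        // Descendants() also reaches text nested in child tags such as <p>.
        return cell.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```

Note that this splits on every tag boundary, not only on <br />, so text wrapped in a <p> inside the cell comes back as its own entry. With the loading code from the question you could pass doc.DocumentNode.OuterHtml, or adapt the method to take the HtmlDocument directly.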
