How can I get all the content within a tag using HTML Agility Pack?
So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:
<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td>
The data I want is in here <br />
and it's seperated by these annoying <br /> 's.
No id's, classes, or even a single <p> tag. </p> Just a bunch of <br /> tags.
</td>
</tr>
</table>
So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?
Update: Here is how I'm loading my doc:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);
Comments (5)
Since you are already using HTML Agility Pack, I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is XPath. In this case you could use something like this:
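A minimal sketch of such an XPath query, assuming the HtmlAgilityPack package is referenced and the table can be identified by its cellspacing attribute as in the question. The class and method names (SecondRowReader, ReadSecondRow) are made up for illustration and are not part of the library:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class SecondRowReader
{
    // Returns the text of the second row's cell, one entry per <br />-separated chunk.
    public static string[] ReadSecondRow(HtmlDocument doc)
    {
        // XPath positions are 1-based, so tr[2] is the second row.
        // Assumption: the table is the one with cellspacing="3".
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//table[@cellspacing='3']//tr[2]/td");

        // SelectSingleNode returns null when nothing matches.
        if (cell == null)
        {
            return new string[0];
        }

        var chunks = new List<string>();
        var current = new StringBuilder();

        foreach (HtmlNode child in cell.ChildNodes)
        {
            if (child.Name == "br")
            {
                // A <br /> ends the current chunk.
                chunks.Add(current.ToString());
                current.Clear();
            }
            else
            {
                // Text nodes and any nested elements contribute their text.
                current.Append(child.InnerText);
            }
        }
        chunks.Add(current.ToString());

        // Decode entities, trim whitespace, and drop empty chunks.
        return chunks
            .Select(c => HtmlEntity.DeEntitize(c).Trim())
            .Where(c => c.Length > 0)
            .ToArray();
    }
}
```

With the doc loaded as in the question, calling SecondRowReader.ReadSecondRow(doc) would give the pieces of text between the br tags. If only the raw text is needed, reading cell.InnerText directly is the simpler option.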
"Something else" is the best answer -- HTML is best parsed by an HTML parser rather than via regular expressions. I'm no C# expert, but I hear the HTML Agility Pack is well-liked for this purpose.
I'd say som̡et̨hińg Else
You'd probably get better mileage with an XML parser.
If you're using the Agility Pack already, then it's just a matter of using something like
doc.DocumentNode.SelectNodes("//table[@cellspacing='3']")
to get the table in the document. Try looking through the documentation and coding examples. Since you already have structured data, it's ridiculous to go back to the text data and reparse.
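As a rough sketch of how that call might be used, assuming the HtmlAgilityPack package is referenced: the helper below (TableRows.RowTexts, a name made up for this example) collects the text of every row in the matching tables. Note that SelectNodes returns null rather than an empty collection when nothing matches, so the result is checked before iterating.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class TableRows
{
    // Returns the trimmed text of each row in every table with cellspacing="3".
    public static List<string> RowTexts(HtmlDocument doc)
    {
        var result = new List<string>();

        HtmlNodeCollection tables =
            doc.DocumentNode.SelectNodes("//table[@cellspacing='3']");
        if (tables == null)
        {
            return result; // no matching table in the document
        }

        foreach (HtmlNode table in tables)
        {
            // The leading "." keeps the query relative to this table.
            HtmlNodeCollection rows = table.SelectNodes(".//tr");
            if (rows == null)
            {
                continue;
            }

            foreach (HtmlNode row in rows)
            {
                result.Add(HtmlEntity.DeEntitize(row.InnerText).Trim());
            }
        }

        return result;
    }
}
```

For the table in the question, the second entry of the returned list would hold the text of the row being asked about.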