Problem with XPath
Here's a link:
I'm using HTML Agility Pack and I would like to extract, say, the 188 from the 'Odds' column. When asked for the path, my editor gives /html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]. I tried that path with and without the leading html and body steps, but none of the variants returned any results when passed to .DocumentNode.SelectNodes(). I also tried with // at the beginning (which, I assume, is the root of the document tree). What gives?
EDIT:
Code:
WebClient client = new WebClient();
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("/some/xpath/expression"))
{
    Console.WriteLine("[" + node.InnerText + "]");
}
Comments (3)
When scraping sites, you can't safely rely on the exact XPath given by tools. In general those paths are too restrictive, and in fact they catch nothing most of the time. A better approach is to look at the HTML and pick out something more resilient to changes.
Here is a piece of code that works with your example:
It outputs 188. The way it works is:
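As an illustration of what "more resilient" can look like (this is not the answerer's snippet), here is a minimal sketch that finds the 'Odds' column by its header text rather than by position in the tree. The markup is hypothetical, since the real page's HTML is not shown in this thread, and the class name OddsByHeaderDemo is made up for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class OddsByHeaderDemo
{
    static void Main()
    {
        // Hypothetical markup standing in for the real page.
        string html =
            "<table><tr><th>Team</th><th>Odds</th></tr>" +
            "<tr><td>A</td><td>188</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Locate the header cell by its text. SelectNodes returns null
        // when nothing matches, hence the null-conditional operator.
        HtmlNode header = doc.DocumentNode
            .SelectNodes("//th")
            ?.FirstOrDefault(th => th.InnerText.Trim() == "Odds");
        if (header == null)
        {
            Console.WriteLine("(no Odds column found)");
            return;
        }

        // Index of that header among the header cells of its row.
        int column = header.ParentNode.Elements("th").ToList().IndexOf(header);

        // Print the cell at that index for every row of the same table.
        // Descendants("tr") also visits rows of nested tables, if any.
        HtmlNode table = header.Ancestors("table").First();
        foreach (HtmlNode row in table.Descendants("tr"))
        {
            var cells = row.Elements("td").ToList();
            if (cells.Count > column)
                Console.WriteLine(cells[column].InnerText.Trim());
        }
    }
}
```

The lookup keeps working if wrapper divs are added or removed around the table, which is exactly where a fully positional path breaks.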
Try this:
The * catches the mandatory <tbody> element, which is part of the DOM representation of tables even if it is not written in the HTML.
Other than that, it's more robust to select by ID, CSS class name, or some other unique property rather than by hierarchy and document structure:
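A minimal sketch of selecting by attribute rather than by position. The markup is hypothetical: the id "odds" and the class "odds-value" are invented for illustration and are not taken from the page in question.

```csharp
using System;
using HtmlAgilityPack;

class SelectByAttributeDemo
{
    static void Main()
    {
        // Hypothetical markup; the id and class names are assumptions.
        string html =
            "<table id='odds'><tr><th>Team</th><th>Odds</th></tr>" +
            "<tr><td>A</td><td class='odds-value'>188</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The descendant axis (//) matches whether or not a <tbody>
        // sits between the table and its rows.
        HtmlNodeCollection cells = doc.DocumentNode
            .SelectNodes("//table[@id='odds']//td[@class='odds-value']");

        // SelectNodes returns null, not an empty collection, on no match.
        if (cells == null)
        {
            Console.WriteLine("(no match)");
            return;
        }

        foreach (HtmlNode cell in cells)
            Console.WriteLine("[" + cell.InnerText + "]");
    }
}
```

The null check matters for the loop in the question as well: iterating directly over the result of SelectNodes throws when the expression matches nothing.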
By default, HtmlAgilityPack treats the form tag differently from other tags (because form tags can overlap), so you need to remove the form step from the XPath, for example: /html/body//div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]
Another way is to force HtmlAgilityPack to treat the form tag like any other tag:
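A minimal sketch of that second option, assuming it refers to the static HtmlNode.ElementsFlags dictionary: removing the "form" entry before parsing drops the special handling. The markup and the shortened path below are hypothetical stand-ins for the real page.

```csharp
using System;
using HtmlAgilityPack;

class FormFlagDemo
{
    static void Main()
    {
        // Drop the special-case flags for <form> so that its children are
        // parsed as nested under it. ElementsFlags is static, so this
        // affects every document loaded afterwards in the process.
        HtmlNode.ElementsFlags.Remove("form");

        // Hypothetical markup standing in for the real page.
        string html =
            "<html><body><form><div><table><tr><td>188</td></tr>" +
            "</table></div></form></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // With the flag removed, the path can include the form step.
        // SelectSingleNode returns null when nothing matches.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("/html/body/form/div/table/tr/td");
        Console.WriteLine(cell != null ? cell.InnerText : "(no match)");
    }
}
```

The call has to run before LoadHtml, because the flags are consulted while the document is being parsed.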