XPath problem

Posted on 2024-11-13 01:55:20

Here's a link:

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/results/2010-2011/boxscore819588.html

I'm using HTML Agility Pack and I would like to extract, say, the 188 from the 'Odds' column. My editor gives /html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7] when asked for the path. I tried that path, with and without the leading html and body steps, but none of the variants returns any results when passed to .DocumentNode.SelectNodes(). I also tried with // at the beginning (which, I assume, is the root of the document tree). What gives?

EDIT:

Code:

        WebClient client = new WebClient();
        string html = client.DownloadString(url);
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach(HtmlNode node in doc.DocumentNode.SelectNodes("/some/xpath/expression"))
        {
            Console.WriteLine("[" + node.InnerText + "]");
        }

Comments (3)

不离久伴 2024-11-20 01:55:20

When scraping sites, you can't safely rely on the exact XPath given by tools: in general these paths are too restrictive, and in fact they match nothing most of the time. The best way is to look at the HTML and pick something more resilient to changes.

Here is a piece of code that works with your example:

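    // Assumes `html` holds the page source, downloaded as in the question.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[text()='MIA']/ancestor::tr/td[7]");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }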

It outputs 188.

The way it works is:

  • select an A element whose text is "MIA"
  • find the TR element(s) that are ancestors of this A element
  • take the seventh TD of that TR element
  • read the InnerText property of that TD element
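
The steps above can be tried in isolation on a small inline table. This is only a sketch: the class name `AnchorRowExample`, the method `SeventhCells`, and the sample markup are made up for illustration and are not taken from the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AnchorRowExample
{
    // Hypothetical sample markup: one row per team, value of interest in the 7th cell.
    public const string SampleHtml =
        "<table>" +
        "<tr><td><a href='#'>MIA</a></td><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td> 188 </td></tr>" +
        "<tr><td><a href='#'>BOS</a></td><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td> 190 </td></tr>" +
        "</table>";

    // Returns the trimmed text of the 7th cell of every row containing a link labelled `team`.
    public static List<string> SeventhCells(string html, string team)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        // `team` is concatenated into the XPath for brevity; it must not contain quotes.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[text()='" + team + "']/ancestor::tr/td[7]");
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerText.Trim());
        }
        return results;
    }
}
```

Calling `AnchorRowExample.SeventhCells(AnchorRowExample.SampleHtml, "MIA")` should return a single entry, while a team that is not in the table gives an empty list instead of a null reference.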
暮年 2024-11-20 01:55:20

Try this:

/html/body/form/div/div[2]/div/table/*/tr/td[2]/div/table/*/tr[3]/td[7]

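The * matches the <tbody> element, which browsers always add to their DOM representation of a table even when it is not written in the HTML. Note that this only helps if the tree you are querying actually contains a <tbody> at that position; as far as I know, HTML Agility Pack keeps the markup as written and does not insert one.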

Other than that, it's more robust to select by ID, CSS class name or some other unique property instead of by hierarchy and document structure:

//table[@class='data']//tr[3]/td[7]
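
A minimal sketch of the attribute-based approach, run against made-up markup: the class name `data` is only a guess for the real page, and `ClassSelectorExample` and `CellText` are hypothetical names. Because `//tr` uses the descendant axis, the expression matches whether or not a tbody sits between the table and its rows.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassSelectorExample
{
    // Hypothetical markup with an explicit tbody; the real page may be structured differently.
    public const string SampleHtml =
        "<table class='data'><tbody>" +
        "<tr><th>Team</th></tr>" +
        "<tr><td>BOS</td></tr>" +
        "<tr><td>MIA</td><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td>188</td></tr>" +
        "</tbody></table>";

    // Returns the trimmed text of the first node matching `xpath`, or null if nothing matches.
    public static string CellText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the expression matches nothing.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

With this sample, `//table[@class='data']//tr[3]/td[7]` finds the cell, whereas the stricter `//table[@class='data']/tr[3]/td[7]` returns null because the rows are children of the tbody rather than of the table.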
情丝乱 2024-11-20 01:55:20

By default, HtmlAgilityPack treats the form tag differently from other tags (because form tags can overlap), so you need to leave the form step out of the XPath, for example: /html/body//div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]

Another way is to make HtmlAgilityPack treat the form tag like any other tag:

HtmlNode.ElementsFlags.Remove("form");
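
A short sketch of where that call has to go. `ElementsFlags` is a static table that is consulted while a document is parsed, so the entry should be removed before `LoadHtml` is called; the change then applies to every document parsed afterwards in the same process. The class `FormFlagExample` and the method `FormHasDivChild` are hypothetical names used only for this illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagExample
{
    // Returns true if, after removing the special-case flag, the parsed tree
    // contains a div that is a direct child of a form element.
    public static bool FormHasDivChild(string html)
    {
        // Remove the special handling first; doing this after parsing has no effect
        // on a document that is already loaded.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.SelectSingleNode("//form/div") != null;
    }
}
```

For example, `FormHasDivChild("<html><body><form><div>x</div></form></body></html>")` is expected to be true, which means a path containing a `/form/` step can match again.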