HTML Agility Pack screen scraping: XPath returns no data

Posted 2024-08-26 09:18:19

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability, and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath I see in Chrome DevTools and Firebug (Firefox) and what my C# program sees.

The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND

The code I'm currently using is pretty quick and dirty...

   //This function retrieves data from the digikey
   private static List<string> ExtractProductInfo(HtmlDocument doc)
   {
       List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
       List<string> m_unparsedProductInfo = new List<string>();

       //Base Node for part info
       string m_baseNode = @"//html[1]/body[1]/div[2]";

       //Write part info to list
       m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
       //More lines of similar form will go here for more info
       //this retrieves digikey PN

       foreach(HtmlNode node in m_unparsedProductInfoNodes)
       {
           m_unparsedProductInfo.Add(node.InnerText);
       }

       return m_unparsedProductInfo;
   }

Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes".

Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode, it only returns a div whose only significant child is "cs=####", which seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set, but leaving it out still leaves the problem that all data past div[2] is returned as NULL.

Comments (2)

以可爱出名 2024-09-02 09:18:19

Try using this XPath expression:

/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]

Using Google Chrome Developer Tools and Firebug in Firefox, it looks like the page has 'cs' and 'rf' tags before the first table. Something like:

<cs="0">
  <rf="141">
    <table>
    ...
    </table>
  </rf>
</cs>

When you parse a known HTML file and don't get the results you expect, it helps to find out what the parser actually sees. In this case I just did:

string xpath = "";

//In this case I'll get all cells and see what cell has the text "296-12602-1-ND"

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
    if (node.InnerText.Trim() == "296-12602-1-ND")
        xpath = node.XPath; //Here it is
}

Or you could debug your application after the document loads and go through each child node until you find the node you want to get the info from. If you set a breakpoint where the InnerText is found, you can walk up through the parents and then keep looking for other nodes. I usually do that by manually entering commands in a 'watch' window and navigating with the tree view to see properties, attributes, and children.
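
For readers following along in Python (the language the asker ended up using), the same "find the cell by its known text and ask where it lives" trick can be sketched with only the standard library. This is a rough analogue of reading `node.XPath` in HTML Agility Pack, not a port of it: `TextLocator`, `find_paths`, and `SAMPLE` are hypothetical names, the sample markup is made up rather than taken from the Digikey page, and the sketch assumes every start tag has a matching end tag.

```python
from html.parser import HTMLParser


class TextLocator(HTMLParser):
    """Record an XPath-like path for every element whose text equals `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []      # steps for the currently open elements, e.g. "td[1]"
        self.counts = [{}]   # per open element: children seen so far, keyed by tag name
        self.paths = []      # paths of all matching elements

    def handle_starttag(self, tag, attrs):
        # Number this element among its same-named siblings (1-based, like XPath).
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the end tag closes the innermost element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if data.strip() == self.target:
            self.paths.append("/" + "/".join(self.stack))


def find_paths(html, target):
    """Return the paths of all elements whose stripped text equals `target`."""
    locator = TextLocator(target)
    locator.feed(html)
    locator.close()
    return locator.paths


# Made-up markup with the same nesting as the path in the question.
SAMPLE = (
    "<html><body><div>header</div><div><table><tr><td>"
    "<table><tr><td>296-12602-1-ND</td></tr></table>"
    "</td></tr></table></div></body></html>"
)

if __name__ == "__main__":
    for path in find_paths(SAMPLE, "296-12602-1-ND"):
        print(path)
```

Comparing the reported path with the one copied from the browser shows which extra or missing steps the parser sees. Void elements such as an unclosed `<br>` would throw off the bookkeeping in this sketch, so real pages may need more care.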

情栀口红 2024-09-02 09:18:19

Just an update:

I switched from C# to the somewhat friendlier Python (my programming experience is asm, C, and Python; the whole OO thing was totally new) and managed to correct my XPath issues. The tag was indeed the problem, but luckily it's unique, so with a little regular expression and a removed line I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight, I'd like to hear it.
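
One likely reason the tag breaks the expression: in XPath 1.0 an element name cannot contain `=`, and `=` is the equality operator, so a step such as `/cs=0` is read as a comparison that yields a boolean rather than a node set. That matches the "doesn't evaluate to a node set" error in the question. The regular-expression cleanup mentioned above might look roughly like the sketch below. The exact regex the poster used is not shown, so this assumes the stray tags have the form `<cs=0>` ... `</cs>` and `<rf=141>` ... `</rf>`; `STRAY_TAG` and `strip_stray_tags` are hypothetical names.

```python
import re

# Assumption: the stray tags look like <cs=0> ... </cs> and <rf=141> ... </rf>.
# The \b keeps the pattern from touching longer names that merely start with cs or rf.
STRAY_TAG = re.compile(r"</?(?:cs|rf)\b[^>]*>", re.IGNORECASE)


def strip_stray_tags(html):
    """Remove the opening and closing cs/rf tags but keep everything inside them."""
    return STRAY_TAG.sub("", html)


if __name__ == "__main__":
    page = (
        "<div><cs=0><rf=141><table><tr><td>296-12602-1-ND</td></tr></table>"
        "</rf></cs></div>"
    )
    print(strip_stray_tags(page))
```

With the wrappers gone, the table sits directly under the div again, so a path like the one in the question has a chance of matching once the cleaned markup is handed to the parser.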
