HTML Agility Pack XPath question

Posted 2024-11-15 04:25:20


I am attempting to use the HTML Agility Pack to look up specific keywords on Google, then check through the linked nodes until I find my website's URL, then parse the innerHTML of the node I am on to get my Google ranking.

I am relatively new to the Agility Pack (as in, I started really looking through it yesterday), so I was hoping I could get some help with it. When I run the search below, my XPath queries fail every time, even with something as simple as SelectNodes("//*[@id='rso']"). Is this something I am doing incorrectly?

    private void GoogleScrape(string url)
    {
        string[] keys = keywordBox.Text.Split(',');
        for (int i = 0; i < keys.Count(); i++)
        {
            var raw = "http://www.google.com/search?num=100&q=";
            string search = raw + HttpUtility.UrlEncode(keys[i]);
            var webGet = new HtmlWeb();
            var document = webGet.Load(search);
            loadtimeBox.Text = webGet.RequestDuration.ToString();

            var ranking = document.DocumentNode.SelectNodes("//*[@id='rso']");

            if (ranking != null)
            {
                googleBox.Text = "Something";
            }
            else
            {
                googleBox.Text = "Fail";
            }
        }
    }


Comments (1)

疧_╮線 2024-11-22 04:25:20

    It's not the Agility Pack's fault -- it is tricky Google's. If you inspect the _text property of HtmlDocument with a debugger, you'll find that the <ol> that has id='rso' when you inspect it in a browser does not have any attributes for some reason.

    I think, in this case, you can just search by "//ol", because there is only one <ol> tag in Google's result page at the moment...
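    The suggestion above can be sketched roughly like this. Note this is a sketch under assumptions: the search URL, and the idea that organic results are anchors inside that single <ol>, reflect Google's markup at the time and are not guaranteed.

    ```csharp
    // Fallback: select the first <ol> instead of relying on id='rso'.
    // Assumes the HtmlAgilityPack NuGet package is referenced.
    using System;
    using HtmlAgilityPack;

    class OlFallback
    {
        static void Main()
        {
            var webGet = new HtmlWeb();
            var document = webGet.Load("http://www.google.com/search?num=100&q=test");

            // SelectSingleNode returns null when nothing matches, so guard it.
            var resultList = document.DocumentNode.SelectSingleNode("//ol");
            if (resultList == null)
            {
                Console.WriteLine("No <ol> found in the page.");
                return;
            }

            // Walk the anchors inside the list; the rank is the position.
            var anchors = resultList.SelectNodes(".//a[@href]");
            if (anchors == null)
                return;

            int rank = 1;
            foreach (var a in anchors)
                Console.WriteLine("{0}: {1}", rank++, a.GetAttributeValue("href", ""));
        }
    }
    ```

    The null checks matter: both SelectSingleNode and SelectNodes return null (not an empty collection) when the XPath matches nothing, which is exactly the "Fail" branch the question keeps hitting.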

    UPDATE: I've done further checks. For example when I do this:

    using (StreamReader sr = 
            new StreamReader(HttpWebRequest
              .Create("http://www.google.com/search?num=100&q=test")
              .GetResponse()
              .GetResponseStream()))
    {
        string s = sr.ReadToEnd();
        var m2 = Regex.Matches(s, "\\sid=('[^']+'|\"[^\"]+\")");
        foreach (var x in m2)
            Console.WriteLine(x);
    }
    

    The only ids that are returned are: "sflas", "hidden_modes" and "tbpr_12".

    To conclude: I've used the Html Agility Pack and it's coped pretty well even with malformed HTML (unclosed <p> and even <li> tags, etc.).
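    That resilience is easy to check with a fixed snippet; a minimal sketch, again assuming the HtmlAgilityPack package is referenced:

    ```csharp
    // Feed the parser deliberately malformed HTML (unclosed <li> and <p>
    // tags) and confirm it still builds a queryable DOM.
    using System;
    using HtmlAgilityPack;

    class MalformedHtmlDemo
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.LoadHtml("<ul><li>one<li>two<li>three</ul><p>no closing tag");

            // All three <li> elements are found even though none was closed,
            // regardless of how the parser chooses to nest them.
            var items = doc.DocumentNode.SelectNodes("//li");
            Console.WriteLine(items.Count);  // 3
        }
    }
    ```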
