XPath expression not working in HtmlAgilityPack

Posted on 2024-09-16 19:34:20

I know this may be down to my inexperience with XPath, but let me ask to make sure, because I've googled enough.

I have a website and want to get the news headings from it: www.farsnews.com (it is in Persian).

Using the FireBug and FireXPath extensions in Firefox, I extracted and tested by hand several XPath expressions that match the headings, such as:

* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]

I also tested these with the XPather extension and they seem to work well there, but when I run them in my code, SelectNodes returns null!

Any clue or hint?

Here is a chunk of the code:

listBox2.ResetText();

HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");

listBox2.Items.Add(nc.Count + " Items selected!");

foreach (HtmlAgilityPack.HtmlNode node in nc) {
    listBox2.Items.Add(node.InnerText);
}

Thanks.

早乙女 2024-09-23 19:34:20

I have tested your expressions. As Dialecticus mentioned in a comment, you have a trailing space in the class name that shouldn't be there.

//div[@class='topnewsinfotitle ']/text()

Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6

//div[@class='topnewsinfotitle']/text()

Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
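To see why the trailing space matters, here is a minimal sketch. It assumes a small, well-formed sample fragment (hypothetical markup, not the real page) and uses .NET's built-in System.Xml.XPath rather than HtmlAgilityPack, purely to show that an XPath attribute test with `=` is an exact string comparison. The class name ClassMatchDemo and the helper Count are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ClassMatchDemo
{
    // Hypothetical markup standing in for the real page.
    public const string Sample =
        "<html><body>" +
        "<div class=\"topnewsinfotitle\">Headline one</div>" +
        "<div class=\"topnewsinfotitle\">Headline two</div>" +
        "<div class=\"other\">Not a headline</div>" +
        "</body></html>";

    // Counts the elements selected by the given XPath expression.
    public static int Count(string xpath)
    {
        XDocument doc = XDocument.Parse(Sample);
        return doc.XPathSelectElements(xpath).Count();
    }

    public static void Main()
    {
        // Trailing space inside the quotes: no attribute value equals it.
        Console.WriteLine(Count("//div[@class='topnewsinfotitle ']")); // 0
        // Exact value without the space: both headline divs match.
        Console.WriteLine(Count("//div[@class='topnewsinfotitle']"));  // 2
    }
}
```

If the attribute value in the served HTML has no trailing space, the first expression can never match, no matter which XPath engine evaluates it.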

However, if the element could carry other classes as well, use this ( http://xmltools.dk/EwA8AJAW ):

//div[contains(@class, 'topnewsinfotitle')]/text()

(I see there is an encoding issue in the links I've provided; however, it shouldn't affect the meaning. For all of these XPath expressions, you can remove /text() to get the nodes instead of only the text.)
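Applied to the code in the question, the contains() expression could be used roughly like this. It is only a sketch: the class HeadlineExtractor and the method ExtractHeadlines are hypothetical names, and the HTML is passed in as a string so that the example does not depend on the live site. It also guards against the null that SelectNodes returns when nothing matches, which is what makes nc.Count fail in the original snippet.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HeadlineExtractor
{
    // Returns the trimmed text of every div whose class attribute
    // contains "topnewsinfotitle"; an empty list if none match.
    public static List<string> ExtractHeadlines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[contains(@class, 'topnewsinfotitle')]");

        var headlines = new List<string>();
        if (nodes == null)
        {
            // SelectNodes returns null rather than an empty collection
            // when the expression matches nothing.
            return headlines;
        }

        foreach (HtmlNode node in nodes)
        {
            headlines.Add(node.InnerText.Trim());
        }
        return headlines;
    }
}
```

Note that contains() is a plain substring test, so it would also match a class such as topnewsinfotitle2; if that matters, a stricter expression is needed.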

BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will perform better and, most importantly, be more bullet-proof.
