XPath expression does not work in HtmlAgilityPack
This may just be my inexperience with XPath, but let me ask to make sure, because I've already googled a lot.
I have a website that I want to get the news headings from: www.farsnews.com (it is in Persian).
Using the FireBug and FireXpath extensions in Firefox, I extracted and tested by hand several XPath expressions that match the headings, such as:
* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]
I also tested these with the XPather extension and they seem to work well, but when I test them in my code, SelectNodes returns null!
Any clue or hint?
Here is a chunk of the code:
listBox2.ResetText();
HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");
listBox2.Items.Add(nc.Count + " Items selected!");
foreach (HtmlAgilityPack.HtmlNode node in nc) {
    listBox2.Items.Add(node.InnerText);
}
Thanks.
Comments (1)
I have tested your expressions. As Dialecticus mentioned in a comment, the class value has a trailing space that shouldn't be there.
With the trailing space, the expression returns an 'empty sequence'; see the evaluation: http://xmltools.dk/EQA-ACA6
Without it, the expression returns a list of your headlines; see: http://xmltools.dk/EgA2APAj
However, if the element could carry other classes as well, use the expression evaluated here instead: http://xmltools.dk/EwA8AJAW
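One common way to write such a class-tolerant match is sketched below; it is not necessarily the exact expression behind the link. The sketch assumes HtmlAgilityPack is referenced (for example via NuGet) and runs with its default options. The sample markup, the `extra` class, and the names `ClassMatchDemo` and `CountMatches` are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class ClassMatchDemo
{
    // Hypothetical sample markup: the second div carries an extra class.
    const string SampleHtml =
        "<html><body>" +
        "<div class=\"topnewsinfotitle\">First headline</div>" +
        "<div class=\"topnewsinfotitle extra\">Second headline</div>" +
        "</body></html>";

    // By default SelectNodes returns null (not an empty collection) when
    // nothing matches, so null is treated as zero matches here.
    public static int CountMatches(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Exact comparison with a trailing space: matches neither div.
        Console.WriteLine(CountMatches("//div[@class='topnewsinfotitle ']"));

        // Exact comparison: matches only the div whose class attribute
        // is exactly this string.
        Console.WriteLine(CountMatches("//div[@class='topnewsinfotitle']"));

        // Substring test: matches both divs, but would also match a
        // class such as "topnewsinfotitle2".
        Console.WriteLine(CountMatches("//div[contains(@class,'topnewsinfotitle')]"));

        // Whole-token test (XPath 1.0 idiom): pads the normalized class
        // list with spaces and looks for the padded class name.
        Console.WriteLine(CountMatches(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' topnewsinfotitle ')]"));
    }
}
```

The plain `contains(@class, ...)` form is shorter; the `concat`/`normalize-space` form is stricter because it only matches the class name as a whole token.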
(I see there is an encoding issue in the links I've provided; however, it shouldn't affect the meaning. For all of the XPath expressions, you can remove
/text()
to get the nodes instead of only the text.)
BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will perform better and, most importantly, be more robust than scraping the HTML.
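Applied to the snippet from the question, the fix might look like the minimal sketch below. It assumes HtmlAgilityPack is referenced and used with its default options; `HeadlineReader` and `ExtractHeadlines` are hypothetical names, and the XPath is the question's third expression with the trailing space removed. Because `SelectNodes` returns null by default when nothing matches, the helper checks for null before iterating.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HeadlineReader
{
    // The question's expression without the trailing space in the class value.
    const string HeadlineXPath = ".//div[@class='topnewsinfotitle']";

    // Returns the trimmed text of every matching node,
    // or an empty list if no node matches.
    public static List<string> ExtractHeadlines(HtmlDocument doc)
    {
        var headlines = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(HeadlineXPath);
        if (nodes == null)
        {
            return headlines; // no match: avoids a NullReferenceException
        }
        foreach (HtmlNode node in nodes)
        {
            headlines.Add(node.InnerText.Trim());
        }
        return headlines;
    }

    static void Main()
    {
        // Same URL as in the question; loading it requires network access.
        HtmlDocument doc = new HtmlWeb().Load("http://www.farsnews.com");
        List<string> headlines = ExtractHeadlines(doc);
        Console.WriteLine(headlines.Count + " items selected");
        foreach (string headline in headlines)
        {
            Console.WriteLine(headline);
        }
    }
}
```

Keeping the extraction in a helper that takes an `HtmlDocument` means it can be tried against a small inline HTML string via `LoadHtml` before pointing it at the live page.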