Screen scraping with Nokogiri or Hpricot
I'm trying to get the actual value for a given XPath. I have the following code in a sample.rb file:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# On current Ruby versions open-uri is called through URI.open rather than Kernel#open
doc = Nokogiri::HTML(URI.open('http://www.changebadtogood.com/'))

desc "Trying to get the value of given xpath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error"
  end
end
Output is:
View more issues ..
When I try to get the value for a different XPath, such as /html/body/div[4]/div[3]/h1/span, I get the "error" message.
I tried this in Nokogiri. I don't know why it returns results for only a few XPaths.
I tried the same in Hpricot.
http://hpricot.com/demonstrations
I pasted my URL and XPaths, and for //*[@id="view_more"] I see the result
View more issues ..
[This text is present at the bottom of the recent issues header]
But it shows no result for /html/body/div[4]/div[3]/h1/span
For this XPath I'm expecting the result Bad.
[This is present on http://www.changebadtogood.com/ as the first header of the class="hero-unit" div.]
Your problem has to do with a poor XPath selector, and is unrelated to Nokogiri or Hpricot. Let's investigate:
From this we can see that there are only two divs that are children of the <body> element, and so div[4] fails to select one. It appears that you're trying to select the span here:
Instead of relying on the fragile markup leading up to this (indexing anonymous hierarchies of elements), use the semantic structure of the document to your advantage for a selector that is both simpler and more robust. Using either CSS or XPath syntax: