Screen scraping through nokogiri or hpricot

Posted on 2024-12-10 20:43:58

I'm trying to get the actual value at a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))
desc "Trying to get the value of given xpath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error" 
  end
end

Output is:

View more issues ..

When I try to get the value for a different XPath, such as:
/html/body/div[4]/div[3]/h1/span
I get the "error" message instead.

I tried this in Nokogiri. I don't know why it returns a result for only some XPaths.

I tried the same in Hpricot.
http://hpricot.com/demonstrations

I pasted my URL and XPaths there, and I see the result for
//*[@id="view_more"]
as
View more issues ..
[This text appears at the bottom of the recent issues header]

But it shows no result for:
/html/body/div[4]/div[3]/h1/span
For this XPath I'm expecting the result Bad.
[This appears at http://www.changebadtogood.com/ as the first heading in the class="hero-unit" div.]


绻影浮沉 2024-12-17 20:43:58

Your problem has to do with a poor XPath selector, and is unrelated to Nokogiri or Hpricot. Let's investigate:

irb:01:0> require 'nokogiri'; require 'open-uri'
#=> true
irb:02:0> doc = Nokogiri::HTML(open('http://www.changebadtogood.com/')); nil
#=> nil
irb:03:0> doc.xpath('//*[@id="view_more"]').each{ |link| puts link.content }
View more issues ..
#=> 0
irb:04:0> doc.at('#view_more').text  # Simpler version of the above.
#=> "View more issues .."
irb:05:0> doc.xpath('/html/body/div[4]/div[3]/h1/span')
#=> []
irb:06:0> doc.xpath('/html/body/div[4]')
#=> []
irb:07:0> doc.xpath('/html/body/div').length
#=> 2

From this we can see that there are only two divs that are children of the <body> element, and so div[4] fails to select one.

It appears that you're trying to select the span here:

<h1 class="landing_page_title">
  Change <span style='color: #808080;'>Bad</span> To Good
</h1>

Instead of relying on the fragile markup leading up to this (indexing into anonymous hierarchies of elements), use the semantic structure of the document to your advantage for a selector that is both simpler and more robust. Using either CSS or XPath syntax:

irb:08:0> doc.at('h1.landing_page_title > span').text
#=> "Bad"
irb:09:0> doc.at_xpath('//h1[@class="landing_page_title"]/span').text
#=> "Bad"

About the author

醉城メ夜风
