Parsing an HTML table with Hpricot (Ruby)

Posted on 2024-07-15 15:34:07

I am trying to parse an HTML table using Hpricot but am stuck: I can't select the table element with a specified id from the page.

Here is my Ruby code:

require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

doc = Hpricot(page.body)

puts doc.to_html # Here the doc contains the full HTML page

puts doc.search("//table[@id='gvw_offices']").first # This is NIL

Can anyone help me identify what's wrong with this?

孤独岁月 2024-07-22 15:34:07

Mechanize uses Hpricot internally (it's Mechanize's default parser). What's more, the page forwards search calls such as search and at to the parser, so you don't have to build an Hpricot document yourself:

require 'rubygems'
require 'mechanize'

# You don't really need this if you don't use hpricot directly
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

puts page.parser.to_html # page.parser returns the hpricot parser

puts page.at("//table[@id='gvw_offices']") # This passes through to hpricot

Also note that page.search("foo").first is equivalent to page.at("foo").
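Once the table node is found, its rows still have to be walked. Here is a minimal sketch of that step. It uses REXML from Ruby's standard distribution as a stand-in for Hpricot so that it runs without extra gems. The sample markup, the column contents, and the table_rows helper are all made up for illustration; only the table id comes from the question.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup; the real page will differ.
SAMPLE_HTML = <<~HTML
  <html><body>
    <table id="other"><tr><td>ignored</td></tr></table>
    <table id="gvw_offices">
      <tr><th>Office</th><th>Pincode</th></tr>
      <tr><td>Office A</td><td>100001</td></tr>
      <tr><td>Office B</td><td>100002</td></tr>
    </table>
  </body></html>
HTML

# Returns the data rows of the table with the given id as arrays of
# cell strings, or an empty array when no such table exists.
def table_rows(markup, table_id)
  doc = REXML::Document.new(markup)
  table = REXML::XPath.first(doc, "//table[@id='#{table_id}']")
  return [] if table.nil?

  rows = table.elements.to_a('tr').map do |row|
    row.elements.to_a('td').map { |cell| cell.text.to_s.strip }
  end
  rows.reject(&:empty?) # header rows contain only <th>, so they drop out
end

ROWS = table_rows(SAMPLE_HTML, 'gvw_offices')
ROWS.each { |office, pincode| puts "#{office}: #{pincode}" }
```

REXML only accepts well-formed markup, so on a real-world page the same loop would more likely run over the nodes returned by page.at with whichever parser Mechanize is using.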

居里长安 2024-07-22 15:34:07

Note that later versions of Mechanize (0.9.0) no longer use Hpricot by default; they use Nokogiri. To keep using Hpricot you have to specify it explicitly:

  WWW::Mechanize.html_parser = Hpricot

Just like that, with no quotes or anything around Hpricot. It won't work if you put this statement inside your own module declaration (there's probably a module-qualified name for Hpricot that would). The best place for it is the top of your file, before opening any module or class:

require 'mechanize'
require 'hpricot'

# Later versions of Mechanize no longer use Hpricot by default
# but have an attribute we can set to use it
begin
  WWW::Mechanize.html_parser = Hpricot
rescue NoMethodError
  # must be using an older version of Mechanize that doesn't
  # have the html_parser attribute - just ignore it since 
  # this older version will use Hpricot anyway
end

The rescue block ensures that if your users have an older version of Mechanize, the code won't barf on the nonexistent html_parser attribute. (Otherwise you would need to make your code depend on the latest version of Mechanize.)
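The same guard can be tried out without installing either Mechanize version. This is only a sketch: LegacyAgent, ModernAgent, FakeParser, and select_parser are hypothetical stand-ins, not part of Mechanize. One class lacks the html_parser= setter and the other has it.

```ruby
# Stand-in for an older agent class: no html_parser= setter at all.
class LegacyAgent; end

# Stand-in for a newer agent class with a settable class-level attribute.
class ModernAgent
  class << self
    attr_accessor :html_parser
  end
end

# Placeholder for the parser constant (Hpricot in the answer above).
FakeParser = Module.new

# Tries to select the parser; returns true if the setter existed,
# false if the class is too old to have it.
def select_parser(agent_class, parser)
  agent_class.html_parser = parser
  true
rescue NoMethodError
  false
end

LEGACY_RESULT = select_parser(LegacyAgent, FakeParser)
MODERN_RESULT = select_parser(ModernAgent, FakeParser)

puts "legacy setter present: #{LEGACY_RESULT}"
puts "modern setter present: #{MODERN_RESULT}"
```

Rescuing only NoMethodError keeps the guard narrow, so any other failure while setting the parser still surfaces.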

Also, in the latest version WWW::Mechanize::List was deprecated. Don't ask me why, because it totally breaks backward compatibility for statements like

page.forms.name('form1').first

which used to be a common idiom. It worked because Page#forms returned a Mechanize List, which had a "name" method. Now it returns a simple array of Forms.

I found this out the hard way, but your usage will work because you're using find, which plain arrays support (it comes from Enumerable).
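The difference is easy to reproduce with a plain array. In this sketch FormStub is a hypothetical stand-in for a Mechanize form, and the form names are made up apart from form1.

```ruby
# Hypothetical stand-in for a Mechanize form; only the name matters here.
FormStub = Struct.new(:name)

FORMS = [FormStub.new('search'), FormStub.new('form1')].freeze

# Enumerable#find works on any Array, so this style keeps working.
FOUND = FORMS.find { |f| f.name == 'form1' }

# The List-style call has nothing to dispatch to on a plain Array.
LIST_STYLE_WORKS =
  begin
    FORMS.name('form1').first
    true
  rescue NoMethodError
    false
  end

puts "found via find: #{FOUND.name}"
puts "List-style call works: #{LIST_STYLE_WORKS}"
```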

But a better method for finding the first form with a given name is Page#form so your form finding line becomes