Parsing an HTML table using Hpricot (Ruby)
I am trying to parse an HTML table using Hpricot but am stuck: I cannot select the table element with a specified id from the page.
Here is my Ruby code:
require 'rubygems'
require 'mechanize'
require 'hpricot'
agent = WWW::Mechanize.new
page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])
doc = Hpricot(page.body)
puts doc.to_html # Here the doc contains the full HTML page
puts doc.search("//table[@id='gvw_offices']").first # This is NIL
Can anyone help me identify what's wrong with this?
Comments (2)
Mechanize will use Hpricot internally (it's Mechanize's default parser). What's more, it passes the Hpricot methods on to the parser, so you don't have to build the Hpricot document yourself; you can search the page directly.
Also note that
page.search("foo").first
is equivalent to
page.at("foo")
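A minimal sketch of that point, assuming a page object like the one returned by agent.submit in the question. The helper name offices_table is made up for illustration; only at and the XPath expression come from the thread.

```ruby
# Hypothetical helper: looks up the results table on an already-fetched page.
# `page` is expected to be a Mechanize page, but any object that responds to
# #at with an XPath string will work.
def offices_table(page)
  # page.at(expr) is shorthand for page.search(expr).first
  page.at("//table[@id='gvw_offices']")
end

# Usage with the script from the question (needs the mechanize gem and
# network access, so it is left commented out here):
#   page = agent.submit(form, form.buttons[2])
#   puts offices_table(page)
```

With this approach there is no separate Hpricot(page.body) step, and require 'hpricot' is not needed as long as Mechanize loads its parser itself.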
Note that Mechanize no longer uses Hpricot by default in the later versions (0.9.0); it uses Nokogiri instead. To keep using Hpricot you have to select it explicitly by assigning it to the html_parser attribute.
Assign the constant as is, with no quotes or anything around Hpricot. There's probably a module you can specify for Hpricot, because the statement won't work if you put it inside your own module declaration. The best place for it is at the top of your class file (before opening any module or class), wrapped in a rescue block.
By using the rescue block you ensure that an older version of Mechanize won't barf on the nonexistent html_parser attribute. (Otherwise you need to make your code depend on the latest version of Mechanize.)
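The rescue idea can be sketched as follows. The helper name use_hpricot is hypothetical, and the only library detail assumed is the one named above: newer Mechanize versions expose a class-level html_parser attribute, while older ones raise NoMethodError when you try to set it.

```ruby
# Hypothetical helper: tries to select `parser` on `mechanize_class`.
# Returns true when the attribute exists and was set, false when the class
# has no html_parser= writer (an older Mechanize that uses Hpricot anyway).
def use_hpricot(mechanize_class, parser)
  mechanize_class.html_parser = parser
  true
rescue NoMethodError
  # Older version without the attribute: nothing to do.
  false
end

# Intended call, at the top of the file and outside any module or class
# (needs the mechanize and hpricot gems, so it is commented out here):
#   require 'mechanize'
#   require 'hpricot'
#   use_hpricot(WWW::Mechanize, Hpricot)
```

Rescuing only NoMethodError keeps other failures visible, such as a missing gem.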
Also in the latest version, WWW::Mechanize::List was deprecated. Don't ask me why, because it totally breaks backward compatibility for statements that chained a name lookup onto page.forms. That used to be a common idiom, and it worked because Page#forms returned a Mechanize List which had a "name" method. Now it returns a simple array of Forms.
I found this out the hard way, but your usage will work because you're using find, which is a method of Array. A better way to find the first form with a given name is Page#form. Your form-finding line can use it instead, and it works with old and new versions.
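As a rough illustration, here are the two lookups side by side. Both helper names are invented for this sketch, and it assumes Page#form takes the form name as its only argument, as described above.

```ruby
# Lookup used in the question: works because #find is a plain Array method.
def form_via_find(page, name)
  page.forms.find { |f| f.name == name }
end

# Lookup suggested above: Page#form returns the first form with that name.
def form_via_page_form(page, name)
  page.form(name)
end

# With the page from the question, either call should return the same form:
#   form = form_via_page_form(page, 'form1')
```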