How to find the href value in an "<a>" tag with Ruby
My goal is to find the first result in the Google search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code.
How can I do that? I am using these gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of attributes like this
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape the search results; you should consider using the Custom Search API instead.
Since mechanize includes nokogiri, you should skip hpricot altogether. It will slow your code down unnecessarily. You are effectively doing the same thing twice.
Instead of converting to a string with
url = site.to_s
do
url = site[0].attributes['href']
Try to use:
Watir is a reasonable choice to check the layout of a web page.
Since the input is always going to follow the same format, you could just do:
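One way this could look, as a minimal sketch using only core Ruby string matching. The helper name `extract_href` is hypothetical, and the pattern assumes the attribute is always written as href="..." with double quotes, as in the string from the question.

```ruby
# Hypothetical helper: pull the href value out of an anchor tag string.
# Assumes the attribute is written as href="..." with double quotes.
# Returns nil when no such attribute is found.
def extract_href(anchor_html)
  anchor_html[/href="([^"]*)"/, 1]
end

tag = '<a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l">' \
      '<em>Gallon</em> - Wikipedia, the free encyclopedia</a>'
puts extract_href(tag)
```

This is only as reliable as the assumption about the input format; if the markup varies (single quotes, extra whitespace, different attribute order inside the quotes), an HTML parser is the safer option.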