How to find the href attribute value in an <a> tag with Ruby

Posted on 2024-10-30 15:38:08

My goal is to find the first result in the Google search results and collect the site link, so I built this script:

require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url

I get a string like this:

url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>

But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code.
How can I do it? I am using these gems:

require 'hpricot'
require 'open-uri'
require 'mechanize'

静水深流 2024-11-06 15:38:08

You can get the value of an attribute like this:

(doc/"a")[16].attributes['href']

but I have to say that the magic number 16 seems brittle.

You are also not supposed to scrape the search results; you should consider using the Custom Search API instead.
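As a rough illustration of the Custom Search route, here is a minimal sketch that builds a request URL for the Custom Search JSON API and pulls the first link out of a response body. The helper names `search_url` and `first_result_link`, the placeholder credentials, and the trimmed sample response are all assumptions made for illustration; no network request is made, and a real response carries many more fields.

```ruby
require 'json'
require 'uri'

# Build a request URL for the Custom Search JSON API.
# api_key and engine_id are placeholders for your own credentials.
def search_url(query, api_key:, engine_id:)
  uri = URI('https://www.googleapis.com/customsearch/v1')
  uri.query = URI.encode_www_form(key: api_key, cx: engine_id, q: query)
  uri.to_s
end

# Return the "link" of the first entry under "items" in a JSON
# response body, or nil when the response has no results.
def first_result_link(json_body)
  first = JSON.parse(json_body).fetch('items', []).first
  first && first['link']
end

# Hypothetical, heavily trimmed response used only for illustration.
sample_response = <<~BODY
  {"items": [
    {"title": "Gallon - Wikipedia, the free encyclopedia",
     "link": "http://en.wikipedia.org/wiki/Gallon"}
  ]}
BODY

puts search_url('gallon', api_key: 'YOUR_API_KEY', engine_id: 'YOUR_ENGINE_ID')
puts first_result_link(sample_response)
```

Reading the link from structured JSON avoids depending on the position of an anchor in the rendered page.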

野却迷人 2024-11-06 15:38:08

Since Mechanize includes Nokogiri, you should skip Hpricot altogether. It will slow your code down unnecessarily: you are effectively doing the same thing twice.

require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)

puts search_results.links[16].href
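The fixed index 16 depends on how many navigation links happen to precede the results. One hedged alternative is to filter the hrefs instead of counting them. The sketch below works on a plain array of strings, such as the one `search_results.links.map(&:href)` would return; the helper name `first_external_link` and the "host contains google." heuristic are assumptions, and real result links may be wrapped or formatted differently.

```ruby
require 'uri'

# Return the first absolute http(s) URL whose host does not look like
# a Google host. hrefs may contain nil entries and relative paths.
def first_external_link(hrefs)
  hrefs.find do |href|
    next false if href.nil?

    begin
      uri = URI.parse(href)
    rescue URI::InvalidURIError
      next false
    end

    # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass.
    uri.is_a?(URI::HTTP) && uri.host && !uri.host.include?('google.')
  end
end

# Made-up sample of hrefs, standing in for a real results page.
hrefs = [
  '/imghp?hl=en',
  'http://www.google.co.il/intl/en/options/',
  nil,
  'http://en.wikipedia.org/wiki/Gallon',
  'http://example.com/'
]

puts first_external_link(hrefs)
```

This still relies on page structure, but it keeps working when the number of navigation links changes.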

冰魂雪魄 2024-11-06 15:38:08

Instead of converting to a string with url = site.to_s, use url = site[0].attributes['href']

一梦等七年七年为一梦 2024-11-06 15:38:08

Try using:

site = doc.search("a[@href]")[16,1]

作业与我同在 2024-11-06 15:38:08

Watir is a reasonable choice to check the layout of a web page.

require 'rubygems'
require 'watir'

#Launching browser windows and navigating to google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")

#Logging to console if a link with href = http://en.wikipedia.org/wiki/Gallon present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?

一紙繁鸢 2024-11-06 15:38:08

Since the input is always going to follow the same format, you could just do:

url.split("href=\"").last.split("\"").first
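For comparison, here is a small sketch of the same idea with a regular expression, which returns nil when no href is present instead of handing back the whole input string. The helper name `href_from` is hypothetical, and the pattern assumes a double-quoted attribute, as in the string from the question. A real HTML parser remains the more robust option.

```ruby
# Extract the value of the first double-quoted href attribute,
# or return nil when the string contains none.
def href_from(html)
  html[/href="([^"]*)"/, 1]
end

anchor = '<a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l">' \
         '<em>Gallon</em> - Wikipedia, the free encyclopedia</a>'

# Regex-based extraction.
puts href_from(anchor)

# The split-based one-liner gives the same value for this input.
puts anchor.split("href=\"").last.split("\"").first
```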
