How to properly grab URLs from XML output using Scrubyt
I am by no means a master of Ruby and am quite new to Scrubyt. I was trying out some examples from their wiki page. The example I was working on gets the search results Google returns for 'ruby', and I had the idea of grabbing the URL of each result so I could go on to fetch that page as well. The problem is that I don't know how to grab the URL properly. Here is my code:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)
The code prints out the XML data correctly (name and link), but how do I retrieve the link without the <link_url> tags that seem to get added to it? (I tried printing link_url and noticed that the tags are printed as well.) Could I do something as simple as fetch link_url, or is there a way of extracting the text from the XML content held in link_url?

This is some of the content printed by google_data.to_xml.write():
<root>
  <link_title>
    Ruby Programming Language
    <link_url>http://ruby-lang.org/</link_url>
  </link_title>
  <link_title>
    Download Ruby
    <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
  </link_title>
  <link_title>
    Ruby - The Inspirational Weight Loss Journey on the Style Network ...
    <link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
  </link_title>
  <link_title>
    Ruby (programming language) - Wikipedia, the free encyclopedia
    <link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
  </link_title>
</root>
Comments (1)
I'd think about alternatives. Scrubyt hasn't been updated in a while, and the forums have been shut down.
Mechanize can do what the Extractor does, Nokogiri can parse XML or HTML responses, and Builder can create XML (though it seems like you don't really want XML).
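If you stay with the XML for now, the URL text can be read from the parsed document rather than from its printed form. Below is a minimal sketch, assuming that to_xml returns a REXML document (the write($stdout, 1) call in the question suggests this, but I have not verified it). The sample XML is copied from the question's output, and the variable names xml, doc and urls are only illustrative.

```ruby
require 'rexml/document'

# Sample output copied from the question. In the real script, doc would be
# the object returned by google_data.to_xml (assumed to be a REXML document).
xml = <<~XML
  <root>
    <link_title>
      Ruby Programming Language
      <link_url>http://ruby-lang.org/</link_url>
    </link_title>
    <link_title>
      Download Ruby
      <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
    </link_title>
  </root>
XML

doc = REXML::Document.new(xml)

# Element#text returns the element's text content without the surrounding tags.
urls = REXML::XPath.match(doc, '//link_url').map { |el| el.text.strip }

urls.each { |url| puts url }
```

Each string in urls is then a plain URL that could be handed to whatever fetches the next page, for example a Mechanize agent if you switch to that library.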