How to properly use Scrubyt to grab URLs from the XML output

Posted on 2024-09-18 16:11:51

I am by no means a master of Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on gets the search results returned by Google when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could go ahead and fetch that page as well. The problem is that I don't know how to grab the URL properly. This is my code:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q','ruby'
  submit

  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)

The code prints the XML data correctly (name and link), but how do I retrieve the link without the <link_url> tags that seem to get added to it? (I tried printing link_url and noticed the tags are printed as well.) Could I do something as simple as fetch link_url, or is there a way to extract the text from the XML content held in link_url?

This is some of the content printed by google_data.to_xml.write():

<root>
  <link_title>
    Ruby Programming Language
    <link_url>http://ruby-lang.org/</link_url>
  </link_title>
  <link_title>
    Download Ruby
    <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
  </link_title>
  <link_title>
    Ruby - The Inspirational Weight Loss Journey on the Style Network ...
    <link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
  </link_title>
  <link_title>
    Ruby (programming language) - Wikipedia, the free encyclopedia
    <link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
  </link_title>
</root>

Comments (1)

温暖的光 2024-09-25 16:11:51

I'd think about alternatives. Scrubyt hasn't been updated in a while, and the forums have been shut down.

Mechanize can do what the Extractor does, Nokogiri can parse XML or HTML responses, and Builder can create XML (though it seems like you don't really want XML).
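
If you'd rather keep the XML you already have for now, the URLs can be pulled out of it as plain strings. Here is a minimal sketch using REXML, which ships with Ruby. It parses a trimmed copy of the output from the question rather than running the extractor, so the sample string and the variable names (xml, doc, urls) are my own, not anything Scrubyt provides:

```ruby
require 'rexml/document'

# Sample XML copied from the question's output, trimmed to two results.
xml = <<~XML
  <root>
    <link_title>
      Ruby Programming Language
      <link_url>http://ruby-lang.org/</link_url>
    </link_title>
    <link_title>
      Download Ruby
      <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
    </link_title>
  </root>
XML

doc = REXML::Document.new(xml)

# Element#text returns the text inside the element, without the tags.
urls = REXML::XPath.match(doc, '//link_url').map { |node| node.text.strip }

urls.each { |url| puts url }
```

With the real extractor, the result of google_data.to_xml would presumably take the place of doc, assuming it returns a REXML document (the write($stdout, 1) call suggests so, but I haven't checked that against a current Scrubyt build). Each string in urls could then be passed on to whatever fetches the next page.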
