How to correctly use Scrubyt to grab URLs from its XML output

Posted on 2024-09-18 16:11:51

I am by no means a master of Ruby and am quite new to Scrubyt. I was just trying out some examples found on their wiki page. The example I was working on fetches the search results Google returns when you search for 'ruby', and I had the idea of grabbing the URL of each result so I could go ahead and fetch that page as well. The problem is that I don't know how to grab the URL properly. Here is my code:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q','ruby'
  submit

  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end

google_data.to_xml.write($stdout, 1)

The code prints the XML data correctly (title and link), but how do I retrieve the link without the <link_url> tags that get added to it? (I tried printing link_url and noticed that the tags are printed as well.) Could I do something as simple as fetch link_url, or is there a way of extracting the text from the XML content held in link_url?

This is some of the output printed by google_data.to_xml.write():

<root>
  <link_title>
    Ruby Programming Language
    <link_url>http://ruby-lang.org/</link_url>
  </link_title>
  <link_title>
    Download Ruby
    <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
  </link_title>
  <link_title>
    Ruby - The Inspirational Weight Loss Journey on the Style Network ...
    <link_url>http://www.mystyle.com/mystyle/shows/ruby/index.jsp</link_url>
  </link_title>
  <link_title>
    Ruby (programming language) - Wikipedia, the free encyclopedia
    <link_url>http://en.wikipedia.org/wiki/Ruby_(programming_language)</link_url>
  </link_title>
</root>
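
For the second option (extracting the text from the XML), here is a minimal sketch using REXML. It rests on an assumption: that google_data.to_xml returns a REXML::Document, which the write($stdout, 1) call suggests but which is not confirmed here. To keep the sketch self-contained, it parses a copy of the output above instead of calling Scrubyt, and the variable names xml, doc and urls are only illustrative.

```ruby
require 'rexml/document'

# Sample copied from the printed output above. In the real script this
# would be the document returned by google_data.to_xml (assumed to be a
# REXML::Document), so the parsing step below would not be needed.
xml = <<~XML
  <root>
    <link_title>
      Ruby Programming Language
      <link_url>http://ruby-lang.org/</link_url>
    </link_title>
    <link_title>
      Download Ruby
      <link_url>http://www.ruby-lang.org/en/downloads/</link_url>
    </link_title>
  </root>
XML

doc = REXML::Document.new(xml)

# Element#text returns only the text inside the element, without the
# surrounding <link_url> tags.
urls = REXML::XPath.match(doc, '//link_url').map { |node| node.text.strip }

urls.each { |url| puts url }
```

Each entry in urls is a plain string, so it could then be passed on to whatever fetches the result page.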
