How do I get content from a website using Ruby / Rails?
I want to copy some specific content from a website using Ruby/Rails.
The content I need is inside a marquee HTML tag, split into divs.
How can I access this content using Ruby?
To be more precise: I want to use some kind of Ruby GUI (preferably Shoes).
How do I do it?
Comments (2)
This isn't really a Rails question. It's something you'd do using Ruby, then possibly display using Rails, or Sinatra or Padrino - pick your poison.
There are several different HTTP clients you can use:
Open-URI comes with Ruby and is the easiest.
Net::HTTP comes with Ruby and is the standard toolbox, but it's lower-level so you'd have to do more work.
HTTPClient and Typhoeus+Hydra are capable of threading and have both high-level and low-level interfaces.
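As a rough illustration of the standard-library route, here is a minimal sketch that fetches a page with Net::HTTP and follows a few redirects. The method name fetch_html and the redirect limit are made up for this example; they are not part of any library.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the body of an HTTP(S) page as a String.
# Follows up to `limit` redirects, then gives up.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  raise ArgumentError, "not an HTTP(S) URL: #{url}" unless uri.is_a?(URI::HTTP)

  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end
```

You would call it as fetch_html('http://example.com/') and hand the returned string to an HTML parser.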
I recommend using Nokogiri to parse the returned HTML. It's very full-featured and robust.
If you need to navigate through login screens or fill in forms before you get to the page you need to parse, then I'd recommend looking at Mechanize. It relies on Nokogiri internally so you can ask it for a Nokogiri document and parse away once Mechanize retrieves the desired URL.
If you need to deal with dynamic HTML (content generated by JavaScript), then look into the various Watir tools. They drive real web browsers, then let you access the content as seen by the browser.
Once you have the content or data you want, you can "repurpose" it into text inside a Rails page.
If I understand correctly, you want a GUI for a website scraper. If so, you might have to build one yourself.
The easiest way to scrape a website is with the Nokogiri or Mechanize gems. Basically, you give those libraries the address of the website (Nokogiri needs the HTML fetched for it, for example with Open-URI) and then use their XPath capabilities to select the text out of the DOM.
https://github.com/sparklemotion/nokogiri
https://github.com/sparklemotion/mechanize (for the documentation)