503 error when using Ruby's open-uri to access a specific site
I had been using the code below to crawl a website, but I think I might have crawled too much and gotten myself banned from the site entirely. As in, I can still access the site on my browser, but any code involving open-uri and this site throws me a 503 site unavailable error. I think this is site specific because open-uri still works fine with, say, google and facebook. Is there a workaround for this?
require 'rubygems'
require 'hpricot'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.quora.com/What-is-the-best-way-to-get-ove$
topic = doc.at('span a.topic_name span').content
puts topic
1 Answer
There are workarounds, but the best idea is to be a good citizen according to their terms.
You might want to confirm that you are following their Terms of Service.
You can set your user-agent header easily using OpenURI:
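For example, here is a minimal sketch; the user-agent string is only a placeholder, so substitute something that honestly identifies your crawler and how to contact you:

require 'open-uri'

# open-uri accepts HTTP request headers as a trailing options hash;
# the user-agent string below is a made-up example.
html = open("http://www.quora.com/",
            "User-Agent" => "MyCrawler/1.0 (contact: you@example.com)").read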
Robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its settings or they'll ban you again. Also, you might want to restrict the speed of your code by sleeping between loops, as in the sketch below.
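A rough sketch of that idea follows. It hand-rolls a deliberately simplified robots.txt check (it only reads Disallow lines under "User-agent: *" and ignores Allow and Crawl-delay), and the URL list, user-agent string, and sleep interval are placeholders; a real crawler should use a proper robots.txt library.

require 'open-uri'
require 'uri'

# Fetch robots.txt and collect the Disallow prefixes for "User-agent: *".
robots = open("http://www.quora.com/robots.txt").read
disallowed = []
applies = false
robots.each_line do |line|
  line = line.strip
  if line =~ /^User-agent:\s*(.+)$/i
    applies = ($1.strip == "*")
  elsif applies && line =~ /^Disallow:\s*(\S+)/i
    disallowed << $1
  end
end

urls = ["http://www.quora.com/some-page"]   # hypothetical list of pages to crawl
urls.each do |url|
  path = URI.parse(url).path
  # Skip anything robots.txt tells us not to fetch.
  next if disallowed.any? { |prefix| path.start_with?(prefix) }
  html = open(url, "User-Agent" => "MyCrawler/1.0").read
  # ... process html ...
  sleep 5   # throttle between requests so you don't hammer the site
end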
Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider; it's more work to write one that plays nicely with a site, but that's better than not being able to spider their site at all.
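As for caching pages locally, one minimal sketch is below; the cache directory name, the user-agent string, and the MD5-based file naming are arbitrary choices for illustration:

require 'open-uri'
require 'digest/md5'
require 'fileutils'

CACHE_DIR = "cache"   # hypothetical local directory for cached pages
FileUtils.mkdir_p(CACHE_DIR)

# Return the page body, reading from disk if we've already fetched this URL.
def fetch_with_cache(url)
  path = File.join(CACHE_DIR, Digest::MD5.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)
  html = open(url, "User-Agent" => "MyCrawler/1.0").read
  File.open(path, "w") { |f| f.write(html) }
  html
end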