503 error when using Ruby's open-uri to access a specific site
I had been using the code below to crawl a website, but I think I might have crawled too much and gotten myself banned from the site entirely. As in, I can still access the site on my browser, but any code involving open-uri and this site throws me a 503 site unavailable error. I think this is site specific because open-uri still works fine with, say, google and facebook. Is there a workaround for this?
require 'rubygems'
require 'hpricot'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.quora.com/What-is-the-best-way-to-get-ove$
topic = doc.at('span a.topic_name span').content
puts topic
1 Answer
There are workarounds, but the best idea is to be a good citizen according to their terms.
You might want to confirm that you are following their Terms of Service.
You can set your user-agent header easily using OpenURI:
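For example, here is a minimal sketch; the user-agent string is only a placeholder, so substitute something that honestly identifies your crawler and how to contact you:

require 'open-uri'

# open-uri accepts HTTP request headers as a trailing options hash;
# the user-agent string below is a made-up example.
html = open("http://www.quora.com/",
            "User-Agent" => "MyCrawler/1.0 (contact: you@example.com)").read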
Robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its settings or they'll ban you again. Also, you might want to restrict the speed of your code by sleeping between loops, as in the sketch below.
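A rough sketch of that idea follows. It hand-rolls a deliberately simplified robots.txt check (it only reads Disallow lines under "User-agent: *" and ignores Allow and Crawl-delay), and the URL list, user-agent string, and sleep interval are placeholders; a real crawler should use a proper robots.txt library.

require 'open-uri'
require 'uri'

# Fetch robots.txt and collect the Disallow prefixes for "User-agent: *".
robots = open("http://www.quora.com/robots.txt").read
disallowed = []
applies = false
robots.each_line do |line|
  line = line.strip
  if line =~ /^User-agent:\s*(.+)$/i
    applies = ($1.strip == "*")
  elsif applies && line =~ /^Disallow:\s*(\S+)/i
    disallowed << $1
  end
end

urls = ["http://www.quora.com/some-page"]   # hypothetical list of pages to crawl
urls.each do |url|
  path = URI.parse(url).path
  # Skip anything robots.txt tells us not to fetch.
  next if disallowed.any? { |prefix| path.start_with?(prefix) }
  html = open(url, "User-Agent" => "MyCrawler/1.0").read
  # ... process html ...
  sleep 5   # throttle between requests so you don't hammer the site
end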
Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider; it's more work to write one that plays nicely with a site, but that's better than not being able to spider their site at all.
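As for caching pages locally, one minimal sketch is below; the cache directory name, the user-agent string, and the MD5-based file naming are arbitrary choices for illustration:

require 'open-uri'
require 'digest/md5'
require 'fileutils'

CACHE_DIR = "cache"   # hypothetical local directory for cached pages
FileUtils.mkdir_p(CACHE_DIR)

# Return the page body, reading from disk if we've already fetched this URL.
def fetch_with_cache(url)
  path = File.join(CACHE_DIR, Digest::MD5.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)
  html = open(url, "User-Agent" => "MyCrawler/1.0").read
  File.open(path, "w") { |f| f.write(html) }
  html
end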