Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"
Is there a way to get around the following?
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; not sure why they would deny access at a certain depth.
I'm using mechanize and BeautifulSoup on Python 2.6.
Hoping for a work-around.
Comments (8)
Oh, you need to ignore robots.txt.
You can try lying about your user agent (e.g., by trying to make them believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots, such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.
A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?
The code to make a correct request:
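For example, a minimal sketch, assuming mechanize with robots.txt handling disabled and a browser-style User-Agent (the header value and target URL below are illustrative, and BeautifulSoup 3 is assumed for Python 2.6):

    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2.6

    br = mechanize.Browser()
    br.set_handle_robots(False)   # do not fetch or obey robots.txt
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                      'rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13')]

    response = br.open('http://www.barnesandnoble.com/')
    soup = BeautifulSoup(response.read())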
Mechanize automatically follows robots.txt, but it can be disabled, assuming you have permission or have thought the ethics through.
Set a flag in your browser:
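A sketch of the flag in question, assuming your mechanize.Browser instance is called br:

    br.set_handle_robots(False)   # stop mechanize from fetching and obeying robots.txt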
This ignores robots.txt.
Also, make sure you throttle your requests, so you don't put too much load on their site. (Note, this also makes it less likely that they will detect and ban you).
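One simple way to throttle, for illustration, is to sleep between requests (the URL list and the two-second delay below are arbitrary placeholders):

    import time

    for url in urls_to_fetch:      # urls_to_fetch: whatever list of pages you crawl
        page = br.open(url).read()
        # ... parse the page ...
        time.sleep(2)              # be polite: pause between requests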
The error you're receiving is not related to the user agent. By default, mechanize automatically checks robots.txt directives when you use it to navigate to a site. Use the .set_handle_robots(False) method of mechanize.Browser to disable this behavior.
Set your User-Agent header to match some real IE/FF User-Agent. Here's my IE8 user-agent string:
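A representative IE8 string (illustrative, not necessarily the exact one referred to above), set via a mechanize browser's addheaders:

    br.addheaders = [('User-agent',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)')]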
Without debating the ethics of this, you could modify the headers to look like the Googlebot, for example. Or is the Googlebot blocked as well?
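For illustration, spoofing Googlebot's published user-agent string with mechanize could look like this (whether barnesandnoble.com actually allows Googlebot is something their robots.txt would tell you):

    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')]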
As it seems, you have to do less work to bypass robots.txt; at least that's what this article says. So you might have to remove some code to ignore the filter.