Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Posted on 2024-09-01 17:54:43

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales, so I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python 2.6.

Hoping for a workaround.

Comments (8)

孤君无依 2024-09-08 17:54:43

Oh, you need to ignore the robots.txt:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip mechanize's robots.txt check

落墨 2024-09-08 17:54:43

You can try lying about your user agent (e.g., by trying to make believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc, they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?
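
To answer the "how does their robots.txt read?" question locally, the rules can be checked with the standard library's robots.txt parser. This is a minimal sketch, not part of the original answer: the `is_allowed` helper, the `my-crawler` agent name, the example.com URLs, and the sample rules are all hypothetical, and the rules are not the real barnesandnoble.com file.

```python
# Sketch: check a URL against robots.txt rules with the standard library.
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2

# Hypothetical rules for illustration only; not the real barnesandnoble.com file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/search/results"):
        url = "http://www.example.com" + path
        print("%s allowed: %s" % (path, is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", url)))
```

Running a check like this against the site's actual file shows which paths the policy excludes, which is useful to know before approaching the site owner.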

赏烟花じ飞满天 2024-09-08 17:54:43

The code to make a correct request:

import mechanize

url = 'http://www.example.com/'  # placeholder: replace with the page you want to fetch

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
resp = br.open(url)
print(resp.info())  # headers
print(resp.read())  # content

离不开的别离 2024-09-08 17:54:43

Mechanize automatically follows robots.txt, but this can be disabled, assuming you have permission or have thought the ethics through.

Set a flag in your browser:

browser.set_handle_robots(False)

This ignores robots.txt.

Also, make sure you throttle your requests so you don't put too much load on their site. (Note that this also makes it less likely that they will detect and ban you.)
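
One way to throttle is to pause for a fixed interval between consecutive requests. The sketch below is an illustration under assumptions rather than a mechanize feature: `fetch_throttled`, its parameters, the two-second default, and the stand-in fetcher are all hypothetical names and values chosen for the example.

```python
import time


def fetch_throttled(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    def fake_fetch(url):
        return "fetched " + url

    pages = fetch_throttled(
        ["http://www.example.com/a", "http://www.example.com/b"],
        fake_fetch,
        delay_seconds=0.1,
    )
    print(pages)
```

With mechanize, the `fetch` argument could be something like `lambda u: browser.open(u).read()`. The `sleep` parameter exists only so the delay can be replaced in tests.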

初心 2024-09-08 17:54:43

The error you're receiving is not related to the user agent. By default, mechanize checks robots.txt directives automatically when you use it to navigate to a site. Use the .set_handle_robots(False) method of mechanize.Browser to disable this behavior.

妄司 2024-09-08 17:54:43

Set your User-Agent header to match some real IE/FF User-Agent.

Here's my IE8 useragent string:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; AskTB5.6)

悟红尘 2024-09-08 17:54:43

Without debating the ethics of this, you could modify the headers to look like Googlebot, for example. Or is Googlebot blocked as well?

恏ㄋ傷疤忘ㄋ疼 2024-09-08 17:54:43

It seems you have to do less work to bypass robots.txt, at least according to this article. So you might have to remove some code to ignore the filter.
