Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"
Is there a way to get around the following?
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; not sure why they would deny access at a certain depth.
I'm using mechanize and BeautifulSoup on Python 2.6.
Hoping for a work-around.
Comments (8)
Oh, you need to ignore robots.txt.
You can try lying about your user agent (e.g., by trying to make them believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots, such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.
A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?
The code to make a correct request:
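For example, a minimal sketch, assuming mechanize with robots.txt handling disabled and a browser-style User-Agent (the header value and target URL below are illustrative, and BeautifulSoup 3 is assumed for Python 2.6):

    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2.6

    br = mechanize.Browser()
    br.set_handle_robots(False)   # do not fetch or obey robots.txt
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                      'rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13')]

    response = br.open('http://www.barnesandnoble.com/')
    soup = BeautifulSoup(response.read())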
Mechanize automatically follows robots.txt, but it can be disabled, assuming you have permission or have thought the ethics through.
Set a flag in your browser:
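A sketch of the flag in question, assuming your mechanize.Browser instance is called br:

    br.set_handle_robots(False)   # stop mechanize from fetching and obeying robots.txt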
This ignores robots.txt.
Also, make sure you throttle your requests, so you don't put too much load on their site. (Note, this also makes it less likely that they will detect and ban you).
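One simple way to throttle, for illustration, is to sleep between requests (the URL list and the two-second delay below are arbitrary placeholders):

    import time

    for url in urls_to_fetch:      # urls_to_fetch: whatever list of pages you crawl
        page = br.open(url).read()
        # ... parse the page ...
        time.sleep(2)              # be polite: pause between requests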
The error you're receiving is not related to the user agent. By default, mechanize automatically checks robots.txt directives when you use it to navigate to a site. Use the .set_handle_robots(False) method of mechanize.Browser to disable this behavior.
Set your User-Agent header to match some real IE/FF User-Agent. Here's my IE8 user-agent string:
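A representative IE8 string (illustrative, not necessarily the exact one referred to above), set via a mechanize browser's addheaders:

    br.addheaders = [('User-agent',
                      'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)')]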
Without debating the ethics of this, you could modify the headers to look like the Googlebot, for example. Or is the Googlebot blocked as well?
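For illustration, spoofing Googlebot's published user-agent string with mechanize could look like this (whether barnesandnoble.com actually allows Googlebot is something their robots.txt would tell you):

    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')]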
As it seems, you have to do less work to bypass robots.txt; at least that's what this article says. So you might have to remove some code to ignore the filter.