Why does a Google search return HTTP Error 403?
Consider the following Python code:
    import urllib.request

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    url_object = urllib.request.urlopen(url)
    print(url_object.read())
When this is run, an exception is raised:
    File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default
      raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
However, when the same URL is opened in a browser, the search returns results as expected. What's going on here? How can I overcome this so I can search Google programmatically?
Any thoughts?
This should do the trick.
If you want to do Google searches "properly" through a programming interface, take a look at Google APIs. Not only are these the official way of searching Google, they are also not likely to change if Google changes their result page layout.
As lacqui suggested, the Google APIs are the way they want you to make requests from code. Unfortunately, I found their documentation was aimed at people writing AJAX web pages, not at people making raw HTTP requests. I used LiveHTTP Headers to trace the HTTP requests that the samples made, and I found ddipaolo's blog post helpful.
One more thing that messed me up: they limit you to the first 64 results from a query. Usually not a problem if you are just providing web users with a search box, but not helpful if you're trying to use Google to go data mining. I guess they don't want you to go data mining using their API. That 64 number has changed over time and varies between search products.
Update: It appears they definitely do not want you to go data mining. Eventually, you get a 403 error with a link to this API access notice.
They also list other violations, but I think that's the one that triggered for me. I may have to investigate Yahoo's BOSS service. It doesn't seem to have as many restrictions.
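If you want to see what a 403 response actually says (for instance, whether its body carries a link to an access notice), you can read the error body, since `urllib.error.HTTPError` doubles as a response object. Here is a minimal sketch; the helper names `summarize_http_error` and `fetch`, and the 500-character cutoff, are my own choices for illustration and not anything taken from Google's documentation:

```python
import urllib.error
import urllib.request


def summarize_http_error(err, max_chars=500):
    """Collect the status code, reason, and start of the body from an HTTPError."""
    # HTTPError is file-like, so read() returns the body the server sent.
    body = err.read().decode("utf-8", errors="replace")
    return {"code": err.code, "reason": err.reason, "body": body[:max_chars]}


def fetch(url, timeout=10):
    """Return the response body, or a summary dict if the server answers with an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # Hand back the details instead of letting the traceback hide them.
        return summarize_http_error(err)
```

Calling `fetch(url)` with the URL from the question should then give back either the page bytes or a small dict whose `body` field shows whatever explanation the server attached to the error.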
You're doing it too often. Google has limits in place to prevent getting swamped by search bots. You can also try setting the user-agent to something that more closely resembles a normal browser.
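For the user-agent suggestion, here is a rough sketch of how a browser-like header can be attached with `urllib.request.Request`. The User-Agent string below is just an assumed example value, and `build_search_request` is a hypothetical helper name; whether Google accepts the request still depends on their limits and terms, so treat this as an illustration of the mechanism rather than a guaranteed fix:

```python
import urllib.parse
import urllib.request

# Assumed example value; substitute the string your own browser sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_search_request(query, user_agent=BROWSER_USER_AGENT):
    """Build a Request for the search URL from the question with a custom User-Agent."""
    # urlencode escapes the query so spaces and punctuation are safe in the URL.
    params = urllib.parse.urlencode({"hl": "en", "safe": "off", "q": query})
    url = "http://www.google.com/search?" + params
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the prepared request and return the raw response bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Usage would be something like `fetch(build_search_request("Monkey"))`. By default urllib identifies itself as `Python-urllib/<version>`, which is easy for a server to single out, so overriding the header is the part that matters here.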