Why does Google Search return HTTP Error 403?

Posted on 2024-07-14 12:34:02

Consider the following Python code:

    import urllib.request

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    url_object = urllib.request.urlopen(url)
    print(url_object.read())

When this is run, an exception is raised:

File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default
   raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

However, when this is put into a browser, the search returns as expected. What's going on here? How can I overcome this so I can search Google programmatically?

Any thoughts?

Comments (4)

伤感在游骋 2024-07-21 12:34:02

This should do the trick:

    import urllib.request  # Python 3; on Python 2 the same calls live in urllib2

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    headers = {'User-Agent': user_agent}

    # Assemble the request with a browser-like User-Agent header
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    data = response.read()  # the response body, as bytes

把人绕傻吧 2024-07-21 12:34:02

If you want to do Google searches "properly" through a programming interface, take a look at Google APIs. Not only are these the official way of searching Google, they are also not likely to change if Google changes their result page layout.
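As a rough illustration of the API route, here is a minimal sketch. It assumes the Custom Search JSON API, which is one of Google's search APIs; the answer above does not say which API it has in mind. The API key and search engine ID are credentials you would have to create yourself, and `build_search_url` and `search` are hypothetical helper names used only for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Custom Search JSON API endpoint.
# The answer above only says "Google APIs" and names no specific one.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Return the request URL for one search query."""
    params = urllib.parse.urlencode({"key": api_key, "cx": engine_id, "q": query})
    return ENDPOINT + "?" + params


def search(query, api_key, engine_id):
    """Run the query and return a list of (title, link) pairs."""
    url = build_search_url(query, api_key, engine_id)
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # The "items" key is absent when the query has no results.
    return [(item["title"], item["link"]) for item in payload.get("items", [])]
```

Calling `search("Monkey", api_key, engine_id)` with real credentials would return structured results, so nothing depends on the layout of the HTML results page.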

霓裳挽歌倾城醉 2024-07-21 12:34:02

As lacqui suggested, the Google APIs are the way they want you to make requests from code. Unfortunately, I found their documentation was aimed at people writing AJAX web pages, not making raw HTTP requests. I used LiveHTTP Headers to trace the HTTP requests that the samples made, and I found ddipaolo's blog post helpful.

One more thing that messed me up: they limit you to the first 64 results from a query. Usually not a problem if you are just providing web users with a search box, but not helpful if you're trying to use Google to go data mining. I guess they don't want you to go data mining using their API. That 64 number has changed over time and varies between search products.

Update: It appears they definitely do not want you to go data mining. Eventually, you get a 403 error with a link to this API access notice:

"Please review the Terms of Use for the API(s) you are using (linked in the right sidebar) and ensure compliance. It is likely that we blocked you for one of the following Terms of Use violations: We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action."

They also list other violations, but I think that's the one that triggered for me. I may have to investigate Yahoo's BOSS service. It doesn't seem to have as many restrictions.
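To see what the server actually sends back with a 403, the exception can be caught and inspected, since `urllib.error.HTTPError` carries the status code and the response body. A minimal sketch; `fetch` and its `opener` parameter are hypothetical names introduced here, and the parameter exists only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (status, body). On an HTTP error, return the error's code and body."""
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, and its body
        # often explains why the request was refused.
        return err.code, err.read()
```

Printing the body returned alongside a 403 status is a quick way to find a notice like the one quoted above.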

牵你的手,一向走下去 2024-07-21 12:34:02

You're doing it too often. Google has limits in place to prevent getting swamped by search bots. You can also try setting the user-agent to something that more closely resembles a normal browser.
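Both points can be checked from code. By default `urllib` announces itself with a `Python-urllib/x.y` User-Agent, which a server can easily tell apart from a browser, and a short pause between requests keeps the request rate down. A minimal sketch; `default_user_agent` and `Throttle` are hypothetical names, and any interval you pick is an illustrative choice, not a limit published by Google.

```python
import time
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib sends when none is set explicitly."""
    headers = dict(urllib.request.build_opener().addheaders)
    return headers["User-agent"]


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Creating `throttle = Throttle(2.0)` and calling `throttle.wait()` before each `urlopen` call spaces requests at least two seconds apart.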
