Avoiding 503 errors with urllib2

I'm new to web scraping with Python, so I don't know if I'm doing this right.

I'm using a script that calls BeautifulSoup to parse the URLs from the first 10 pages of a Google search. Tested with stackoverflow.com, it worked just fine out of the box. Then I tested with another site a few times, trying to see whether the script really worked with higher Google page requests, and it 503'd on me. I switched to another URL to test; it worked for a couple of low-page requests, then also 503'd. Now every URL I pass to it is 503'ing. Any suggestions?

import sys      # Used to add the BeautifulSoup folder to the import path
import urllib2  # Used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, the BeautifulSoup folder sits at the level of this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create an opener with a Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open each page & generate soup
    ### The "start" variable is used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Google appears to put result URLs in <cite> tags,
        ### so for each <cite> tag on each page, print its contents (the URL).
        for cite in soup.findAll('cite'):
            print cite.text
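
For reference, urllib2 surfaces a 503 as an urllib2.HTTPError, so one way to at least observe and survive the error is to catch it and back off before retrying. A minimal sketch; the helper name, retry count, and delay are arbitrary assumptions, not part of the original script:

import time
import urllib2

def open_with_retry(opener, url, retries=3, delay=30):
    """Open a URL, backing off and retrying if the server returns 503."""
    for attempt in range(retries):
        try:
            return opener.open(url)
        except urllib2.HTTPError, e:
            if e.code != 503:
                raise           # a different HTTP error; don't retry
            time.sleep(delay)   # wait before the next attempt
    raise urllib2.HTTPError(url, 503, 'Service Unavailable after retries',
                            None, None)

With this, the loop body would call page = open_with_retry(opener, url) instead of opener.open(url). Note, though, that a persistent 503 from Google usually means the rate limiting described in the replies below, which no amount of retrying will fix.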


Comments (2)

蓬勃野心 2024-11-26 03:33:07

Automated querying is not permitted by the Google Terms of Service.
See this article for information:
Unusual traffic from your computer
and also the Google Terms of Service

潇烟暮雨 2024-11-26 03:33:07

As Ettore said, scraping the search results is against our ToS. However, check out the WebSearch API, specifically the bottom section of the documentation, which should give you a hint about how to access the API from non-JavaScript environments.
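
For illustration, a minimal sketch of that approach, assuming the JSON REST endpoint of the (since-deprecated) Google AJAX Search API that the WebSearch documentation describes; the endpoint URL and the responseData/results field names are assumptions to check against the docs:

import json
import urllib
import urllib2

# Assumed endpoint and response layout from the old AJAX Search API docs:
# the JSON reply nests the hits under responseData -> results.
query = urllib.urlencode({'v': '1.0', 'q': 'site:stackoverflow.com', 'start': '0'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?' + query
response = urllib2.urlopen(url)
data = json.loads(response.read())
for result in data['responseData']['results']:
    print result['url']

Unlike scraping the HTML result pages, this returns structured JSON and stays within the documented interface, so there is no <cite>-tag parsing and far less risk of being rate-limited.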
