Avoiding 503 errors with urllib2
I'm new to web scraping with Python, so I don't know if I'm doing this right.
I'm using a script that calls BeautifulSoup to parse the URLs from the first 10 pages of a Google search. Tested with stackoverflow.com, it worked fine out of the box. Then I tested another site a few times, trying to see if the script really handled higher-page Google requests, and it 503'd on me. I switched to another URL to test; it worked for a couple of low-page requests, then also 503'd. Now every URL I pass to it is 503'ing. Any suggestions?
import sys      # Used to add the BeautifulSoup folder to the import path
import urllib2  # Used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder at the level of this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with a Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### The "start" variable is used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### It looks like Google puts result URLs in <cite> tags,
        ### so for each <cite> tag on each page, print its contents (the URL).
        for cite in soup.findAll('cite'):
            print cite.text
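
For reference, the 503 comes back from urllib2 as an HTTPError exception rather than as a page, so the loop above dies with a traceback on the first blocked request. A minimal sketch of catching it and pausing between requests; the 5-second delay is an arbitrary guess, not a documented limit:

import time
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for start in range(0, 10):
    url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
    try:
        page = opener.open(url)
    except urllib2.HTTPError, e:
        if e.code == 503:
            # Google signals blocking/rate limiting with a 503 status.
            print "Got a 503 on", url, "- stopping rather than retrying"
            break
        raise
    # ... parse `page` with BeautifulSoup as above ...
    time.sleep(5)  # arbitrary pause between requests, not a documented limit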
2 Answers
Automated querying is not permitted by Google's Terms of Service.
See this article for information:
Unusual traffic from your computer
and also the Google Terms of Service
As Ettore said, scraping the search results is against our ToS. However, check out the WebSearch API, specifically the bottom section of the documentation, which should give you a hint about how to access the API from non-JavaScript environments.
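
For illustration, that bottom section of the Web Search docs described a plain REST endpoint returning JSON, so it could be called from Python without any JavaScript. A minimal sketch, assuming the endpoint and response field names from the (since-deprecated) AJAX Search API documentation:

import json
import urllib
import urllib2

# Build the query string for the old AJAX Search REST endpoint.
query = urllib.urlencode({'v': '1.0', 'q': 'site:stackoverflow.com'})
url = "http://ajax.googleapis.com/ajax/services/search/web?" + query

response = urllib2.urlopen(url)
data = json.loads(response.read())

# Per the old docs, results lived under responseData.results,
# each carrying a "url" field among others.
for result in data['responseData']['results']:
    print result['url']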