Python - Simple way to scrape Google and download the top N hits (the entire .html documents) for a given search?

Is there an easy way to scrape Google and write the text (just the text) of the top N (say, 1000) .html (or whatever) documents for a given search?

As an example, imagine searching for the phrase "big bad wolf" and downloading just the text from the top 1000 hits -- i.e., actually downloading the text from those 1000 web pages (but just those pages, not the entire site).

I'm assuming this would use the urllib2 library? I use Python 3.1 if that helps.

Comments (3)

时光是把杀猪刀 2024-10-29 19:20:35

Check out BeautifulSoup for scraping the content out of web pages. It is supposed to be very tolerant of broken web pages, which helps because not all results are well formed. So you should be able to do something like the following:
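
A minimal sketch of that idea, assuming Python 3 and the bs4 package (Python 2's urllib2 became urllib.request in Python 3); the URL at the bottom is just a placeholder:

    import urllib.request
    from bs4 import BeautifulSoup

    def page_text(url):
        """Download one page and return just its visible text."""
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read()
        soup = BeautifulSoup(html, "html.parser")
        # Drop script and style blocks so only readable text remains.
        for tag in soup(["script", "style"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)

    print(page_text("https://example.com/"))  # placeholder URL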

痴梦一场 2024-10-29 19:20:35

The official way to get results from Google programmatically is to use Google's Custom Search API. As icktoofay comments, other approaches (such as directly scraping the results or using the xgoogle module) break Google's terms of service. Because of that, you might want to consider using the API from another search engine, such as the Bing API or Yahoo!'s service.
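
As a rough sketch of the API route using only the standard library: the Custom Search JSON API takes an API key and a search-engine ID (cx), both of which are placeholders below that you would have to create in Google's developer console, and each request returns at most 10 results:

    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"         # placeholder: create one in the Google console
    CX = "YOUR_SEARCH_ENGINE_ID"     # placeholder: ID of your custom search engine

    def google_search(query, start=1):
        """Fetch one page (up to 10 items) of results from the Custom Search JSON API."""
        params = urllib.parse.urlencode(
            {"key": API_KEY, "cx": CX, "q": query, "start": start})
        url = "https://www.googleapis.com/customsearch/v1?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return [item["link"] for item in data.get("items", [])]

    print(google_search("big bad wolf"))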

感受沵的脚步 2024-10-29 19:20:35

As mentioned, scraping Google violates their TOS. That said, that's probably not the answer you're looking for.

There's a PHP script available that does a perfect job of scraping Google: http://google-scraper.squabbel.com/ Just give it a keyword and the number of results you want, and it will return all the results for you. Then parse out the returned URLs, fetch the HTML source with urllib or curl, and you're done.
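
For the second half of that workflow, a minimal urllib sketch that downloads the raw HTML for a list of already-collected result URLs (the URLs below are placeholders):

    import urllib.request

    urls = ["https://example.com/", "https://example.org/"]  # placeholder result URLs

    pages = {}
    for url in urls:
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                pages[url] = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            # Some hits will inevitably time out or 404; just skip them.
            print("skipping", url, "-", exc)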

You also really shouldn't attempt to scrape Google unless you have more than 100 proxy servers, though. They'll easily ban your IP temporarily after a few attempts.
