What is the appropriate approach to data-mining the total number of results of a keyword search?
newbie programmer and lurker here, hoping for some sensible advice. :)
Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code:
import urllib2
from BeautifulSoup import BeautifulStoneSoup
Appid = '...'  # My Appid
query = '...'  # My query
soup = BeautifulStoneSoup(urllib2.urlopen("http://api.search.live.net/xml.aspx?Appid=" + Appid + "&query=" + query + "&sources=web"))
totalResults = soup.find('web:total').text
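One detail I'd add for reuse across arbitrary terms (an assumption on my part, not something from the API docs): the query probably needs to be URL-encoded, e.g. with urllib.quote_plus, so terms containing spaces or other special characters still build a valid URL. A minimal sketch:

import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'  # My Appid
query = 'example search term'  # a term with spaces, for illustration

# Same request as above, but with the query URL-encoded before it is
# concatenated into the request URL.
url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
       "&query=" + urllib.quote_plus(query) + "&sources=web")
soup = BeautifulStoneSoup(urllib2.urlopen(url))
totalResults = soup.find('web:total').text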
So I'd like to do this across a few thousand search terms, and I was wondering:
- whether doing this request a thousand times would be construed as hammering the server,
- what steps I should take to avoid hammering said servers (what are the best practices?), and
- whether there is a cheaper (data) way to do this using any of the major search engine APIs.
It just seems unnecessarily expensive to pull down all that data just to get one number per keyword, and I was wondering if I'd missed anything.
FWIW, I did some homework and tried the Google Search API (deprecated) and Yahoo's BOSS API (soon to be deprecated and replaced with a paid service) before settling on the Bing API. I understand that directly scraping a page is considered poor form, so I'll pass on scraping search engines directly.
2 Answers
There are three approaches I can think of that have helped previously when I had to do large scale URL resolution. One of them is to issue requests through a proxy service, setting http_proxy to redirect all requests to said service. This proxy service will then iterate over a set of network interfaces and issue rate-limited requests. You can use Twisted for that.
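This isn't the Twisted-based proxy described above, just a minimal sketch of the client-side rate-limiting idea in the same plain urllib2 style as the question; the half-second minimum interval and the helper name are illustrative assumptions, not figures from Bing's documentation:

import time
import urllib2

MIN_INTERVAL = 0.5     # seconds between requests, i.e. at most ~2 requests/second (assumed figure)
_last_request = [0.0]  # mutable holder so the helper can remember the last request time

def rate_limited_fetch(url):
    # Sleep if the previous request was issued less than MIN_INTERVAL seconds ago,
    # then perform the request and record when it was made.
    elapsed = time.time() - _last_request[0]
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request[0] = time.time()
    return urllib2.urlopen(url).read()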
With regard to your question 1, Bing has an API Basics PDF file that summarizes the terms and conditions in human-readable form. The "What you must do" section includes a statement limiting how many queries per second you may issue.
If this is just a one-off script, you don't need to do anything more complex than just adding a sleep between making requests, so that you're making only a couple of requests a second. If the situation is more complex, e.g. these requests are being made as part of a web service, the suggestions in Mahmoud Abdelkader's answer should help you.
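To make the sleep suggestion concrete, here is a minimal sketch in the same Python 2 / BeautifulSoup style as the question; the one-second pause and the example keyword list are assumptions for illustration only:

import time
import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'              # your AppId (placeholder)
keywords = ['foo', 'bar']  # the few thousand search terms would go here

totals = {}
for kw in keywords:
    url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
           "&query=" + urllib.quote_plus(kw) + "&sources=web")
    soup = BeautifulStoneSoup(urllib2.urlopen(url))
    totals[kw] = soup.find('web:total').text
    time.sleep(1)  # stay well under a couple of requests per second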