What is an appropriate method for data-mining the total number of search results for a keyword?



newbie programmer and lurker here, hoping for some sensible advice. :)

Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = 'MY_APP_ID'   # placeholder: your Bing AppId
query = 'my query'    # placeholder: your search term

url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
       "&query=" + urllib.quote(query) + "&sources=web")
soup = BeautifulStoneSoup(urllib2.urlopen(url))
totalResults = soup.find('web:total').text   # total result count for the query

So I'd like to do this across a few thousand search terms, and was wondering

  1. whether doing this request a thousand times would be construed as hammering the server,
  2. what steps I should take to avoid hammering said servers (what are the best practices?), and
  3. whether there is a cheaper (data) way to do this using any of the major search engine APIs?

It just seems unnecessarily expensive to pull down all that data just to extract one number per keyword, and I was wondering if I've missed anything.

FWIW, I did some homework and tried the Google Search API (deprecated) and Yahoo's BOSS API (soon to be deprecated and replaced with a paid service) before settling on the Bing API. I understand direct scraping of a page is considered poor form, so I'll pass on scraping search engines directly.

2 Answers

桃扇骨 2024-10-28 05:01:20


There are three approaches I can think of that have helped previously when I had to do large-scale URL resolution.

  1. HTTP Pipelining (another snippet here)
  2. Rate-limiting requests per IP (i.e., each IP can only issue 3 requests per second); see the sketch after this list. Some suggestions can be found here: How to limit rate of requests to web services in Python?
  3. Issuing requests through an internal proxy service, using http_proxy to redirect all requests to said service. This proxy service will then iterate over a set of network interfaces and issue rate-limited requests. You can use Twisted for that.
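
As a minimal sketch of option 2 (the RateLimiter class and the 3-requests-per-second figure are illustrative assumptions, not any particular library's API):

import time

class RateLimiter(object):
    """Space successive calls at least min_interval seconds apart."""
    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.last_call = 0.0

    def wait(self):
        # Sleep only as long as needed to honor the interval.
        elapsed = time.time() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.time()

limiter = RateLimiter(max_per_second=3)   # the 3 req/s example above
for query in ['term one', 'term two']:    # placeholder search terms
    limiter.wait()
    # ... issue the urllib2 request for this query here ...
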
禾厶谷欠 2024-10-28 05:01:20


With regard to your question 1, Bing has an API Basics PDF file that summarizes the terms and conditions in human-readable form. Its "What you must do" section includes the following statement:

  Restrict your usage to less than 7 queries per second (QPS) per IP address. You may be permitted to exceed this limit under some conditions, but this must be approved through discussion with [email protected].

If this is just a one-off script, you don't need to do anything more complex than just adding a sleep between making requests, so that you're making only a couple of requests a second. If the situation is more complex, e.g. these requests are being made as part of a web service, the suggestions in Mahmoud Abdelkader's answer should help you.
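
For the one-off case, here is a minimal sketch of that sleep approach, reusing the request code from the question (Appid and the term list are placeholders; sleeping 0.5 s keeps you at about 2 QPS, comfortably under the 7 QPS limit quoted above):

import time
import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = 'MY_APP_ID'                # placeholder: your Bing AppId
terms = ['term one', 'term two']   # placeholder: your few thousand terms

totals = {}
for term in terms:
    url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
           "&query=" + urllib.quote(term) + "&sources=web")
    soup = BeautifulStoneSoup(urllib2.urlopen(url))
    totals[term] = soup.find('web:total').text
    time.sleep(0.5)                # ~2 requests/second, well under 7 QPS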
