What is the appropriate approach to data-mining the total number of results of a keyword search?
newbie programmer and lurker here, hoping for some sensible advice. :)
Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code:
import urllib2
from BeautifulSoup import BeautifulStoneSoup
Appid = '...'  # My Appid
query = '...'  # My query
soup = BeautifulStoneSoup(urllib2.urlopen("http://api.search.live.net/xml.aspx?Appid=" + Appid + "&query=" + query + "&sources=web"))
totalResults = soup.find('web:total').text
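One detail I'd add for reuse across arbitrary terms (an assumption on my part, not something from the API docs): the query probably needs to be URL-encoded, e.g. with urllib.quote_plus, so terms containing spaces or other special characters still build a valid URL. A minimal sketch:

import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'  # My Appid
query = 'example search term'  # a term with spaces, for illustration

# Same request as above, but with the query URL-encoded before it is
# concatenated into the request URL.
url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
       "&query=" + urllib.quote_plus(query) + "&sources=web")
soup = BeautifulStoneSoup(urllib2.urlopen(url))
totalResults = soup.find('web:total').text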
So I'd like to do this across a few thousand search terms, and I was wondering:
- whether doing this request a thousand times would be construed as hammering the server,
- what steps I should take to avoid hammering said servers (what are the best practices?), and
- whether there is a cheaper (data) way to do this using any of the major search engine APIs.
It just seems unnecessarily expensive to pull down all that data just to get one number per keyword, and I was wondering if I'd missed anything.
FWIW, I did some homework and tried the Google Search API (deprecated) and Yahoo's BOSS API (soon to be deprecated and replaced with a paid service) before settling on the Bing API. I understand that directly scraping a page is considered poor form, so I'll pass on scraping search engines directly.
2 Answers
There are three approaches I can think of that have helped previously when I had to do large scale URL resolution. One of them is to issue requests through a proxy service, setting http_proxy to redirect all requests to said service. This proxy service will then iterate over a set of network interfaces and issue rate-limited requests. You can use Twisted for that.
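This isn't the Twisted-based proxy described above, just a minimal sketch of the client-side rate-limiting idea in the same plain urllib2 style as the question; the half-second minimum interval and the helper name are illustrative assumptions, not figures from Bing's documentation:

import time
import urllib2

MIN_INTERVAL = 0.5     # seconds between requests, i.e. at most ~2 requests/second (assumed figure)
_last_request = [0.0]  # mutable holder so the helper can remember the last request time

def rate_limited_fetch(url):
    # Sleep if the previous request was issued less than MIN_INTERVAL seconds ago,
    # then perform the request and record when it was made.
    elapsed = time.time() - _last_request[0]
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request[0] = time.time()
    return urllib2.urlopen(url).read()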
With regard to your question 1, Bing has an API Basics PDF file that summarizes the terms and conditions in human-readable form. The "What you must do" section includes a statement limiting how many queries per second you may issue.
If this is just a one-off script, you don't need to do anything more complex than just adding a sleep between making requests, so that you're making only a couple of requests a second. If the situation is more complex, e.g. these requests are being made as part of a web service, the suggestions in Mahmoud Abdelkader's answer should help you.
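To make the sleep suggestion concrete, here is a minimal sketch in the same Python 2 / BeautifulSoup style as the question; the one-second pause and the example keyword list are assumptions for illustration only:

import time
import urllib
import urllib2
from BeautifulSoup import BeautifulStoneSoup

Appid = '...'              # your AppId (placeholder)
keywords = ['foo', 'bar']  # the few thousand search terms would go here

totals = {}
for kw in keywords:
    url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
           "&query=" + urllib.quote_plus(kw) + "&sources=web")
    soup = BeautifulStoneSoup(urllib2.urlopen(url))
    totals[kw] = soup.find('web:total').text
    time.sleep(1)  # stay well under a couple of requests per second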