MapReduce in the Cloud
Besides Amazon Elastic MapReduce, what other options do I have for processing large amounts of data?
5 Answers
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP (Community Technology Preview). You can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.
Besides that, you can also try Google BigQuery, although you will first have to move your data into Google's proprietary storage and then run BigQuery on it. Keep in mind that BigQuery is based on Dremel, which is similar to MapReduce but faster thanks to its column-oriented query processing.
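To make the comparison concrete, here is a toy pure-Python sketch of the map/shuffle/reduce model that all of these MapReduce-style services execute at scale. This is only an illustration of the programming model, not any provider's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # → 2
```

The services differ mainly in how they distribute these phases over many machines; Dremel/BigQuery instead scans columns directly, which is why it can skip the shuffle for many aggregations.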
Another option is to use Mortar Data: they use Python and Pig to make jobs easy to write and the results easy to visualize. I found it very interesting; have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good. The article below covers the field in three groups:
- Full-on distributions
- HDFS alternatives
- Hadoop MapReduce alternatives
See: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If you want to process large amounts of data in real time (Twitter feeds, click streams from a website, etc.) using a cluster of machines, check out Storm, which was recently open-sourced by Twitter.
Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.
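Batch Hadoop jobs do not have to be written in Java, either: Hadoop Streaming runs any executable that reads lines from stdin and writes lines to stdout as the mapper and reducer. A sketch of that style in Python (the sample input and the local simulation are illustrative, not the actual job submission):

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit one tab-separated "word<TAB>1" pair per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # Reduce step: Hadoop sorts mapper output by key before the
    # reducer sees it, so equal words arrive grouped together.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv.split("\t", 1)[0]):
        total = sum(int(kv.split("\t", 1)[1]) for kv in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # Local simulation of the streaming pipeline: mapper | sort | reducer.
    sample = ["the quick brown fox", "the fox jumps"]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

On a real cluster the same two functions would be split into two scripts and wired together with the hadoop-streaming jar; the sorted-grouping contract is the part the framework guarantees.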
Brisk from DataStax, as mentioned above, is quite unique in that you can use MapReduce parallel processing on live data.
There are other efforts, such as Hadoop Online, which allows processing with pipelining.
Google BigQuery is obviously another option: if you have CSV (delimited) records, you can slice and dice the data without any setup. It is extremely simple to use, but it is a premium service where you pay by the number of bytes processed (the first 100 GB per month are free, though).
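The kind of slicing BigQuery does over delimited records is essentially SQL aggregation over columns. A small local sketch of the same idea, using hypothetical column names for illustration:

```python
import csv
import io
from collections import Counter

# Hypothetical CSV of traffic records: user, country, bytes.
data = io.StringIO(
    "user,country,bytes\n"
    "alice,US,120\n"
    "bob,DE,300\n"
    "carol,US,80\n"
)

# Equivalent in spirit to:
#   SELECT country, SUM(bytes) FROM table GROUP BY country
totals = Counter()
for row in csv.DictReader(data):
    totals[row["country"]] += int(row["bytes"])

print(dict(totals))  # → {'US': 200, 'DE': 300}
```

BigQuery runs this sort of query over terabytes by scanning only the referenced columns, which is where the billing by bytes processed comes from.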
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources on setting up such a cluster.
However, this option is less cost-effective than Amazon Elastic MapReduce, unless you have lots of jobs to run throughout the day, keeping your cluster fairly busy.
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogeneous hardware into a cluster with decent computing power: the kind that can live in a rack in your server room. Considering that the older hardware lying around is already paid for, the only costs of getting such a cluster going are new drives, and perhaps enough memory sticks to maximize the capacity of those boxes. The cost-effectiveness of this approach is then much better than Amazon's. The only caveat is whether you have the bandwidth needed to pull all of your data into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/