Distributed data aggregation, querying, and filtering: are there alternatives to Hadoop/MapReduce? (MR is too slow)
We're planning to put a lot of metric data into some sort of NoSQL database, probably Cassandra, maybe something else, spread across several servers.
We want to run calculations over the data in a MapReduce style (aggregate the data on the machine where it lives, then combine the results).
I built a POC using Cassandra, Hadoop, and MapReduce. The overhead of starting the MapReduce jobs and getting the results back was too high for our needs.
Before we roll our own, are there any other distributed Java frameworks out there that emphasize performance?
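To make the "aggregate where the data lives, then combine" pattern concrete, here is a minimal single-JVM sketch in plain Java. It is not tied to Cassandra, Hadoop, or any framework mentioned in this thread: each inner list stands in for the metrics held by one server, and the names `LocalAggregateSketch`, `Partial`, `aggregateLocally`, and `aggregate` are hypothetical, chosen only for illustration. In a real deployment the local step would run on each node and only the small `Partial` objects would cross the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration: in-memory lists simulate per-server partitions.
class LocalAggregateSketch {

    /** Small, mergeable partial result: cheap to ship from a node to the coordinator. */
    static final class Partial {
        final long count;
        final double sum;
        final double min;
        final double max;

        Partial(long count, double sum, double min, double max) {
            this.count = count;
            this.sum = sum;
            this.min = min;
            this.max = max;
        }

        /** Identity element for merge(). */
        static Partial empty() {
            return new Partial(0, 0.0, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
        }

        /** Folds one raw value into this partial (the "map side" of the pattern). */
        Partial add(double v) {
            return new Partial(count + 1, sum + v, Math.min(min, v), Math.max(max, v));
        }

        /** Combines two partials (the "reduce side"); associative and commutative. */
        Partial merge(Partial o) {
            return new Partial(count + o.count, sum + o.sum,
                               Math.min(min, o.min), Math.max(max, o.max));
        }

        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** Runs on the node that owns the data; returns only the small partial. */
    static Partial aggregateLocally(List<Double> values) {
        Partial p = Partial.empty();
        for (double v : values) {
            p = p.add(v);
        }
        return p;
    }

    /** Coordinator: starts one local aggregation per partition, then merges the results. */
    static Partial aggregate(List<List<Double>> partitions) {
        List<CompletableFuture<Partial>> futures = new ArrayList<>();
        for (List<Double> partition : partitions) {
            futures.add(CompletableFuture.supplyAsync(() -> aggregateLocally(partition)));
        }
        Partial total = Partial.empty();
        for (CompletableFuture<Partial> f : futures) {
            total = total.merge(f.join());
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0),   // "server A"
                Arrays.asList(10.0, 20.0),      // "server B"
                Arrays.asList(4.0));            // "server C"
        Partial r = aggregate(partitions);
        System.out.println("count=" + r.count + " sum=" + r.sum
                + " min=" + r.min + " max=" + r.max + " mean=" + r.mean());
    }
}
```

The key design point is that `merge` is associative and commutative, so partial results can be combined in any order as they arrive. Aggregates that do not decompose this way (an exact median, for example) would need a different approach.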
Comments (4)
Look at Oracle Coherence, a distributed cache that allows one to partition data among VMs, aggregate and calculate in parallel, and scale horizontally.
Take a look at Storm.
From documentation:
Before we go roll our own, are there any other distributed java frameworks out there that emphasize performance?
- Every framework will try to emphasize performance as one of its dimensions. Cassandra is one of the input source types for MR. Using MR involves time for the map tasks to start and complete, for the shuffle, and for the reduce tasks to start and complete. MR is designed for batch processing, not for instantaneous results. Some level of tuning can be done, but you should be looking for a real-time or stream processing framework.
Take a look at HStreaming (Note that I haven't used it)
I see that the commercial column-store database Vertica has functionality similar to MapReduce, though you express your aggregations in its own dialect of SQL. I'm sure this product is not cheap, though...