Large-scale data processing: HBase vs. Cassandra
I nearly settled on Cassandra after my research on large-scale data storage solutions, but it is generally said that HBase is the better solution for large-scale data processing and analysis.
Both are similar key/value stores, and both run (or, in Cassandra's case, recently can run) a Hadoop layer, so what makes HBase the better candidate when processing/analysis is required on large data?
I also found good details about both at
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
but I'm still looking for concrete advantages of HBase.
That said, I am more inclined toward Cassandra because of its simplicity in adding nodes, its seamless replication, and its lack of a single point of failure. It also keeps a secondary index feature, which is a good plus.
3 Answers
As a Cassandra developer, I'm better at answering the other side of the question:
To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.
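To make the versioning difference concrete, here is a minimal sketch of implicit cell versioning through the HBase native Java client of that era (roughly 0.90); the table name "mytable", column family "cf", and qualifier "col" are hypothetical. Each put on the same cell is stored as a new timestamped version, and a Get can ask for several versions back at once.

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersioningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "mytable" with column family "cf".
            HTable table = new HTable(conf, "mytable");

            // Two writes to the same cell; HBase keeps both as timestamped versions.
            Put first = new Put(Bytes.toBytes("row1"));
            first.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("old value"));
            table.put(first);

            Put second = new Put(Bytes.toBytes("row1"));
            second.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("new value"));
            table.put(second);

            // Ask for up to 3 versions of the cell instead of just the latest one.
            Get get = new Get(Bytes.toBytes("row1"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            List<KeyValue> versions = result.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            for (KeyValue kv : versions) {
                System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
            }

            table.close();
        }
    }

In Cassandra's model at the time, by contrast, a second write to the same column simply replaces the value, and older data is only reachable if you model it yourself (for example with SuperColumns or time-bucketed column names).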
Trying to determine which is best for you really depends on what you are going to use it for; they each have their advantages, and without any more details it becomes more of a religious war. The post you referenced is also more than a year old, and both have gone through many changes since then. Please also keep in mind that I am not familiar with the more recent Cassandra developments.
Having said that, I'll paraphrase HBase committer Andrew Purtell and add some of my own experiences:
HBase is in larger production environments (1,000 nodes), although that is still in the ballpark of Cassandra's ~400-node installs, so it's really a marginal difference.
HBase and Cassandra both support replication between clusters/datacenters. I believe HBase exposes more of it to the user, so it appears more complicated, but you also get more flexibility.
If strong consistency is what your application needs, then HBase is likely a better fit; it is designed from the ground up to be consistent. For example, it allows for simpler implementation of atomic counters (I think Cassandra just got them) as well as check-and-put operations; see the sketch after this list.
Write performance is great; from what I understand, that was one of the reasons Facebook went with HBase for their messenger.
I'm not sure of the current state of Cassandra's ordered partitioner, but in the past it required manual rebalancing. HBase handles that for you if you want. Ordered partitioning is important for Hadoop-style processing (a range-scan sketch also follows this list).
Cassandra and HBase are both complex; Cassandra just hides it better. HBase exposes more of it by using HDFS for its storage, but if you look at the codebase, Cassandra is just as layered. If you compare the Dynamo and Bigtable papers, you can see that Cassandra's theory of operation is actually more complex.
HBase has more unit tests FWIW.
All Cassandra RPC is Thrift; HBase has Thrift, REST, and native Java interfaces. Thrift and REST only offer a subset of the total client API, but if you want pure speed, the native Java client is there.
There are advantages to both peer-to-peer and master-slave architectures. The master-slave setup generally makes debugging easier and reduces quite a bit of complexity.
HBase is not tied only to traditional HDFS; you can change out the underlying storage depending on your needs. MapR looks quite interesting, and I have heard good things, although I have not used it myself.
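On the counters and check-and-put point above: here is a minimal sketch against the native Java client of that era, with a hypothetical "counters" table and "cf" column family. Both calls are applied atomically on the server for a single row.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicOpsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "counters" with column family "cf".
            HTable table = new HTable(conf, "counters");

            // Atomic counter: increments the cell server-side and returns the new value.
            long views = table.incrementColumnValue(
                    Bytes.toBytes("page#home"), Bytes.toBytes("cf"), Bytes.toBytes("views"), 1L);
            System.out.println("views = " + views);

            // Check-and-put: the Put is applied only if "owner" still equals "worker-1".
            Put takeOver = new Put(Bytes.toBytes("lock#job42"));
            takeOver.add(Bytes.toBytes("cf"), Bytes.toBytes("owner"), Bytes.toBytes("worker-2"));
            boolean applied = table.checkAndPut(
                    Bytes.toBytes("lock#job42"), Bytes.toBytes("cf"), Bytes.toBytes("owner"),
                    Bytes.toBytes("worker-1"), takeOver);
            System.out.println("takeover applied = " + applied);

            table.close();
        }
    }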
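And on ordered partitioning: because HBase keeps row keys globally sorted, a batch job can scan a contiguous key range and touch only the regions that hold it, which is what the Hadoop-style processing mentioned above relies on. A minimal sketch with the native Java client; the "events" table and date-prefixed row keys are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "events" whose row keys start with a sortable date prefix.
            HTable table = new HTable(conf, "events");

            // Rows are stored in sorted key order, so this scan only touches the
            // key range for 2011-09-01 (the stop row is exclusive).
            Scan scan = new Scan(Bytes.toBytes("2011-09-01"), Bytes.toBytes("2011-09-02"));
            scan.setCaching(500); // fetch rows in batches to reduce RPC round trips

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
            }
            table.close();
        }
    }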
The reason for using 100-node HBase clusters is not that HBase does not scale to larger sizes. It is because it is easier to do HBase/HDFS software upgrades in a rolling fashion without bringing down the entire service. Another reason is to prevent a single NameNode from being a SPOF for the entire service. Also, HBase is being used for various services (not just FB Messages), and it is prudent to have a cookie-cutter approach to setting up numerous HBase clusters based on a 100-node pod approach. The number 100 is ad hoc; we have not focused on whether 100 is optimal or not.