Large-scale data processing: HBase vs. Cassandra
I nearly settled on Cassandra after my research on large-scale data storage solutions, but it is generally said that HBase is the better solution for large-scale data processing and analysis.
Both are similar key/value stores, and both run (or, in Cassandra's case, recently can run) a Hadoop layer, so what makes HBase the better candidate when processing/analysis is required on large data?
I also found good details about both at
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
but I'm still looking for concrete advantages of HBase.
That said, I am more inclined toward Cassandra because of its simplicity in adding nodes, its seamless replication, and its lack of a single point of failure. It also keeps a secondary index feature, which is a good plus.
3 Answers
As a Cassandra developer, I'm better at answering the other side of the question:
To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.
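To make the versioning difference concrete, here is a minimal sketch of implicit cell versioning through the HBase native Java client of that era (roughly 0.90); the table name "mytable", column family "cf", and qualifier "col" are hypothetical. Each put on the same cell is stored as a new timestamped version, and a Get can ask for several versions back at once.

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersioningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "mytable" with column family "cf".
            HTable table = new HTable(conf, "mytable");

            // Two writes to the same cell; HBase keeps both as timestamped versions.
            Put first = new Put(Bytes.toBytes("row1"));
            first.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("old value"));
            table.put(first);

            Put second = new Put(Bytes.toBytes("row1"));
            second.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("new value"));
            table.put(second);

            // Ask for up to 3 versions of the cell instead of just the latest one.
            Get get = new Get(Bytes.toBytes("row1"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            List<KeyValue> versions = result.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            for (KeyValue kv : versions) {
                System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
            }

            table.close();
        }
    }

In Cassandra's model at the time, by contrast, a second write to the same column simply replaces the value, and older data is only reachable if you model it yourself (for example with SuperColumns or time-bucketed column names).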
Trying to determine which is best for you really depends on what you are going to use it for; they each have their advantages, and without any more details it becomes more of a religious war. The post you referenced is also more than a year old, and both have gone through many changes since then. Please also keep in mind that I am not familiar with the more recent Cassandra developments.
Having said that, I'll paraphrase HBase committer Andrew Purtell and add some of my own experiences:
HBase is in larger production environments (1,000 nodes), although that is still in the ballpark of Cassandra's ~400-node installs, so it's really a marginal difference.
HBase and Cassandra both support replication between clusters/datacenters. I believe HBase exposes more of it to the user, so it appears more complicated, but you also get more flexibility.
If strong consistency is what your application needs, then HBase is likely a better fit; it is designed from the ground up to be consistent. For example, it allows for simpler implementation of atomic counters (I think Cassandra just got them) as well as check-and-put operations; see the sketch after this list.
Write performance is great; from what I understand, that was one of the reasons Facebook went with HBase for their messenger.
I'm not sure of the current state of Cassandra's ordered partitioner, but in the past it required manual rebalancing. HBase handles that for you if you want. Ordered partitioning is important for Hadoop-style processing (a range-scan sketch also follows this list).
Cassandra and HBase are both complex; Cassandra just hides it better. HBase exposes more of it by using HDFS for its storage, but if you look at the codebase, Cassandra is just as layered. If you compare the Dynamo and Bigtable papers, you can see that Cassandra's theory of operation is actually more complex.
HBase has more unit tests FWIW.
All Cassandra RPC is Thrift; HBase has Thrift, REST, and native Java interfaces. Thrift and REST only offer a subset of the total client API, but if you want pure speed, the native Java client is there.
There are advantages to both peer-to-peer and master-slave architectures. The master-slave setup generally makes debugging easier and reduces quite a bit of complexity.
HBase is not tied only to traditional HDFS; you can change out the underlying storage depending on your needs. MapR looks quite interesting, and I have heard good things, although I have not used it myself.
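On the counters and check-and-put point above: here is a minimal sketch against the native Java client of that era, with a hypothetical "counters" table and "cf" column family. Both calls are applied atomically on the server for a single row.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicOpsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "counters" with column family "cf".
            HTable table = new HTable(conf, "counters");

            // Atomic counter: increments the cell server-side and returns the new value.
            long views = table.incrementColumnValue(
                    Bytes.toBytes("page#home"), Bytes.toBytes("cf"), Bytes.toBytes("views"), 1L);
            System.out.println("views = " + views);

            // Check-and-put: the Put is applied only if "owner" still equals "worker-1".
            Put takeOver = new Put(Bytes.toBytes("lock#job42"));
            takeOver.add(Bytes.toBytes("cf"), Bytes.toBytes("owner"), Bytes.toBytes("worker-2"));
            boolean applied = table.checkAndPut(
                    Bytes.toBytes("lock#job42"), Bytes.toBytes("cf"), Bytes.toBytes("owner"),
                    Bytes.toBytes("worker-1"), takeOver);
            System.out.println("takeover applied = " + applied);

            table.close();
        }
    }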
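And on ordered partitioning: because HBase keeps row keys globally sorted, a batch job can scan a contiguous key range and touch only the regions that hold it, which is what the Hadoop-style processing mentioned above relies on. A minimal sketch with the native Java client; the "events" table and date-prefixed row keys are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "events" whose row keys start with a sortable date prefix.
            HTable table = new HTable(conf, "events");

            // Rows are stored in sorted key order, so this scan only touches the
            // key range for 2011-09-01 (the stop row is exclusive).
            Scan scan = new Scan(Bytes.toBytes("2011-09-01"), Bytes.toBytes("2011-09-02"));
            scan.setCaching(500); // fetch rows in batches to reduce RPC round trips

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
            }
            table.close();
        }
    }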
The reason for using 100-node HBase clusters is not that HBase does not scale to larger sizes. It is because it is easier to do HBase/HDFS software upgrades in a rolling fashion without bringing down the entire service. Another reason is to prevent a single NameNode from being a SPOF for the entire service. Also, HBase is being used for various services (not just FB Messages), and it is prudent to have a cookie-cutter approach to setting up numerous HBase clusters based on a 100-node pod approach. The number 100 is ad hoc; we have not focused on whether 100 is optimal or not.