How do we serve live traffic and run analytics with HBase and Hadoop? (One cluster vs. separate clusters?)

Posted 2024-11-19 06:58:53

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.

Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.

I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.

Now my question is: how are those two different use cases (serving live traffic and doing batch analytics) reconciled? Do organizations use systems to write all data to two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and the others do only analytics?
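
To make the dual-cluster option concrete, here is roughly the write path I picture: the application opens a connection to each cluster and applies every mutation to both. This is only a minimal sketch with the HTable client API; the table name, column family and ZooKeeper quorums are invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DualWriter {
      public static void main(String[] args) throws IOException {
        // One configuration per cluster; only the ZooKeeper quorum differs.
        Configuration liveConf = HBaseConfiguration.create();
        liveConf.set("hbase.zookeeper.quorum", "zk-live-1,zk-live-2,zk-live-3");      // hypothetical hosts
        Configuration analyticsConf = HBaseConfiguration.create();
        analyticsConf.set("hbase.zookeeper.quorum", "zk-analytics-1,zk-analytics-2"); // hypothetical hosts

        HTable liveTable = new HTable(liveConf, "events");            // hypothetical table name
        HTable analyticsTable = new HTable(analyticsConf, "events");

        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));

        // Apply the same mutation to both clusters; in practice the second write
        // would likely go through a queue so an analytics outage cannot block live traffic.
        liveTable.put(put);
        analyticsTable.put(put);

        liveTable.close();
        analyticsTable.close();
      }
    }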

What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data onto the nodes that are used for analytics (raising the replication factor above the default of 3 probably makes sense in such a scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to a different rack, so this seems to mesh well with what I'm describing.
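
Concretely, I think that comes down to two pieces of HDFS configuration: a higher replication factor and a topology script that maps the live-traffic and analytics nodes to different "racks". A rough sketch of those settings, assuming a hypothetical script path (the same properties would normally live in hdfs-site.xml rather than be set in code):

    import org.apache.hadoop.conf.Configuration;

    public class ReplicationTopologyConfig {
      public static Configuration build() {
        Configuration conf = new Configuration();
        // Keep more copies than the default of 3 so both "sides" of the cluster
        // are likely to hold at least one replica of every block.
        conf.setInt("dfs.replication", 4);
        // Script that maps a datanode's hostname/IP to a rack id; here the "racks"
        // would really be the live-traffic group vs. the analytics group.
        conf.set("topology.script.file.name", "/etc/hadoop/topology.sh");  // hypothetical path
        return conf;
      }
    }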

The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
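
If I understand the MRv1 slot model correctly, those slot counts are per-TaskTracker properties, so the live-traffic nodes would simply carry a configuration like the one below (normally set in their mapred-site.xml; shown programmatically only as a sketch) while the analytics nodes keep the defaults:

    import org.apache.hadoop.conf.Configuration;

    public class LiveNodeSlots {
      // Settings intended for the live-traffic TaskTrackers only (MRv1-era slot model);
      // analytics nodes keep their normal slot counts.
      public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 0);     // zero (or very few) map slots
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 0);  // zero (or very few) reduce slots
        return conf;
      }
    }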

The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.

Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?

I'd love to get your opinions on this :) !

Thanks.

EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?

--
Felix

Comments (2)

迷荒 2024-11-26 06:58:53

Your approach has a few problems... even in rack-aware mode, if you have more than a few racks I don't see how you can guarantee that your data will be replicated onto the nodes you intend. If you lose one of your "live" nodes, then you will be under-replicated for a while and won't have access to that data.

HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.

In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
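
To give an idea of what that offload step can look like, here is a rough sketch of a bulk load: a map-only job filters and transforms the raw files already sitting in HDFS into Puts, HFileOutputFormat turns them into HFiles, and the HFiles are then handed to the HBase cluster. The class names, table name and column layout are invented for the example, so treat it as a sketch rather than our exact pipeline:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PreprocessAndBulkLoad {

      // Enrich/filter/transform each raw input line into a Put destined for HBase.
      public static class CleanupMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          if (fields.length < 2) return;                        // drop records we don't want
          byte[] rowKey = Bytes.toBytes(fields[0]);
          Put put = new Put(rowKey);
          put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(fields[1]));
          ctx.write(new ImmutableBytesWritable(rowKey), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();        // points at the HBase cluster
        Job job = new Job(conf, "preprocess-and-bulk-load");
        job.setJarByClass(PreprocessAndBulkLoad.class);
        job.setMapperClass(CleanupMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw data on the Hadoop cluster's HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // staging directory for the HFiles

        HTable table = new HTable(conf, "events");               // hypothetical target table
        HFileOutputFormat.configureIncrementalLoad(job, table);  // sets reducer/partitioner for HFile output

        if (job.waitForCompletion(true)) {
          // Hand the generated HFiles over to the region servers.
          new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
        }
      }
    }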

There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest that instead of having hybrid nodes, you just dedicate some nodes to your Hadoop cluster and some nodes to your HBase cluster. The network transfer between the two will be quite snappy.

Just my personal experience, so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!

七堇年 2024-11-26 06:58:53

I think this kind of solution might make sense, since MR is mostly CPU-intensive and HBase is a memory-hungry beast. What we really need is to arrange resource management properly. I think it is possible in the following way (a rough configuration sketch follows the list):
a) CPU. We can define the maximum number of MR mapper/reducer slots per node, and assuming each mapper is single-threaded, we can cap the CPU consumption of MR. The rest goes to HBase.
b) Memory. We can limit the memory given to mappers and reducers and leave the rest to HBase.
c) I don't think we can properly manage HDFS bandwidth sharing, but that should not be a problem for HBase, since for HBase disk operations are not on the critical path.
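
A rough idea of the knobs I mean, for the MRv1 slot model; the numbers are placeholders and would normally go into each shared node's mapred-site.xml rather than be set in code:

    import org.apache.hadoop.conf.Configuration;

    public class SharedNodeLimits {
      public static Configuration build() {
        Configuration conf = new Configuration();
        // (a) CPU: cap concurrent map/reduce tasks so some cores stay free for HBase.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);
        // (b) Memory: cap each task's JVM heap; whatever is left goes to the RegionServer.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
      }
    }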
