HBase multithreaded scanning is really slow

Posted 2024-12-27 17:20:12


I'm using HBase to store some time series data. Using the suggestion in the O'Reilly HBase book I am using a row key that is the timestamp of the data with a salted prefix. To query this data I am spawning multiple threads which implement a scan over a range of timestamps with each thread handling a particular prefix. The results are then placed into a concurrent hashmap.
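For concreteness, here is a minimal, self-contained sketch of one way to build such salted row keys. The six-bucket count and the one-byte-salt-plus-8-byte-timestamp layout are assumptions matching the setup described above, not the asker's actual code:

```java
import java.nio.ByteBuffer;

public class SaltedKeys {
    static final int NUM_SALTS = 6; // one bucket per region server, per the question

    // Prepend a one-byte salt derived from the timestamp, so that
    // monotonically increasing timestamps spread across NUM_SALTS regions
    // instead of hot-spotting a single region.
    static byte[] saltedKey(long timestamp) {
        byte salt = (byte) Math.floorMod(Long.hashCode(timestamp), NUM_SALTS);
        ByteBuffer buf = ByteBuffer.allocate(1 + Long.BYTES);
        buf.put(salt);
        buf.putLong(timestamp);
        return buf.array(); // 9 bytes: [salt][timestamp]
    }

    public static void main(String[] args) {
        byte[] key = saltedKey(1703671212000L);
        System.out.println(key.length); // 9
        System.out.println(key[0]);     // salt bucket in [0, 5]
    }
}
```

To scan a time range you then run one scan per salt bucket, each covering `[salt + startTs, salt + stopTs)`, and merge the results.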

Trouble occurs when the threads attempt to perform their scan. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).

I've tried to use HTablePools to get around what I thought was an issue with HTable being not thread-safe, but this did not result in any better performance.

In particular, I notice a significant slowdown when I hit this portion of my code:

for (Result res : rowScanner) {
    // add each Result to the ConcurrentHashMap
}

Through logging I noticed that every time through the loop condition I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.

I assume that there is some kind of issue with resource locking but I just can't see it.
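For reference, a sketch of the pattern described above using the current HBase client API (`ConnectionFactory`/`Table`, which replaced `HTable`/`HTablePool`; `Table` instances are not thread-safe, so each worker gets its own). The table name, salt count, key layout, and timestamp range are illustrative assumptions; `withStartRow`/`withStopRow` require HBase 1.4+:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.concurrent.*;

public class ParallelSaltedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ConcurrentHashMap<String, Result> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(6);
        long startTs = 0L, stopTs = Long.MAX_VALUE; // illustrative range

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            for (int salt = 0; salt < 6; salt++) {
                final byte[] prefix = new byte[] { (byte) salt };
                pool.submit(() -> {
                    // Each thread gets its own Table; Table is not thread-safe.
                    try (Table table = conn.getTable(TableName.valueOf("timeseries"));
                         ResultScanner scanner = table.getScanner(
                             new Scan()
                                 .withStartRow(Bytes.add(prefix, Bytes.toBytes(startTs)))
                                 .withStopRow(Bytes.add(prefix, Bytes.toBytes(stopTs)))
                                 .setCaching(1000))) { // fetch many rows per RPC
                        for (Result res : scanner) {
                            results.put(Bytes.toString(res.getRow()), res);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }
}
```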


Comments (2)

猥琐帝 2025-01-03 17:20:12


Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.

EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.

Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.

楠木可依 2025-01-03 17:20:12


Have you considered using MapReduce, with perhaps just a mapper, to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not thread-safe. TableMapReduceUtil makes it easy to set up jobs.
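A minimal sketch of the mapper-only job this answer describes; the table name and the identity-style mapper body are placeholders, and `setCacheBlocks(false)` follows the usual recommendation for MapReduce scans:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJob {
    // TableMapReduceUtil creates roughly one map task per region,
    // so the scan is split across RegionServers automatically.
    static class RowMapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(key, new ImmutableBytesWritable(value.getRow()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "salted-scan");
        job.setJarByClass(ScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false); // recommended for MapReduce scans

        TableMapReduceUtil.initTableMapperJob(
            "timeseries", scan, RowMapper.class,
            ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
        job.setNumReduceTasks(0); // mapper-only, as suggested
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```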
