Would a scan on HBase run faster if HBase is run on more than 1 machine?

Posted 2025-01-05 23:19:59

I need to do a scan on an HBase table for my ad hoc queries. Currently I'm using just a single node. I was wondering if running HBase in distributed mode on more than 1 machine might make it faster. It currently takes around 5 minutes to scan 3 million rows on an m1.large EC2 machine.
Any ideas on how to make the scan faster are welcome. Currently I have scan.setCaching enabled, which has helped a lot.
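For reference, a minimal sketch of the kind of scan setup described in the question, using the same-era HBase client API; the table name "mytable" and column family "cf" are placeholders, not from the original post:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // hypothetical table name

        Scan scan = new Scan();
        scan.setCaching(1000);                       // rows fetched per RPC instead of 1 -- the big win the asker mentions
        scan.addFamily(Bytes.toBytes("cf"));         // only read the column family you actually need
        // scan.setBatch(n) can also cap columns per Result if rows are very wide
    }
}
```

Raising the caching value trades client memory for fewer round trips, which is usually where single-node scan time goes.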


1 comment

睫毛溺水了 2025-01-12 23:19:59


No, adding nodes will not speed up a scan. HBase scans are serial for a couple of reasons.

When you make a call like HTable.getScanner(scan), what is returned is an iterator of Result objects -- each time you call next(), HBase is actually performing another Get-like query for the next row using the parameters of your scan. All the Scan object itself does is generate a list of row keys and provide an iterator with which you can move through them (it actually does a bit more regarding caching and figuring out which regions the row keys live on, but we can neglect that).
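That iteration pattern can be sketched as follows (a hedged sketch assuming `table` is an open HTable and `scan` is already configured; the helper name `countRows` is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanLoop {
    // Illustrative helper: each advance of the iterator is what triggers
    // the next fetch, modulo whatever setCaching has already buffered.
    static long countRows(HTable table, Scan scan) throws IOException {
        long n = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {  // next() behind this loop may issue another RPC
                n++;
            }
        } finally {
            scanner.close();              // release the server-side scanner
        }
        return n;
    }
}
```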

Beyond the actual mechanisms of a Scan in HBase, there is the matter of regions as the underlying architecture for physically storing data on the disk. The broadest organizing factor in a region file is the column family. This makes sense, since it allows for less overhead when fetching pieces of data in the same column/family. Since column families typically exist within one region (or a set of regions, as the size of the column family grows), the effect of parallelizing a scan would be minimal unless you were doing a scan over enough rows to warrant reading from multiple regions, which is generally advised against (after a certain point, it becomes useful to use map/reduce operations to gather information on and compute over your data set).
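The map/reduce route mentioned above typically goes through TableMapReduceUtil, which splits the scan into one mapper per region so regions are read in parallel; a sketch under the assumption of a Hadoop MapReduce setup (table name, job name, and output types are placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanJob {
    static class RowMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                throws IOException, InterruptedException {
            // emit something per row; one mapper runs per region, in parallel
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "adhoc-scan");
        job.setJarByClass(ScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);   // generally recommended off for full-table MR scans

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, RowMapper.class,
                Text.class, LongWritable.class, job);
        job.setNumReduceTasks(0);     // map-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is the point at which adding nodes does start to pay off: each extra region server hosts regions that can be scanned by their own mapper concurrently.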
