Will running a scan on HBase be faster if HBase runs on more than 1 machine?
I need to do a scan on an HBase table for my ad-hoc queries. Currently I'm using just a single node. I was wondering if running HBase in distributed mode on more than 1 machine might make it faster. It currently takes around 5 mins to do a scan on 3 million rows on an m1.large EC2 machine.
Any ideas on how to make the scan faster are welcome. Currently, I have scan.setCaching enabled, which has helped a lot.
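To illustrate why setCaching helps, here is a toy model (not the real HBase client) of how the caching size trades client/server round trips against rows per trip: with setCaching(n), each trip to the region server returns up to n rows instead of 1.

```java
// Toy model of scanner caching (illustration only, not the HBase client API):
// with setCaching(n), each client/server round trip returns up to n rows,
// so a full scan needs far fewer trips.
public class ScanCachingModel {
    // Round trips needed to stream totalRows with a given caching size.
    static long roundTrips(long totalRows, int caching) {
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 3_000_000L; // roughly the row count from the question
        System.out.println("caching=1   -> " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=500 -> " + roundTrips(rows, 500) + " round trips");
    }
}
```

With caching disabled (effectively 1 row per trip), a 3-million-row scan pays for 3 million round trips; at a caching size of 500 it pays for 6,000, which is why this single setting dominates scan latency on a small cluster.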
No, adding nodes will not speed up a scan. HBase scans are serial for a couple of reasons.
When you make a call like HTable.getScanner(scan), what is returned is an iterator of Result objects. Upon calling next(), HBase is actually performing another Get-like query for the next row using the parameters of your scan. All the Scan object does itself is generate a list of row keys and provide an iterator with which you can move through them (it actually does a bit more regarding caching and figuring out which regions the row keys exist on, but we can neglect that).

Beyond the actual mechanisms of a Scan in HBase, there is the matter of regions as the underlying architecture for physically storing data on disk. The broadest organizing factor in a region file is the column family. This makes sense, since it allows for less overhead when fetching pieces of data in the same column/family. Since column families typically exist within one region (or a set of regions, as the size of the column family grows), the effect of parallelizing a scan would be minimal unless you were doing a scan over enough rows to warrant reading from multiple regions, which is generally advised against (after a certain point, it becomes useful to use map/reduce operations to gather information on and compute over your data set).
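The serial, one-fetch-per-next() model described above can be sketched with a toy iterator. All names here are illustrative stand-ins, not the real HBase client API:

```java
import java.util.Iterator;
import java.util.List;

// Toy sketch of why a scan is serial: the "scanner" is just an iterator
// over row keys that performs one Get-like fetch per next() call, in order.
// Illustrative names only; the real client adds caching and region lookup.
public class SerialScanSketch {
    static int fetches = 0; // counts simulated Get-like calls to the server

    // Stand-in for a per-row Get against a region server.
    static String getRow(String rowKey) {
        fetches++;
        return "value-for-" + rowKey;
    }

    // The "scanner": walks the row keys and fetches each row lazily.
    static Iterator<String> getScanner(List<String> rowKeys) {
        Iterator<String> keys = rowKeys.iterator();
        return new Iterator<String>() {
            public boolean hasNext() { return keys.hasNext(); }
            public String next()     { return getRow(keys.next()); } // one fetch per next()
        };
    }

    public static void main(String[] args) {
        Iterator<String> scanner = getScanner(List.of("r1", "r2", "r3"));
        while (scanner.hasNext()) scanner.next();
        System.out.println("fetches = " + fetches); // one serial fetch per row
    }
}
```

Since each next() depends on the position established by the previous one, adding region servers does not let a single scanner issue these fetches concurrently; only a larger caching batch (or splitting the work into separate scans per key range, as a map/reduce job does) changes the picture.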