HBase access and indexing
I have an HBase table with about 50 million rows, and each row has several columns. My goal is to retrieve from the table those rows that have a given value in a given column, e.g. rows whose column 'col_1' has the value 'val_1'.
I have two options to choose from:
- scan the table from beginning to end, checking each row to see whether it should be retrieved;
- build an index for this table (e.g., an index over the values in column 'col_1'), then for a given column value 'val_1', get all the row keys associated with the index entry 'val_1', and then go through these row keys and retrieve the corresponding rows. In my mind this will involve random access to the original HBase table.
Can anyone suggest which option runs faster, or is there a better option?
Thanks a lot!
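For concreteness, here is a minimal sketch of what the two read paths could look like with the HBase Java client (assuming the 2.x client API; the column family `cf` and the index-table layout — one index row per value, with one column per matching main-table row key — are assumptions, not part of the question):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoLookupOptions {

    static final byte[] CF = Bytes.toBytes("cf");       // assumed column family
    static final byte[] COL_1 = Bytes.toBytes("col_1");
    static final byte[] VAL_1 = Bytes.toBytes("val_1");

    // Option 1: full table scan, with the col_1 = val_1 check pushed to the
    // region servers via a filter (still touches every row).
    static List<Result> scanWithFilter(Table mainTable) throws IOException {
        SingleColumnValueFilter filter =
                new SingleColumnValueFilter(CF, COL_1, CompareOperator.EQUAL, VAL_1);
        filter.setFilterIfMissing(true);   // skip rows that do not have col_1 at all
        Scan scan = new Scan();
        scan.setFilter(filter);

        List<Result> matches = new ArrayList<>();
        try (ResultScanner scanner = mainTable.getScanner(scan)) {
            for (Result r : scanner) {
                matches.add(r);
            }
        }
        return matches;
    }

    // Option 2: read the matching row keys from an index table, then fetch the
    // rows from the main table with a batched multi-get (random reads).
    static Result[] lookupViaIndex(Table indexTable, Table mainTable) throws IOException {
        // Assumed index layout: row key = indexed value, one column per main-table row key.
        Result indexRow = indexTable.get(new Get(VAL_1));
        if (indexRow.isEmpty()) {
            return new Result[0];          // no row in the index for this value
        }
        List<Get> gets = new ArrayList<>();
        for (byte[] mainRowKey : indexRow.getFamilyMap(CF).keySet()) {
            gets.add(new Get(mainRowKey));
        }
        return mainTable.get(gets);
    }
}
```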
3 Answers
Are you asking whether adding an index will make it faster? The answer is, of course, yes. See the HBase wiki for ideas about secondary indexes.
An index will surely work faster than scanning 50M rows every time. If you use an HBase version that already has coprocessors, you can follow Xodarap's advice. If you are using an older version of HBase, you need to set up an additional table to act as the index and update it manually (either every time you update the main table, or occasionally via map/reduce).
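If you do maintain the index table yourself, the write path could look roughly like this (a sketch, not the only possible layout: it assumes the index table's row key is the indexed value, with one column per main-table row key, and the family `cf` is a placeholder):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexWriter {

    static final byte[] CF = Bytes.toBytes("cf");   // assumed column family in both tables

    // Write a row to the main table and mirror the (value -> row key) mapping
    // into the index table in the same call path. Without coprocessors the two
    // puts are not atomic, so the index can briefly lag behind the main table.
    static void putWithIndex(Table mainTable, Table indexTable,
                             byte[] rowKey, byte[] col1Value) throws IOException {
        // Main table: rowKey -> { cf:col_1 = col1Value }
        Put mainPut = new Put(rowKey);
        mainPut.addColumn(CF, Bytes.toBytes("col_1"), col1Value);
        mainTable.put(mainPut);

        // Index table: col1Value -> { cf:<rowKey> = empty }
        // i.e. the index row key is the indexed value, one column per main row key.
        Put indexPut = new Put(col1Value);
        indexPut.addColumn(CF, rowKey, new byte[0]);
        indexTable.put(indexPut);
    }
}
```

The occasional map/reduce rebuild mentioned above is the alternative to this dual-write approach: leave the write path untouched and periodically regenerate the index table from a full pass over the main table.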
A secondary index will be faster. You can also try a secondary index library such as culvert instead of creating your own index.