HBase key-value compression?
Thanks for taking an interest in my question. Before I begin, I'd like to let you know that I'm very new to Hadoop and HBase. So far, I find Hadoop very interesting and would like to contribute more in the future.

I'm primarily interested in improving the performance of HBase. To that end, I modified the Writer methods in HBase's /io/hfile/HFile.java so that they do high-speed buffered data assembly and then write directly to Hadoop, so that the output can later be loaded by HBase.
Now I'm trying to come up with a way to compress key/value pairs to save bandwidth. I did a lot of research to figure out how, and then realized that HBase has built-in compression libraries.

I'm currently looking at SequenceFile (1), setCompressMapOutput (2) (deprecated), and the Compression class (3). I also found a tutorial on Apache's MapReduce.

Could someone explain what a SequenceFile is, and how I can use these compression libraries and algorithms? All these different classes and documents are confusing to me.
I'd sincerely appreciate your help.
--
Hyperlinks:
(1): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
(2): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setCompressMapOutput%28boolean%29
(3): www.apache.org/dist/hbase/docs/apidocs/org/apache/hadoop/hbase/io/hfile/Compression.html
2 Answers
SequenceFile is a key/value pair file format implemented in Hadoop. Even though SequenceFile is used in HBase for storing write-ahead logs, SequenceFile's block compression implementation is not used there.

The Compression class is part of Hadoop's compression framework, and as such it is used in HBase's HFile block compression. This is the compression HBase already has built in: HFile blocks are compressed with the configured algorithm as they are written to disk.
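For context, HFile block compression is normally enabled per column family rather than implemented by hand. Below is a minimal sketch using the 0.90-era Java client API; the table and family names are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedTableExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Describe a table with one column family whose HFile blocks
        // are GZIP-compressed when they are flushed to disk.
        HTableDescriptor table = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("cf");
        family.setCompressionType(Compression.Algorithm.GZ);
        table.addFamily(family);

        admin.createTable(table);
      }
    }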
SequenceFile's built-in block compression cannot be used for the write-ahead log, because buffering key/value pairs for block compression would cause data loss.

HBase RPC compression is a work in progress. As you mentioned, compressing the key/value pairs passed between the client and HBase can save bandwidth and improve HBase performance. This has been implemented in Facebook's version of HBase, 0.89-fb (HBASE-5355), but it has yet to be ported to the official Apache HBase trunk. The RPC compression algorithms supported in HBase 0.89-fb are the same ones supported by the Hadoop compression framework (e.g. GZIP and LZO).
The setCompressMapOutput method is a map-reduce configuration method and is not really relevant to HBase compression.
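For completeness, this is roughly how map output compression is enabled in plain map-reduce with the old JobConf API (a sketch only; the GZIP codec is just an example, and the rest of the job setup is omitted):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompressionExample {
      public static void main(String[] args) {
        JobConf job = new JobConf();

        // Compress the intermediate map output that is shuffled to the reducers.
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(GzipCodec.class);
      }
    }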
--

A SequenceFile is a stream of key/value pairs used by Hadoop. You can read more about it on the Hadoop wiki.
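To make that concrete, here is a minimal sketch that writes a block-compressed SequenceFile with the classic Hadoop API; the output path and the GZIP codec are arbitrary choices for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class SequenceFileExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/pairs.seq");

        // BLOCK compression buffers many key/value pairs and compresses
        // them together, which usually gives a better ratio than
        // compressing each record on its own.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, path, Text.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new GzipCodec());
        try {
          writer.append(new Text("row1"), new Text("value1"));
          writer.append(new Text("row2"), new Text("value2"));
        } finally {
          writer.close();
        }
      }
    }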