HBase key-value compression?

Posted on 2024-11-16 11:33:08


Thanks for taking interest in my question.
Before I begin, I'd like to let you know that I'm very new to Hadoop & HBase. So far, I find Hadoop very interesting and would like to contribute more in the future.

I'm primarily interested in improving the performance of HBase. To do so, I modified the Writer methods in HBase's /io/hfile/HFile.java so that they do high-speed buffered data assembly and then write directly to Hadoop, so the result can later be loaded by HBase.

Now, I'm trying to come up with a way to compress key-value pairs so that bandwidth can be saved. I've done a lot of research to figure out how to do this, and then realized that HBase has built-in compression libraries.

I'm currently looking at SequenceFile (1); setCompressMapOutput (2), which is deprecated; and the Compression class (3). I also found Apache's MapReduce tutorial.

Could someone explain what "SequenceFile" is, and how I can implement those compression libraries and algorithms? These different classes and documents are so confusing to me.

I'd sincerely appreciate your help.

--

Hyperlinks:

(1): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html

(2): hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setCompressMapOutput%28boolean%29

(3): www.apache.org/dist/hbase/docs/apidocs/org/apache/hadoop/hbase/io/hfile/Compression.html


Comments (2)

白首有我共你 2024-11-23 11:33:08


SequenceFile is a key/value pair file format implemented in Hadoop. Even though SequenceFile is used in HBase for storing write-ahead logs, SequenceFile's block compression implementation is not used there.
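
To make that concrete, here is a minimal sketch of writing and reading a block-compressed SequenceFile through the plain Hadoop API (the path and the GzipCodec choice are placeholders for illustration; this uses the older createWriter overload that takes a FileSystem):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class SequenceFileDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq");  // placeholder path

        // Write key/value pairs with BLOCK compression: whole batches of
        // records are compressed together, similar in spirit to HFile blocks.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, path, Text.class, IntWritable.class,
            SequenceFile.CompressionType.BLOCK, new GzipCodec());
        try {
          for (int i = 0; i < 100; i++) {
            writer.append(new Text("row-" + i), new IntWritable(i));
          }
        } finally {
          writer.close();
        }

        // Read the pairs back; decompression is handled transparently.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          Text key = new Text();
          IntWritable value = new IntWritable();
          while (reader.next(key, value)) {
            System.out.println(key + " => " + value);
          }
        } finally {
          reader.close();
        }
      }
    }

Because BLOCK compression buffers many records before compressing them, it is unsuitable for a write-ahead log that must persist every record as it arrives, which is the point made about the WAL below.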

The Compression class is part of Hadoop's compression framework and as such is used in HBase's HFile block compression.
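
As a point of reference, that codec framework can also be used directly; a small sketch, with the codec class and file path chosen purely for illustration:

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CodecDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate a codec from Hadoop's compression framework.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap a raw HDFS output stream in a compressing stream.
        Path path = new Path("/tmp/demo" + codec.getDefaultExtension());  // placeholder
        OutputStream out = codec.createOutputStream(fs.create(path));
        out.write("hello, compressed world".getBytes("UTF-8"));
        out.close();
      }
    }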

HBase already has built-in compression of the following types:

  • HFile block compression on disk. This uses Hadoop's codec framework and supports compression algorithms such as LZO, GZIP, and SNAPPY. This type of compression is only applied to HFile blocks that are stored on disk, because the whole block needs to be decompressed to retrieve key/value pairs. (A configuration sketch follows this list.)
  • In-cache key compression (called "data block encoding" in HBase terminology); see HBASE-4218. Implemented encoding algorithms include various types of prefix and delta encoding, and trie encoding is being implemented as of this writing (HBASE-4676). Data block encoding algorithms take advantage of the redundancy between sorted keys in an HFile block and only store the differences between consecutive keys. These algorithms currently do not deal with values, and are therefore mostly useful when values are small relative to key size, e.g. counters. Due to the lightweight nature of these encodings, it is possible to efficiently decode only the necessary part of a block to retrieve the requested key or advance to the next key, which is why they are good for improving cache efficiency. However, on some real-world datasets delta encoding also saves up to 50% on top of LZO compression (e.g. applying delta encoding and then LZO vs. LZO alone), thus achieving significant savings on disk as well.
  • A custom dictionary-based write-ahead log compression approach is implemented in HBASE-4608. Note: even though SequenceFile is used for write-ahead log storage in HBase, SequenceFile's built-in block compression cannot be used for the write-ahead log, because buffering key/value pairs for block compression would cause data loss.
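
To connect the first two items above to actual configuration: HFile block compression and data block encoding are both per-column-family settings. A rough sketch using the classic HBaseAdmin/HColumnDescriptor API (the GZ and FAST_DIFF choices are illustrative, and the data block encoding setter assumes an HBase version that ships it, roughly 0.94 or later):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedTableDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor family = new HColumnDescriptor("d");
        // 1) On-disk HFile block compression. The codec must be installed on
        //    every region server; GZ works out of the box, LZO/SNAPPY do not.
        family.setCompressionType(Compression.Algorithm.GZ);
        // 2) In-cache key compression, a.k.a. data block encoding (HBASE-4218).
        family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

        HTableDescriptor table = new HTableDescriptor("demo_table");
        table.addFamily(family);
        admin.createTable(table);
        admin.close();

        // Roughly equivalent HBase shell command:
        //   create 'demo_table', {NAME => 'd', COMPRESSION => 'GZ',
        //                         DATA_BLOCK_ENCODING => 'FAST_DIFF'}
      }
    }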

HBase RPC compression is a work in progress. As you mentioned, compressing key/value pairs passed between client and HBase can save bandwidth and improve HBase performance. This has been implemented in Facebook's version of HBase, 0.89-fb (HBASE-5355) but has yet to be ported to the official Apache HBase trunk. RPC compression algorithms supported in HBase 0.89-fb are the same as those supported by the Hadoop compression framework (e.g. GZIP and LZO).

The setCompressMapOutput method is a MapReduce configuration method and is not really relevant to HBase compression.
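
For completeness, here is roughly where that setting lives in a plain MapReduce job configuration (the input/output paths are placeholders and the job relies on the default identity mapper and reducer); it compresses only the intermediate map output that is shuffled to reducers, not anything HBase stores:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompressionDemo {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(MapOutputCompressionDemo.class);
        job.setJobName("map-output-compression-demo");

        // Compress the intermediate map output shuffled to reducers;
        // this has no effect on what HBase writes to disk.
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(GzipCodec.class);

        // Defaults: TextInputFormat + identity mapper/reducer,
        // so keys are LongWritable offsets and values are Text lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path("/tmp/in"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/tmp/out"));  // placeholder

        JobClient.runJob(job);
      }
    }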

左秋 2024-11-23 11:33:08


A SequenceFile is a stream of key/value pairs used by Hadoop. You can read more about it on the Hadoop wiki.
