Does HDFS compress or encrypt data when storing it?
When I put a file into HDFS, for example:
$ ./bin/hadoop dfs -put /source/file input
- Is the file compressed while storing?
- Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
2 Answers
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing MapReduce jobs to process the compressed data, you'll want to use a splittable compression format.
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you can use a compression codec for encryption/decryption as well. Here are more details about encryption and HDFS.
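To illustrate the point above about compressing intermediate and final MR output: a minimal sketch of the relevant `mapred-site.xml` properties, using the Hadoop 2.x property names and the built-in Gzip codec (older releases use the `mapred.*` names instead):

```xml
<!-- Sketch: enable compression of intermediate (map) output and final job
     output. Property names are the Hadoop 2.x ones; codec choice is just
     an example. -->
<configuration>
  <!-- Compress map output sent to reducers (intermediate data). -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <!-- Compress the final job output written to HDFS. -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
</configuration>
```

Note this only compresses what MapReduce writes; files you `-put` into HDFS yourself are still stored exactly as uploaded.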
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
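The "just upload a file in one of those formats" step can be sketched as follows; the HDFS `input` path is illustrative, and the actual upload line is commented out because it needs a running cluster:

```shell
# Compress a local file with gzip, verify it, then upload it to HDFS.
printf 'some sample records\n' > sample.txt
gzip -f sample.txt                  # replaces sample.txt with sample.txt.gz
gzip -t sample.txt.gz && echo "gzip file OK"
# bin/hadoop dfs -put sample.txt.gz input   # upload step (requires a cluster)
```

Hadoop's input formats pick the decompression codec from the file extension (`.gz`, `.bz2`, ...), which is why uploading in one of these formats is enough.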
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to a mapper. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has higher compression than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP. However it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
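You can see the size side of this trade-off locally; this sketch compares gzip and bzip2 on repetitive text (results vary with the data, and it says nothing about speed or splittability, only that both codecs shrink the input):

```shell
# Compare compressed sizes of the same repetitive input.
seq 1 100000 > data.txt
gzip  -kf data.txt     # -k keeps data.txt, writes data.txt.gz
bzip2 -kf data.txt     # -k keeps data.txt, writes data.txt.bz2
wc -c data.txt data.txt.gz data.txt.bz2
```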
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent encryption would be awesome). Once again, the code and setup steps are at the hadoop-lzo link above.