Does HDFS compress or encrypt data when storing it?
When I put a file into HDFS, for example:
$ ./bin/hadoop dfs -put /source/file input
- Is the file compressed while storing?
- Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
2 Answers
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing MapReduce jobs to process the compressed data, you'll want to use a splittable compression format.
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you can use a compression codec for encryption/decryption as well. Here are more details about encryption and HDFS.
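To illustrate the point above about compressing intermediate and final MR output: a minimal sketch of the relevant `mapred-site.xml` properties, using the Hadoop 2.x property names and the built-in Gzip codec (older releases use the `mapred.*` names instead):

```xml
<!-- Sketch: enable compression of intermediate (map) output and final job
     output. Property names are the Hadoop 2.x ones; codec choice is just
     an example. -->
<configuration>
  <!-- Compress map output sent to reducers (intermediate data). -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <!-- Compress the final job output written to HDFS. -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
</configuration>
```

Note this only compresses what MapReduce writes; files you `-put` into HDFS yourself are still stored exactly as uploaded.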
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
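The "just upload a file in one of those formats" step can be sketched as follows; the HDFS `input` path is illustrative, and the actual upload line is commented out because it needs a running cluster:

```shell
# Compress a local file with gzip, verify it, then upload it to HDFS.
printf 'some sample records\n' > sample.txt
gzip -f sample.txt                  # replaces sample.txt with sample.txt.gz
gzip -t sample.txt.gz && echo "gzip file OK"
# bin/hadoop dfs -put sample.txt.gz input   # upload step (requires a cluster)
```

Hadoop's input formats pick the decompression codec from the file extension (`.gz`, `.bz2`, ...), which is why uploading in one of these formats is enough.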
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to a mapper. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has higher compression than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP. However it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
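You can see the size side of this trade-off locally; this sketch compares gzip and bzip2 on repetitive text (results vary with the data, and it says nothing about speed or splittability, only that both codecs shrink the input):

```shell
# Compare compressed sizes of the same repetitive input.
seq 1 100000 > data.txt
gzip  -kf data.txt     # -k keeps data.txt, writes data.txt.gz
bzip2 -kf data.txt     # -k keeps data.txt, writes data.txt.bz2
wc -c data.txt data.txt.gz data.txt.bz2
```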
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent encryption would be awesome). Once again, the code and setup steps are at the hadoop-lzo link above.