Specifying a compression codec for Hive's INSERT OVERWRITE SELECT
I have a Hive table like:
CREATE TABLE beacons
(
foo string,
bar string,
foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );
To populate, I am doing something like:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
This builds a new partition whose files are compressed with deflate, but the ideal here would be to run them through the LZO compression codec instead.
Unfortunately I am not exactly sure how to accomplish that, but I assume it is one of the many runtime settings, or perhaps just an additional line in the CREATE TABLE DDL.
1 Answer
Before the INSERT OVERWRITE, prepend the following runtime configuration values:
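The question already sets the first two of these; the missing piece is the output codec itself. A sketch of the full set of values, assuming the Hadoop-LZO package supplies com.hadoop.compression.lzo.LzopCodec on your cluster:

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;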
Also make sure you have the desired compression codec by checking:
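In the Hive CLI, SET with a property name and no value prints that property, so something like the following shows which codecs are registered (the exact class list will vary by cluster):

SET io.compression.codecs;

The output should include com.hadoop.compression.lzo.LzopCodec; if it does not, the LZO libraries are not installed or not on the classpath.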
Further information about io.seqfile.compression.type can be found at http://wiki.apache.org/hadoop/Hive/CompressedStorage.
I may be mistaken, but it seemed like BLOCK compression (which compresses runs of records together, rather than each record individually as RECORD does) would yield larger files with a higher compression ratio, versus a set of smaller, less effectively compressed files.
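Putting it together with the query from the question, a full session might look like this (again assuming the Hadoop-LZO codec is installed; otherwise the job will fail when it tries to instantiate LzopCodec):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";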