Specifying a compression codec for Hive's INSERT OVERWRITE SELECT
I have a Hive table like:
CREATE TABLE beacons
(
foo string,
bar string,
foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );
To populate, I am doing something like:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
This builds a new partition whose files are compressed with deflate, but the ideal here would be to run them through the LZO compression codec instead.
Unfortunately I am not exactly sure how to accomplish that, but I assume it is one of the many runtime settings, or perhaps just an additional line in the CREATE TABLE DDL.
1 Answer
Before the INSERT OVERWRITE, prepend the following runtime configuration values:
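The question already sets the first two of these; the missing piece is the output codec itself. A sketch of the full set of values, assuming the Hadoop-LZO package supplies com.hadoop.compression.lzo.LzopCodec on your cluster:

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;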
Also make sure you have the desired compression codec by checking:
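In the Hive CLI, SET with a property name and no value prints that property, so something like the following shows which codecs are registered (the exact class list will vary by cluster):

SET io.compression.codecs;

The output should include com.hadoop.compression.lzo.LzopCodec; if it does not, the LZO libraries are not installed or not on the classpath.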
Further information about io.seqfile.compression.type can be found at http://wiki.apache.org/hadoop/Hive/CompressedStorage.
I may be mistaken, but it seemed like BLOCK compression (which compresses runs of records together, rather than each record individually as RECORD does) would yield larger files with a higher compression ratio, versus a set of smaller, less effectively compressed files.
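Putting it together with the query from the question, a full session might look like this (again assuming the Hadoop-LZO codec is installed; otherwise the job will fail when it tries to instantiate LzopCodec):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";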