Why are the results of a Hive query split into multiple files?

I have an Amazon Elastic MapReduce job set up to run a Hive query:

CREATE EXTERNAL TABLE output_dailies (
day string, type string, subType string, product string, productDetails string, 
uniqueUsers int, totalUsers int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(DISTINCT accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;

After the job finishes, the output location, which is configured to be on S3, contains 5 files matching the pattern task_201110280815_0001_r_00000x, where x goes from 0 to 4. The files are small, 35 KB each.

Is it possible to instruct Hive to store the results in a single file?


Comments (2)

━╋う一瞬間旳綻放 2024-12-19 12:42:01

In general terms, yes, this is achievable, but at the cost of some scalability.

Try using the setting

"set mapred.reduce.tasks = 1;"

This forces a single reducer, so only one output file will be produced.
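
A minimal sketch of the question's job with that setting applied (the table and query are taken from the question above; note that on newer Hadoop/Hive versions the equivalent property is mapreduce.job.reduces):

-- Force a single reducer for this session, so the INSERT writes one output file
set mapred.reduce.tasks = 1;

INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(DISTINCT accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;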

2024-12-19 12:42:01

They are created by different reduce tasks, each writing its own file - if they all had to append to the same file, that would require a lot of locking and slow things down.

You can treat the multiple files as one big file by just referring to the directory and all its contents.
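
If a single physical file is still wanted without funneling the whole query through one reducer, one alternative is Hive's small-file merge step, which runs a separate merge job after the query finishes. A sketch using the standard merge settings (these properties exist in stock Hive, though defaults and exact behavior vary by version):

-- Let the main query run with full parallelism, then merge the small
-- result files in a follow-up job:
set hive.merge.mapfiles = true;                -- merge outputs of map-only jobs
set hive.merge.mapredfiles = true;             -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize = 16000000;  -- trigger the merge when the average output file is below ~16 MB

-- ...followed by the same INSERT OVERWRITE statement as in the question.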
