Why are Hive query results split into multiple files?
I have an Amazon Elastic MapReduce job set up to run the following Hive query:
CREATE EXTERNAL TABLE output_dailies (
day string, type string, subType string, product string, productDetails string,
uniqueUsers int, totalUsers int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}';
INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(distinct accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;
After the job finishes, the output location, which is configured to be on S3, contains 5 files matching the pattern task_201110280815_0001_r_00000x, where x goes from 0 to 4. The files are small, about 35 KB each.
Is it possible to instruct Hive to store the results in a single file?
2 Answers
In general terms, yes, this is achievable, but at the cost of some scalability.
Try using the setting
"set mapred.reduce.tasks = 1;"
This forces a single reducer, so only one output file will be produced.
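Applied to the query from the question, the setting goes in the same Hive session, before the INSERT. A sketch (mapred.reduce.tasks is the classic MapReduce-era property name; on newer Hadoop/Hive versions the equivalent is mapreduce.job.reduces):

```sql
-- Force a single reducer so the INSERT writes one output file.
-- (On newer Hadoop versions the property is mapreduce.job.reduces.)
set mapred.reduce.tasks = 1;

INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(distinct accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;
```

Note that with a single reducer the whole GROUP BY output flows through one task, which is fine for small results like the 35 KB files here but will not scale to large outputs.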
They are created by different reducers running on different data nodes. Each one writes its own file; if they all had to append to the same file, that would require lots of locking and slow things down.
You can treat the multiple files as one big file by just referring to the directory and all its contents.
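If you still want Hive itself to consolidate the small part files, a hedged alternative to forcing one reducer is Hive's built-in small-file merge settings, which run an extra merge step after the job. These properties exist in stock Hive, but whether they take effect on a given EMR/Hive version should be verified:

```sql
-- Ask Hive to merge small output files in a follow-up step,
-- instead of limiting the reducer count.
set hive.merge.mapfiles = true;      -- merge small files from map-only jobs
set hive.merge.mapredfiles = true;   -- merge small files from map-reduce jobs
set hive.merge.smallfiles.avgsize = 16000000;  -- merge when avg output file size falls below this (bytes)
```

This keeps the query itself parallel and only pays the merge cost at the end.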