Hive result fetch from HDFS is too slow because of too many map tasks: how can I merge the query results when executing a Hive SQL query?
The Hive query produces too many result files under "/tmp/hive/hive": close to 40,000 ("4W") map tasks, even though the query only returns a little over 100 rows in total. So I wonder if there is a way to merge the results after the query, to reduce the number of result files and make pulling the results more efficient.
Here is the EXPLAIN output for the query:
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: kafka_program_log |
| filterExpr: ((msg like '%disk loss%') and (ds > '2022-05-01')) (type: boolean) |
| Statistics: Num rows: 36938084350 Data size: 11081425337136 Basic stats: PARTIAL Column stats: PARTIAL |
| Filter Operator |
| predicate: (msg like '%disk loss%') (type: boolean) |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| Select Operator |
| expressions: server (type: string), msg (type: string), ts (type: string), ds (type: string), h (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
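For reference, the query behind this plan can be read off the TableScan, Filter, and Select operators above; it is presumably something like the following (table and column names are taken directly from the plan):

SELECT server, msg, ts, ds, h
FROM kafka_program_log
WHERE msg LIKE '%disk loss%'
  AND ds > '2022-05-01';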
set mapred.max.split.size=2560000000;

Increase the maximum input split size so that a single map task processes more data, thereby reducing the number of map tasks and, with it, the number of result files.
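A minimal sketch of how this fits into a Hive session, assuming Hive's default CombineHiveInputFormat is in effect (it packs many small input files into a single split, bounded by this setting); the exact value is a tuning choice:

-- Upper bound on the bytes packed into one combined split (~2.5 GB here).
-- On Hadoop 2+ the equivalent property name is
-- mapreduce.input.fileinputformat.split.maxsize.
set mapred.max.split.size=2560000000;
-- Then re-run the query unchanged; the job should launch far fewer map
-- tasks, and therefore write far fewer result files under /tmp/hive/hive.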