Fetching Hive results from HDFS is too slow because of too many map tasks; how can I merge the query results when running a Hive SQL query?

Posted on 2025-01-28 04:15:28

The Hive query produces close to 40,000 tasks and correspondingly many result files under "/tmp/hive/hive", even though the final result set is only a little over 100 rows. Is there a way to merge the results after the query, so as to reduce the number of result files and make fetching the results more efficient?

Here is the EXPLAIN output for the query:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: kafka_program_log             |
|             filterExpr: ((msg like '%disk loss%') and (ds > '2022-05-01')) (type: boolean) |
|             Statistics: Num rows: 36938084350 Data size: 11081425337136 Basic stats: PARTIAL Column stats: PARTIAL |
|             Filter Operator                        |
|               predicate: (msg like '%disk loss%') (type: boolean) |
|               Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
|               Select Operator                      |
|                 expressions: server (type: string), msg (type: string), ts (type: string), ds (type: string), h (type: string) |
|                 outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|                 Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.TextInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+--+



Comments (2)

还在原地等你 2025-02-04 04:15:28

set mapred.max.split.size=2560000000;

Increase the maximum split size so that a single map task processes more data, thereby reducing the number of map tasks.
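A complementary approach, not mentioned in this answer but worth sketching, is Hive's built-in small-file merging, which appends a merge job after the query so the output lands in fewer, larger files. The property names below are standard Hive settings; the size thresholds are illustrative values to tune for your cluster:

```sql
-- Ask Hive to merge small output files after the job finishes.
SET hive.merge.mapfiles=true;                -- merge outputs of map-only jobs (like this query)
SET hive.merge.mapredfiles=true;             -- also merge outputs of map-reduce jobs
SET hive.merge.size.per.task=256000000;      -- target size of each merged file (~256 MB)
SET hive.merge.smallfiles.avgsize=128000000; -- trigger the merge when avg output file is below this
```

With these set, the ~40,000 per-mapper files would be consolidated by an extra merge stage before you fetch them.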

断爱 2025-02-04 04:15:28
  1. Recreate the table using ORC/Parquet and you'll get much better performance. This is your number-one priority for speeding things up.
  2. You are using a LIKE operator, which means scanning all the data. You may want to consider rewriting the query to use a join/where clause instead; it will run much faster. Here's an example of what you could do to make things better.
    with words as -- shortcut for a readable sub-query
    (
      select
        log.msg
      from
        kafka_program_log log
      lateral view explode(split(msg, ' ')) w as word -- one row per word in msg; assumes 'disk loss' appears in msg
      where
        word in ('disk', 'loss') -- keep only the words we care about
      and
        ds > '2022-05-01' -- keep only the dates we care about
      group by
        log.msg -- gather the msgs together
      having
        count(word) >= 2 -- only pull back msgs containing at least two of the words we are interested in
    ) -- end sub-query
    select
      *
    from kafka_program_log log
    inner join
      words on words.msg = log.msg -- this join should really reduce the data we examine
    where
      log.ds > '2022-05-01' -- repeat the partition filter on the outer scan
    and
      log.msg like '%disk loss%' -- LIKE is fine now, to confirm it is exactly what we are looking for
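Returning to the original question: the EXPLAIN above shows a map-only plan, so each of the ~40,000 mappers writes its own result file. A common way to cap the file count, offered here as a sketch rather than part of either answer, is to force a reduce stage with DISTRIBUTE BY and bound the reducer count; the output path and reducer count below are illustrative:

```sql
SET mapred.reduce.tasks=8;  -- cap the number of reducers, and hence output files

-- Hypothetical output directory; writing through reducers yields at most 8 files
INSERT OVERWRITE DIRECTORY '/tmp/hive/disk_loss_result'
SELECT server, msg, ts, ds, h
FROM kafka_program_log
WHERE ds > '2022-05-01'
  AND msg LIKE '%disk loss%'
DISTRIBUTE BY ds;  -- adds a shuffle so output is written by reducers, not 40k mappers
```

The trade-off is an extra shuffle, but fetching 8 files is far cheaper than fetching 40,000.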