How can I make Hive run MapReduce jobs concurrently?
I'm new to Hive and have run into a problem.
I have a table in Hive like this:
create table td(id int, time string, ip string, v1 bigint, v2 int, v3 int,
v4 int, v5 bigint, v6 int) PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' lines TERMINATED BY '\n' ;
And I run a query like this:
from td
INSERT OVERWRITE DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE DIRECTORY '/tmp/totaldistinct.out' select count(distinct v1)
INSERT OVERWRITE DIRECTORY '/tmp/distinctuin.out' select distinct v1
INSERT OVERWRITE DIRECTORY '/tmp/v4.out' select v4 , count(v1), count(distinct v1) group by v4
INSERT OVERWRITE DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1), count(distinct v1) group by v3, v4
INSERT OVERWRITE DIRECTORY '/tmp/v426.out' select count(v1), count(distinct v1) where v4=2 or v4=6
INSERT OVERWRITE DIRECTORY '/tmp/v3v426.out' select v3, count(v1), count(distinct v1) where v4=2 or v4=6 group by v3
INSERT OVERWRITE DIRECTORY '/tmp/v415.out' select count(v1), count(distinct v1) where v4=1 or v4=5
INSERT OVERWRITE DIRECTORY '/tmp/v3v415.out' select v3, count(v1), count(distinct v1) where v4=1 or v4=5 group by v3
It works, and the output is what I want.
But there is one problem: Hive generates 9 MapReduce jobs and runs them one by one.
I ran EXPLAIN on this query and got the following output:
STAGE DEPENDENCIES:
Stage-9 is a root stage
Stage-0 depends on stages: Stage-9
Stage-10 depends on stages: Stage-9
Stage-1 depends on stages: Stage-10
Stage-11 depends on stages: Stage-9
Stage-2 depends on stages: Stage-11
Stage-12 depends on stages: Stage-9
Stage-3 depends on stages: Stage-12
Stage-13 depends on stages: Stage-9
Stage-4 depends on stages: Stage-13
Stage-14 depends on stages: Stage-9
Stage-5 depends on stages: Stage-14
Stage-15 depends on stages: Stage-9
Stage-6 depends on stages: Stage-15
Stage-16 depends on stages: Stage-9
Stage-7 depends on stages: Stage-16
Stage-17 depends on stages: Stage-9
Stage-8 depends on stages: Stage-17
It seems that stages 9-17 correspond to MapReduce jobs 0-8.
But according to the EXPLAIN output above, stages 10-17 depend only on stage 9.
So my question is: why can't jobs 1-8 run concurrently?
Or how can I make jobs 1-8 run concurrently?
Thank you very much for your help!
Comments (1)
In hive-default.xml, there is a property named "hive.exec.parallel" that enables parallel execution of jobs. Its default value is "false"; set it to "true" to enable this behavior. A second property, "hive.exec.parallel.thread.number", controls the maximum number of jobs that can be executed in parallel.
For more details: https://issues.apache.org/jira/browse/HIVE-549
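As a minimal sketch, the two properties named above can also be set for a single session before running the query, assuming you are working in the Hive CLI or a similar client that accepts "set" statements. The thread count of 8 is only an illustrative value, and whether the jobs actually overlap also depends on the cluster having free capacity.

```sql
-- Allow independent stages of one query to be submitted in parallel
-- (session-level override; the value in hive-default.xml is left untouched).
set hive.exec.parallel=true;

-- Upper bound on the number of jobs run in parallel.
-- 8 is an illustrative value, chosen here to match the 8 stages that depend only on Stage-9.
set hive.exec.parallel.thread.number=8;

-- Then run the multi-insert query from the question in the same session, e.g.:
-- from td
-- INSERT OVERWRITE DIRECTORY '/tmp/total.out' select count(v1)
-- ...
```

With these settings in effect, stages that depend only on an already finished stage (here, stages 10-17 after Stage-9) are eligible to run at the same time.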