Optimizing a few GB of data in Hadoop Hive
I've been using Hive quite extensively now, and I was wondering if there was a way to improve the following workflow.
Every night, a tab-delimited uncompressed textfile dump from our Oracle cluster is written to HDFS, to be processed by Hive.
I load the table like this:
CREATE EXTERNAL TABLE ACCOUNTINGTABLE (
ts STRING,
duid STRING,
owner STRING,
hidden STRING,
lgroup STRING,
nbfiles INT,
length BIGINT,
replicas INT,
provenance STRING,
state STRING,
campaign STRING,
rlength BIGINT,
rnbfiles INT,
rowner STRING,
rgroup STRING,
rarchived STRING,
rsuspicious STRING,
name STRING,
ami STRING,
site STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/accounting/dump';
LOAD DATA INPATH '/user/accounting/dump_REPLACEWITHTIMESTAMP.lst' INTO TABLE ACCOUNTINGTABLE;
and then run multiple accounting summaries like this to produce text-based output for postprocessing:
set hive.exec.reducers.max=90;
CREATE EXTERNAL TABLE ACCOUNTINGTABLE_site_project_datatype_tag (
ts STRING,
project STRING,
datatype STRING,
tag STRING,
site STRING,
duids INT,
replicas INT,
nbfiles INT,
rnbfiles INT,
length BIGINT,
rlength BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/accounting/summary/REPLACEWITHTIMESTAMP/site_project_datatype_tag';
INSERT OVERWRITE TABLE ACCOUNTINGTABLE_site_project_datatype_tag
SELECT
'REPLACEWITHTIMESTAMP',
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1),
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1),
split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0],
site,
count(distinct duid),
sum(replicas),
sum(nbfiles),
sum(rnbfiles),
sum(length),
sum(rlength)
from
ACCOUNTINGTABLE
where
(
ami='project.datasetnumber.physicsshort.prodstep.datatype.version'
or
ami='project.runnumber.streamname.prodstep.datatype.version'
)
group by
'REPLACEWITHTIMESTAMP',
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1),
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1),
split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0],
site;
DROP TABLE ACCOUNTINGTABLE_site_project_datatype_tag;
Now:
The average file size of an Oracle dump is about 5GB (so not really that much), with about 250 million rows. The summaries are not larger than 1-2 MB.
The average Hive job as described above then takes about an hour to run. The map phase progresses really well and reaches 100% after about 15 minutes, but the reduce phase then takes almost 45 minutes while showing 100% the whole time.
We've gradually been adding more and more different summaries, and soon we'll hit the magic 24-hour limit for summary processing. Our infrastructure monitoring also shows that node utilisation is low (CPU ~30-40%, I/O ~10%).
I've tried playing with io.sort.mb, io.sort.factor, etc., but that almost always made things worse, so now I'm running with the Hadoop defaults (Cloudera distribution, by the way). The cluster has 12 nodes (8 cores each), with 24 GB RAM and 2 TB of disk per node, configured for 8 mappers and 8 reducers (6/6 on the namenode).
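For reference, those Hadoop parameters were being overridden per Hive session with set, roughly like this (the values shown are purely illustrative, not the ones that were actually tested):
-- session-level overrides passed through to the MapReduce job conf
set io.sort.mb=256;
set io.sort.factor=64;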
I also tried creating a temporary table as a compressed sequencefile with an INSERT INTO SELECT, but that INSERT just took too long...
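That attempt looked roughly like the sketch below (written here as a CTAS; the table name, codec and compression settings are assumptions, not the exact statement that was run):
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- stage the raw dump once as a block-compressed SequenceFile,
-- then point the summary queries at ACCOUNTINGTABLE_SEQ instead
CREATE TABLE ACCOUNTINGTABLE_SEQ
STORED AS SEQUENCEFILE
AS SELECT * FROM ACCOUNTINGTABLE;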
I have the suspicion that there might be something wrong with the workflow itself, and not only the cluster/configuration.
Any advice is gladly appreciated.
2 Answers
Have you tried partitions or indexes?
Partitioning can greatly speed up GROUP BY, because the data is physically divided across the nodes by partition, so the reduce work is greatly reduced and in some cases is nearly instant.
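As a rough sketch of that idea (the table name ACCOUNTINGTABLE_PART is made up, site is just one plausible partition key, and the data is loaded with a dynamic-partition insert from the existing external table):
CREATE TABLE ACCOUNTINGTABLE_PART (
ts STRING,
duid STRING,
owner STRING,
hidden STRING,
lgroup STRING,
nbfiles INT,
length BIGINT,
replicas INT,
provenance STRING,
state STRING,
campaign STRING,
rlength BIGINT,
rnbfiles INT,
rowner STRING,
rgroup STRING,
rarchived STRING,
rsuspicious STRING,
name STRING,
ami STRING)
PARTITIONED BY (site STRING);
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- may need raising if there are many distinct sites per node
set hive.exec.max.dynamic.partitions.pernode=1000;
-- the partition column (site) must come last in the SELECT list
INSERT OVERWRITE TABLE ACCOUNTINGTABLE_PART PARTITION (site)
SELECT ts, duid, owner, hidden, lgroup, nbfiles, length, replicas,
provenance, state, campaign, rlength, rnbfiles, rowner, rgroup,
rarchived, rsuspicious, name, ami, site
FROM ACCOUNTINGTABLE;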
Along with partitioning, you can use bucketing, and you can specify a sort order (SORTED BY) for the buckets.
Also look at creating views; that might help too, compared to querying the table without partitions.
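A minimal sketch of both ideas on the same schema (the table/view names, the choice of duid as bucketing column and the bucket count are all made up, and the column list is trimmed to the fields the summaries actually read):
-- bucketed, sorted staging table, trimmed to the columns the summaries use
CREATE TABLE ACCOUNTINGTABLE_BUCKETED (
duid STRING,
name STRING,
ami STRING,
site STRING,
replicas INT,
nbfiles INT,
rnbfiles INT,
length BIGINT,
rlength BIGINT)
CLUSTERED BY (duid) SORTED BY (duid) INTO 64 BUCKETS;
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE ACCOUNTINGTABLE_BUCKETED
SELECT duid, name, ami, site, replicas, nbfiles, rnbfiles, length, rlength
FROM ACCOUNTINGTABLE;
-- a view can also hide the repeated regexp_extract() logic so each summary stays short
CREATE VIEW ACCOUNTINGVIEW AS
SELECT
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){1}', 1) AS project,
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){5}', 1) AS datatype,
split(regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){6}', 1), '_tid')[0] AS tag,
site, duid, replicas, nbfiles, rnbfiles, length, rlength, ami
FROM ACCOUNTINGTABLE;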