Hadoop PIG output not split into multiple files with the PARALLEL operator
Looks like I'm missing something. The number of reducers I set creates that many part files in HDFS, but my data is not split across them. What I noticed is that if I do a group by on a key that is in sequential order, it works fine; for example, the data below splits nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code I used, which works fine for the first input but not for the second, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage();
The above code creates two output part files. For the first input it splits the data nicely, putting key 1 in part-r-00000 and key 2 in part-r-00001. But for the second input it also creates two part files, yet all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files, with all the data for key 1 in the part-0 file and all the data for key 3 in the part-3 file. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
1 answer:
That's because the reducer a key goes to is determined by hash(key) % reducersCount. If the key is an integer, hash(key) == key. With two reducers, keys 1 and 3 both land in the same bucket (1 % 2 == 1 and 3 % 2 == 1), so all the data ends up in one part file, whereas keys 1 and 2 fall into different buckets (1 % 2 == 1, 2 % 2 == 0) and split across both files. When you have more data, the keys will be distributed more or less evenly, so you shouldn't worry about it.
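As a rough sketch of that modulo arithmetic, here is a hypothetical Python analogue of Hadoop's default HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The exact hash in a real job depends on the key's Java type, so the specific part-file numbers may differ from what you observe, but the collision pattern is the same:

```python
def partition(key, num_reducers):
    # Mimics Hadoop's default HashPartitioner:
    #   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    # For small non-negative integers, hash(key) == key in Python as well.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# First dataset: keys 1 and 2 with PARALLEL 2 fall into different buckets,
# so each key goes to its own part file.
print(partition(1, 2), partition(2, 2))  # prints: 1 0

# Second dataset: keys 1 and 3 with PARALLEL 2 collide in the same bucket,
# so all the data lands in a single part file while the other stays empty.
print(partition(1, 2), partition(3, 2))  # prints: 1 1
```

With only two distinct keys and a hash that happens to put them in the same bucket, one reducer gets everything; with many distinct keys, the modulo spreads them roughly evenly.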