Hadoop PIG output not split into multiple files with the PARALLEL operator
Looks like I'm missing something. The number of reducers I set creates that many part files in HDFS, but my data is not split across them. What I noticed is that if I do a group by on a key that is in sequential order, it works fine; for example, the data below splits nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code I used, which works fine for the first input but not for the second, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage();
The above code creates two output part files. For the first input it splits the data nicely, putting key 1 in part-r-00000 and key 2 in part-r-00001. But for the second input it also creates two part files, yet all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files, with all the data for key 1 in the part-0 file and all the data for key 3 in the part-3 file. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
1 answer:
That's because the reducer a key goes to is determined by hash(key) % reducersCount. If the key is an integer, hash(key) == key. With two reducers, keys 1 and 3 both land in the same bucket (1 % 2 == 1 and 3 % 2 == 1), so all the data ends up in one part file, whereas keys 1 and 2 fall into different buckets (1 % 2 == 1, 2 % 2 == 0) and split across both files. When you have more data, the keys will be distributed more or less evenly, so you shouldn't worry about it.
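As a rough sketch of that modulo arithmetic, here is a hypothetical Python analogue of Hadoop's default HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The exact hash in a real job depends on the key's Java type, so the specific part-file numbers may differ from what you observe, but the collision pattern is the same:

```python
def partition(key, num_reducers):
    # Mimics Hadoop's default HashPartitioner:
    #   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    # For small non-negative integers, hash(key) == key in Python as well.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# First dataset: keys 1 and 2 with PARALLEL 2 fall into different buckets,
# so each key goes to its own part file.
print(partition(1, 2), partition(2, 2))  # prints: 1 0

# Second dataset: keys 1 and 3 with PARALLEL 2 collide in the same bucket,
# so all the data lands in a single part file while the other stays empty.
print(partition(1, 2), partition(3, 2))  # prints: 1 1
```

With only two distinct keys and a hash that happens to put them in the same bucket, one reducer gets everything; with many distinct keys, the modulo spreads them roughly evenly.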