How to "re-group" a Pig relation?

Posted on 2024-12-07 06:50:14

Assume I have an input file input.dat that looks like this:

apples 10
oranges 30
apples 6
pears 5

Now, when I load, group, and project the data:

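-- Load the two-column input with an explicit schema.
sources = LOAD 'input.dat' AS (a:chararray, b:int);
-- One group per distinct value of a.
grouped = GROUP sources BY a;
-- Sum b within each group.
projection = FOREACH grouped GENERATE flatten(group), SUM(sources.b);
DUMP projection;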

I get the following:

apples 16
oranges 30
pears 5

Now, I want to "re-group" the data so that every row whose SUM(sources.b) is below some threshold is merged into a single line. As an example, if the threshold were 20, I would get:

other 21
oranges 30

because the sum for both "apples" and "pears" was below the threshold of 20.

It seems to me that I can follow a couple of different approaches:

  1. Use the SPLIT operator on grouped to create two relations: above_threshold and below_threshold. Then project below_threshold to replace the value of a with "other" and regroup. Finally UNION that result together with above_threshold and then run the final projection again.
  2. Or, follow the original script exactly, but when creating projection, generate a conditionally (based on SUM(sources.b)), then re-group projection (to group all of the "other" rows together), and then project again (to flatten the re-grouped data).
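As a rough sketch of approach 2, the conditional relabeling could use Pig's bincond operator. This assumes input.dat is tab-delimited (the PigStorage default); the relation names summed, relabeled, regrouped and result are illustrative only, and the threshold of 20 is hard-coded.

```pig
-- Sketch of approach 2 (assumption: tab-delimited input, threshold fixed at 20).
sources   = LOAD 'input.dat' AS (a:chararray, b:int);
grouped   = GROUP sources BY a;
summed    = FOREACH grouped GENERATE group AS n, SUM(sources.b) AS s;

-- Replace the key with 'other' when the per-key sum is below the threshold.
relabeled = FOREACH summed GENERATE (s < 20 ? 'other' : n) AS n, s;

-- Second grouping pass: all 'other' rows collapse into one group.
regrouped = GROUP relabeled BY n;
result    = FOREACH regrouped GENERATE group AS n, SUM(relabeled.s) AS s;
DUMP result;
```

For the sample input this should produce one row for other with 21 and one for oranges with 30, in no guaranteed order. Note that the second GROUP runs over every key, not only the ones below the threshold.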

Is one of the above approaches clearly better than the other? Or is there another approach that will be more efficient or easier to maintain?

Comments (1)

耳根太软 2024-12-14 06:50:14

Option 1 is better, because only the below_threshold data has to go through a second M/R aggregation, whereas option 2 appears to regroup everything.

Also, there are a few good things about approach 1, most notably:

  • The below_threshold aggregation should be fast: it needs only 1 reducer, and with a single key the combiner can collapse most of the data on the map side.
  • Depending on your application, you may not need to UNION at all. You can write the two relations to two locations and then "union" them outside of Pig by treating them as one output. For example, you can still run hadoop fs -getmerge my_out/*/part-r-* output to grab both outputs.

So, I see your Pig script looking like:

-- Load, group by key, and sum per key.
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
projection = FOREACH grouped GENERATE flatten(group) AS n, SUM(sources.b) AS s;
-- Route each per-key sum by the threshold.
SPLIT projection INTO above_threshold IF s >= 20, below_threshold IF s < 20;
DUMP above_threshold;

-- Group everything below the threshold under the constant key 'other'.
below_grouped = GROUP below_threshold BY 'other' PARALLEL 1;
below_projection = FOREACH below_grouped GENERATE group, SUM(below_threshold.s);
DUMP below_projection;
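
If a single combined relation is wanted after all, the two branches could be recombined with UNION. The following is only a sketch: it assumes tab-delimited input, uses GROUP ... ALL in place of grouping by the constant 'other', and the names summed, below_total and combined are illustrative rather than taken from the script above.

```pig
-- Sketch: option 1 with an explicit UNION (assumption: threshold fixed at 20).
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
summed  = FOREACH grouped GENERATE group AS n, SUM(sources.b) AS s;

SPLIT summed INTO above_threshold IF s >= 20, below_threshold IF s < 20;

-- Collapse every below-threshold row into a single group and label it 'other'.
below_grouped = GROUP below_threshold ALL;
below_total   = FOREACH below_grouped GENERATE 'other' AS n, SUM(below_threshold.s) AS s;

-- Both relations now share the same two-column layout (n, s), so they can be unioned.
combined = UNION above_threshold, below_total;
DUMP combined;
```

When the merge is done outside Pig instead, the DUMP statements would be replaced by one STORE per branch into sibling output directories, which is the layout the getmerge glob above relies on.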