How do I "re-group" a Pig relation?
Assume I have an input file input.dat that looks like this:
apples 10
oranges 30
apples 6
pears 5
Now, when I load, group, and project the data:
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
projection = foreach grouped generate flatten(group), SUM(sources.b);
dump projection;
I get the following:
apples 16
oranges 30
pears 5
Now, I want to "re-group" the data where the SUM(sources.b) is below some threshold into a single line. As an example, if the threshold was 20, I would get:
other 21
oranges 30
because the sum for both "apples" and "pears" was below the threshold of 20.
It seems to me that I can follow a couple of different approaches:
- Use the SPLIT operator on grouped to create two relations: above_threshold and below_threshold. Then project below_threshold to replace the value of a with "other" and regroup. Finally, UNION that result together with above_threshold and then run the final projection again.
- Or, follow the original script exactly, but when creating projection, generate a conditionally (based on SUM(sources.b)), then re-group projection (to group all of the "other" rows together), and then project again (to flatten the re-grouped data).
Is one of the above approaches clearly better than the other? Or is there another approach that will be more efficient or easier to maintain?
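For concreteness, the second approach can be sketched roughly as follows. This is only an illustration under assumptions: the relation names totals, relabeled, regrouped, and result are hypothetical, the threshold of 20 is hard-coded, and input.dat is assumed to be tab-delimited (the default for LOAD without a USING clause).

```pig
-- Sketch of the conditional-relabel approach; names other than sources/grouped are hypothetical.
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;

-- One row per key with its total.
totals = FOREACH grouped GENERATE group AS a, SUM(sources.b) AS total;

-- Replace the key with 'other' when the total is below the threshold (20 here).
relabeled = FOREACH totals GENERATE (total < 20 ? 'other' : a) AS a, total;

-- Regroup so that all 'other' rows collapse into one, then sum again.
regrouped = GROUP relabeled BY a;
result = FOREACH regrouped GENERATE group AS a, SUM(relabeled.total) AS total;

DUMP result;
```

Note that the second GROUP runs over every key, not only the ones below the threshold.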
Comments (1)
Option 1 is better. Option 1 only has to pass the below_threshold data into a M/R record count, while option 2 appears to regroup everything.
Also, there are a few good things about approach 1, most notably:
- The below_threshold count is going to be pretty fast, because you only need 1 reducer and the combiner is going to do wonders with only one key.
- You may not need the UNION at all. You can just output to two locations and then "union" them by treating them as the same output outside of Pig. For example, you can still do hadoop fs -getmerge my_out/*/part-r-* output to grab both outputs.
So, I see your Pig script looking like:
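A minimal sketch of option 1 along the lines described above (not the answerer's original script) might look like the following. The assumptions here are: the relation names totals, below_all, and other and the output paths my_out/above and my_out/other are hypothetical, the threshold of 20 is hard-coded, input.dat is tab-delimited, and the SPLIT is applied to the per-key totals so that its condition can refer to a plain column.

```pig
-- Sketch only; relation names and output paths are illustrative assumptions.
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
totals = FOREACH grouped GENERATE group AS a, SUM(sources.b) AS total;

-- Route each key to one of two relations based on its total.
SPLIT totals INTO below_threshold IF total < 20, above_threshold IF total >= 20;

-- GROUP ... ALL places every below-threshold row under a single key,
-- so one reducer (helped by the combiner) produces the 'other' row.
below_all = GROUP below_threshold ALL;
other = FOREACH below_all GENERATE 'other' AS a, SUM(below_threshold.total) AS total;

-- Two outputs instead of a UNION; they can be merged outside of Pig.
STORE above_threshold INTO 'my_out/above';
STORE other INTO 'my_out/other';
```

With this layout, the getmerge command mentioned above would pick up the part files from both subdirectories of my_out.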