Equivalent of Linux 'diff' in Apache Pig
I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.
A = load 'A' as (line);
B = load 'B' as (line);
-- Full outer join on the line text: a null on either side marks a line that exists in only one file.
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
-- Emit the surviving line plus a tag showing which side it was missing from.
DIFF2 = FOREACH DIFF GENERATE (A::line is null ? B::line : A::line), (A::line is null ? 'REMOVED' : 'ADDED');
STORE DIFF2 into 'diff';
Anyone got any better ways to do this?
1 Answer
I use the following approaches. (My JOIN approach is very similar, but that method does not replicate the behavior of diff with duplicated lines.) As this was asked some time ago, perhaps you were using only one reducer, since Pig only got an algorithm to adjust the number of reducers in 0.8?

The UNION approach works like the diff(1) tool and will return the correct number of extra duplicates for the correct file. Unlike the diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff, while the UNION approach performs sort <foo> | diff).
Using JOIN:
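For reference, here is a minimal sketch of how the JOIN approach can be written, following the shape of the script in the question (the input paths 'A' and 'B' come from the question; the use of TextLoader(), the relation names, and the output paths 'only_in_A' and 'only_in_B' are illustrative):

A = LOAD 'A' USING TextLoader() AS (line:chararray);   -- TextLoader keeps each whole line as one field
B = LOAD 'B' USING TextLoader() AS (line:chararray);

-- Full outer join on the line text; a null on either side means the line
-- appears in only one of the two files.
joined = JOIN A BY line FULL OUTER, B BY line;

-- Route each one-sided row to its own output.
SPLIT joined INTO only_a IF B::line IS NULL,
                  only_b IF A::line IS NULL;

a_lines = FOREACH only_a GENERATE A::line;
b_lines = FOREACH only_b GENERATE B::line;

STORE a_lines INTO 'only_in_A';
STORE b_lines INTO 'only_in_B';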
Using UNION:
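And a minimal sketch of the UNION idea described above: tag every line with the file it came from, group identical lines, and compare the per-file counts so that extra duplicates are attributed to the right file (the paths, relation names, and the ONLY_IN_A/ONLY_IN_B labels here are placeholders):

a_raw = LOAD 'A' USING TextLoader() AS (line:chararray);
b_raw = LOAD 'B' USING TextLoader() AS (line:chararray);

-- Tag each line with its source file.
a_tagged = FOREACH a_raw GENERATE line, 'A' AS src;
b_tagged = FOREACH b_raw GENERATE line, 'B' AS src;

-- Combine both inputs and group identical lines together.
both    = UNION a_tagged, b_tagged;
grouped = GROUP both BY line;

-- Count how many copies of each line each file contributed.
counted = FOREACH grouped {
    in_a = FILTER both BY src == 'A';
    in_b = FILTER both BY src == 'B';
    GENERATE group AS line, COUNT(in_a) AS cnt_a, COUNT(in_b) AS cnt_b;
};

-- Keep only lines whose counts differ; the larger count says which file
-- holds the extra copies, and the difference says how many.
mismatched = FILTER counted BY cnt_a != cnt_b;
diff_lines = FOREACH mismatched GENERATE line,
                 (cnt_a > cnt_b ? 'ONLY_IN_A' : 'ONLY_IN_B') AS side,
                 (cnt_a > cnt_b ? cnt_a - cnt_b : cnt_b - cnt_a) AS extra_copies;

STORE diff_lines INTO 'diff_union';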
Performance
diff(1) only operates in-memory, while Hadoop leverages streaming from disk.