Equivalent of Linux 'diff' in Apache Pig

Posted 2024-11-05 06:57:56

I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.

A = LOAD 'A' AS (line);
B = LOAD 'B' AS (line);
-- A full outer join pairs up matching lines; a null on either side
-- means the line exists in only one of the two files.
JOINED = JOIN A BY line FULL OUTER, B BY line;
DIFF = FILTER JOINED BY A::line IS NULL OR B::line IS NULL;
DIFF2 = FOREACH DIFF GENERATE (A::line IS NULL ? B::line : A::line),
                              (A::line IS NULL ? 'REMOVED' : 'ADDED');
STORE DIFF2 INTO 'diff';

Anyone got any better ways to do this?

云仙小弟 2024-11-12 06:57:56

I use the two approaches below. (My JOIN approach is very similar to yours, but it does not replicate diff's behavior for duplicated lines.) Since this was asked some time ago: perhaps you were running with only one reducer? Pig gained an algorithm for adjusting the number of reducers in 0.8.
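
For reference, if an older Pig leaves you with a single reducer, you can raise the reducer count yourself. A minimal sketch using the relations from the question (the count of 18 is an arbitrary assumption):

SET default_parallel 18;  -- script-wide default for all reduce-side operators

-- Or per-operator, via the PARALLEL clause
JOINED = JOIN A BY line FULL OUTER, B BY line PARALLEL 18;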

  • The two approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same way
  • The JOIN approach collapses duplicates (so if one file has more duplicates of a line than the other, this approach will not output the extra duplicates)
  • The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
  • Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff, while UNION performs sort <foo.txt> | diff)
  • If you have a very large (~thousands of copies) number of duplicate lines, things will slow down due to the joins (if your use case allows, perform a DISTINCT on the raw data first)
  • If your lines are very long (e.g. >1KB in size), it is better to use the DataFu MD5 UDF and diff over hashes only, then JOIN the hash-only results back with your original files to recover the rows before outputting; see the sketch right after this list
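
Along the lines of the last bullet, here is a hedged sketch of the hash-first variant. The jar and file names are assumptions; datafu.pig.hash.MD5 is DataFu's MD5 UDF:

REGISTER 'datafu.jar';  -- placeholder jar name
DEFINE MD5 datafu.pig.hash.MD5();

a = LOAD 'a.csv' AS (First: chararray);
b = LOAD 'b.csv' AS (Second: chararray);

-- Diff over fixed-size hashes instead of the long rows themselves
-- (with many duplicate lines you may want a DISTINCT on the hashes first)
a_h = FOREACH a GENERATE MD5(First) AS H1;
b_h = FOREACH b GENERATE MD5(Second) AS H2;
hashed = JOIN a_h BY H1 FULL OUTER, b_h BY H2;
SPLIT hashed INTO first_h IF H2 IS NULL,
                  second_h IF H1 IS NULL;

-- Join the surviving hashes back to recover the original rows
a_keyed = FOREACH a GENERATE MD5(First) AS H, First;
b_keyed = FOREACH b GENERATE MD5(Second) AS H, Second;
first_join = JOIN first_h BY H1, a_keyed BY H;
first_only = FOREACH first_join GENERATE First;
second_join = JOIN second_h BY H2, b_keyed BY H;
second_only = FOREACH second_join GENERATE Second;

STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();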

Using JOIN:

SET job.name 'Diff(1) Via Join';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
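
Both scripts assume the hadoop-lzo codec and the elephant-bird jars are already registered. The jar names in this preamble are placeholders for whatever your cluster ships:

REGISTER 'hadoop-lzo.jar';          -- placeholder: LZO codec
REGISTER 'elephant-bird-core.jar';  -- placeholder: elephant-bird core
REGISTER 'elephant-bird-pig.jar';   -- placeholder: Pig loaders such as LzoPigStorage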

Using UNION:

SET job.name 'Diff(1)';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
-- Placeholder bag for lines with equal counts in both files; its File
-- tag of 0 matches neither branch of the SPLIT below, so it is dropped.
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             -- Emit the surplus copies from whichever file has more of this
             -- line; TOP(n, 0, bag) simply pulls n tuples out of the bag.
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
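
If you would rather emit a single ADDED/REMOVED output like the script in the question, you can tag and union the two results. A small sketch reusing the relations above (the label convention follows the question: lines only in the first file are ADDED, lines only in the second are REMOVED):

first_tagged = FOREACH first_only GENERATE Row, 'ADDED' AS Op;
second_tagged = FOREACH second_only GENERATE Row, 'REMOVED' AS Op;
diff_out = UNION first_tagged, second_tagged;
STORE diff_out INTO 'diff' USING PigStorage();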

Performance

  • It takes roughly 10 minutes to diff over 200GB (1,055,687,930 rows) using LZO-compressed input on 18 nodes.
  • Each approach needs only one Map/Reduce cycle.
  • This works out to roughly 1.8GB diffed per node per minute (not a great throughput, but on my system diff(1) appears to operate entirely in memory, while Hadoop relies on streaming disks).