How do I load a file into a DataBag from within a Yahoo PigLatin UDF?

Posted 2024-08-30 02:53:01


I have a Pig program where I am trying to compute the minimum center between two bags. For it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or be able to pass another relation into the UDF without needing a COGROUP.

Code:

-- **** Load files for iteration ****
register myudfs.jar;
wordcounts = LOAD 'input/wordcounts.txt' USING PigStorage('\t') AS (PatentNumber:chararray, word:chararray, frequency:double);
centerassignments = LOAD 'input/centerassignments/part-*' USING PigStorage('\t') AS (PatentNumber:chararray, oldCenter:chararray, newCenter:chararray);
kcenters = LOAD 'input/kcenters/part-*' USING PigStorage('\t') AS (CenterID:chararray, word:chararray, frequency:double);
kcentersa1 = CROSS centerassignments, kcenters;
kcentersa = FOREACH kcentersa1 GENERATE centerassignments::PatentNumber AS PatentNumber, kcenters::CenterID AS CenterID, kcenters::word AS word, kcenters::frequency AS frequency;

-- ***** Assign to nearest k-mean *****
assignpre1 = COGROUP wordcounts BY PatentNumber, kcentersa BY PatentNumber;
assignwork2 = FOREACH assignpre1 GENERATE group AS PatentNumber, myudfs.kmeans(wordcounts, kcentersa) AS CenterID;

Basically my issue is that for each patent I need to pass the sub-relations (wordcounts, kcenters). To do this, I do a CROSS and then a COGROUP by PatentNumber in order to get the set PatentNumber, {wordcounts}, {kcenters}. If I could figure out a way to pass a relation, or open up the centers from within the UDF, then I could just GROUP wordcounts by PatentNumber and run myudfs.kmeans(wordcounts), which would hopefully be much faster without the CROSS/COGROUP.

This is an expensive operation. Currently it takes about 20 minutes and appears to tax the CPU/RAM. I was thinking it might be more efficient without the CROSS. I'm not sure it will be faster, so I'd like to experiment.

Anyway, it looks like calling the loading functions from within Pig needs a PigContext object, which I don't get from an EvalFunc. And to use the Hadoop filesystem, I need some initial objects as well, which I don't see how to get. So my question is: how can I open a file on the Hadoop filesystem from within a Pig UDF? I also run the UDF via main for debugging, so I need to load from the normal filesystem when in debug mode.
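A minimal sketch of one way to do this (an assumption on my part, not from the post: the class name `KMeansAssign`, the constructor argument, and the parsing left as a comment are illustrative). Inside an EvalFunc, a `FileSystem` obtained from a default `Configuration` resolves to HDFS on the cluster, where `core-site.xml` is on the classpath, and falls back to the local filesystem when run standalone, which covers the debug-via-main case:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class KMeansAssign extends EvalFunc<String> {
    // Hypothetical: the centers path is passed in from the Pig script, e.g.
    // DEFINE kmeans myudfs.KMeansAssign('input/kcenters/part-00000');
    private final String centersPath;

    public KMeansAssign(String centersPath) {
        this.centersPath = centersPath;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        // With no explicit settings, FileSystem.get() resolves fs.default.name
        // from core-site.xml on the cluster (HDFS) and defaults to file:/// locally.
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(centersPath))));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // parse "CenterID \t word \t frequency" and build the centers map ...
            }
        } finally {
            reader.close();
        }
        DataBag wordcounts = (DataBag) input.get(0);
        // ... compute and return the nearest CenterID for this patent's bag
        return null;
    }
}
```

Note that opening the file inside `exec()` re-reads it for every input tuple; in practice the read would be done once and cached (see below for one pattern).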

Another, better idea would be if there were a way to pass a relation into a UDF without needing to CROSS/COGROUP. This would be ideal, particularly if the relation resides in memory, i.e. being able to do myudfs.kmeans(wordcounts, kcenters) without needing the CROSS/COGROUP with kcenters.
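One common workaround for this (a sketch, not from the original post; the `distance` helper and field layout are hypothetical): since kcenters is small, load it lazily into a member map on the first call to `exec()`, so it is read once per map/reduce task rather than shipped through a CROSS/COGROUP, and the script only needs `GROUP wordcounts BY PatentNumber`:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class KMeans extends EvalFunc<String> {
    // CenterID -> (word -> frequency); loaded once per task, reused per tuple.
    private Map<String, Map<String, Double>> centers = null;

    @Override
    public String exec(Tuple input) throws IOException {
        if (centers == null) {
            centers = new HashMap<String, Map<String, Double>>();
            // Load the (small) kcenters file here, once per task,
            // e.g. via Hadoop's FileSystem API or piggybank's LookupInFiles.
        }
        DataBag wordcounts = (DataBag) input.get(0);
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> c : centers.entrySet()) {
            double d = distance(wordcounts, c.getValue()); // hypothetical helper
            if (d < bestDist) {
                bestDist = d;
                best = c.getKey();
            }
        }
        return best;
    }

    private double distance(DataBag doc, Map<String, Double> center) {
        // ... e.g. squared Euclidean distance over the shared word vocabulary
        return 0.0;
    }
}
```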

But the basic idea is to trade IO for RAM/CPU cycles.

Anyway, any help will be much appreciated; Pig UDFs aren't super well documented beyond the most simple ones, even in the UDF manual.


Comments (1)

小镇女孩 2024-09-06 02:53:01


Cervo,
There's a UDF in the piggybank that does more or less what you want, called LookupInFiles. Check out the source code; it should be pretty straightforward to adapt to your needs.

http://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/LookupInFiles.java

Please email the list if you have any other issues, documentation suggestions, etc.
