How to read static files in a Pig UDF

Posted on 2024-10-18 23:18:31

I am new to Pig and Hadoop. I have written a Pig UDF that takes a string and returns a string. The business logic lives in a class from an existing jar, which the UDF calls. That class's constructor takes two filenames as input and uses those files to build dictionaries for processing the input. When I pass the filenames in Pig local mode, it works fine, but I don't know how to make it work in mapreduce mode. Can the distributed cache solve the problem?

Here is my script:

REGISTER tokenParser.jar;

REGISTER sampleudf.jar;


DEFINE TOKENPARSER com.yahoo.sample.ParseToken('conf/input1.txt','conf/input2.xml');

A = LOAD './inputHOP.txt' USING PigStorage() AS (tok:chararray);
B = FOREACH A GENERATE TOKENPARSER(tok);
STORE B into 'newTokout' USING PigStorage();

As far as I understand, tokenParser.jar must be reading these files with some sort of BufferedReader. Is it possible to make this work without changing tokenParser.jar?
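A possible direction, sketched in plain Java so that it runs without Pig on the classpath: build the parser on the first call rather than in the UDF constructor. The assumption here is that Pig also instantiates UDFs on the client while planning the job, where files shipped through the distributed cache are not present, whereas inside a task a cached file shows up in the working directory under its symlink name. `DictionaryParser` is a hypothetical stand-in for the class inside tokenParser.jar (the question does not show its real name or behavior), and `LazyTokenParser` is an illustrative name as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the third-party class: it reads a file in its constructor. */
class DictionaryParser {
    private final Set<String> words = new HashSet<>();

    DictionaryParser(String dictFile) throws IOException {
        // Opens the path as an ordinary local file, as the real jar presumably does.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(dictFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                words.add(line.trim());
            }
        }
    }

    /** Illustrative logic only: upper-case tokens found in the dictionary. */
    String parse(String token) {
        return words.contains(token) ? token.toUpperCase() : token;
    }
}

/** Wrapper that stores the filename and touches the file only on first use. */
class LazyTokenParser {
    private final String dictFile;
    private DictionaryParser parser; // stays null until exec() is first called

    LazyTokenParser(String dictFile) {
        this.dictFile = dictFile; // no file access here
    }

    String exec(String token) throws IOException {
        if (parser == null) {
            parser = new DictionaryParser(dictFile);
        }
        return parser.parse(token);
    }

    public static void main(String[] args) throws IOException {
        // Simulate a file that the distributed cache would have placed locally.
        Path dict = Files.createTempFile("input1", ".txt");
        Files.write(dict, Arrays.asList("foo", "bar"));

        LazyTokenParser udf = new LazyTokenParser(dict.toString());
        System.out.println(udf.exec("foo"));
        System.out.println(udf.exec("baz"));
    }
}
```

In a real UDF the same pattern would sit inside `exec(Tuple)` of an `EvalFunc` subclass, so tokenParser.jar itself stays unchanged. Pig's `EvalFunc` also offers a `getCacheFiles()` hook that returns entries of the form `hdfs-path#symlink`; assuming the two files have been copied to HDFS first, returning them from that hook and passing only the symlink names to the parser may be enough, though this has not been verified against the setup in the question.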
