Split input into substrings in Pig (Hadoop)
Assume I have the following input in Pig:
some
And I would like to convert that into:
s
so
som
some
I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but it splits on word boundaries.
So can "pig latin" do this, or is this something that requires a Java class?
3 Answers
Niels,
TOKENIZE takes a delimiter argument, so you can make it split each letter; however I can't think of a way to make it produce overlapping tokens.
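For reference, a minimal sketch of the delimiter form (note: two-argument TOKENIZE only exists in later Pig releases, and the alias names here are made up, not from the answer):

    -- Split each line on a pipe character and flatten the resulting bag.
    lines  = LOAD 'words' AS (word:chararray);
    tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(word, '|')) AS token;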
It's pretty straightforward to write a UDF in Pig, though. You just implement a simple interface called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual). Pig was built around the idea of users writing their own functions to process almost anything, and writing your own UDF is therefore a common and natural thing to do.
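For illustration, a minimal sketch of such a UDF that emits every prefix of its input as a bag; the class name PrefixBag and the surrounding aliases are assumptions, not code from the answer:

    // PrefixBag.java -- a sketch of an EvalFunc that returns a bag of prefixes.
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PrefixBag extends EvalFunc<DataBag> {
        @Override
        public DataBag exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null)
                return null;
            String word = (String) input.get(0);
            DataBag bag = BagFactory.getInstance().newDefaultBag();
            // For "some", emit: s, so, som, some
            for (int i = 1; i <= word.length(); i++) {
                Tuple t = TupleFactory.getInstance().newTuple(1);
                t.set(0, word.substring(0, i));
                bag.add(t);
            }
            return bag;
        }
    }

Registered and invoked from Pig, it would look something like:

    REGISTER myudfs.jar;
    words    = LOAD 'words' AS (word:chararray);
    prefixes = FOREACH words GENERATE FLATTEN(PrefixBag(word));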
An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.
Here is how you might do it with Pig streaming and Python, without writing a custom UDF:
Suppose your data is just one column of words. The Python script (let's call it wordSeq.py) to process things would be:
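(The original script is not reproduced here; the following is a sketch of what it might look like.)

    #!/usr/bin/env python
    # wordSeq.py -- emit every prefix of each word read from stdin.
    import sys

    for line in sys.stdin:
        word = line.strip()
        for i in range(1, len(word) + 1):
            print(word[:i])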
Then, in your Pig script, you tell Pig that you are streaming through the above script and that you want to ship the script as necessary:
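(Again a sketch; the alias and file names are assumptions.)

    -- Declare the streaming command and ship the script to the cluster.
    DEFINE wordseq `python wordSeq.py` SHIP('wordSeq.py');
    words    = LOAD 'words' AS (word:chararray);
    prefixes = STREAM words THROUGH wordseq AS (prefix:chararray);
    DUMP prefixes;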
Use the piggybank library.
http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/piggybank/evaluation/string/SUBSTRING.html
Use it like this:
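(The answer's snippet is not reproduced here; a sketch of how the piggybank SUBSTRING might be invoked, with the jar path and aliases as assumptions:)

    REGISTER /path/to/piggybank.jar;
    DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

    words = LOAD 'words' AS (word:chararray);
    -- SUBSTRING(string, start, stop): e.g. the three-letter prefix of each word.
    pre3  = FOREACH words GENERATE SUBSTRING(word, 0, 3);

Note that SUBSTRING returns a single slice per call, so producing every prefix still requires enumerating prefix lengths, for example with a UDF or the streaming approach above.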