Split input into substrings in Pig (Hadoop)



Assume I have the following input in Pig:

some

And I would like to convert that into:

s
so
som
some

I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries.
So can Pig Latin do this, or is this something that requires a Java class?


3 Answers

你曾走过我的故事 2024-08-11 17:23:21


Niels,
TOKENIZE takes a delimiter argument, so you can make it split on each letter; however, I can't think of a way to make it produce overlapping tokens.

It's pretty straightforward to write a UDF in Pig, though. You just extend a simple abstract class called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual). Pig was built around the idea of users writing their own functions to process almost anything, and writing your own UDF is therefore a common and natural thing to do.
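
As an illustration, a minimal sketch of such a UDF might look like the following (untested; the class name Prefixes is my own, not something from the original post). It takes a chararray and returns a bag with one tuple per prefix:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch of an EvalFunc UDF: emits every prefix of its chararray argument.
public class Prefixes extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String word = (String) input.get(0);
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        // For "some" this adds ("s"), ("so"), ("som"), ("some") to the bag.
        for (int i = 1; i <= word.length(); i++) {
            bag.add(TupleFactory.getInstance().newTuple(word.substring(0, i)));
        }
        return bag;
    }
}

You would then REGISTER the jar and call it with something like FOREACH words GENERATE FLATTEN(Prefixes($0));.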

An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.

纵性 2024-08-11 17:23:21


Here is how you might do it with Pig streaming and Python, without writing custom UDFs:

Suppose your data is just one column of words. The Python script (let's call it wordSeq.py) to process things would be:

#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys
for word in sys.stdin:
  word = word.rstrip()
  # Emit every prefix of the word, one per line: s, so, som, some
  sys.stdout.write('\n'.join([word[:i+1] for i in xrange(len(word))]) + '\n')

Then, in your Pig script, you tell Pig that you are streaming through the above script and that the script should be shipped to the cluster as necessary:

-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);
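
For the question's input some, W will then contain the records s, so, som, and some.
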
緦唸λ蓇 2024-08-11 17:23:21


Use the piggybank library.

http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/piggybank/evaluation/string/SUBSTRING.html

Use it like this:

REGISTER /path/to/piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 0, 10);
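
Note that SUBSTRING extracts a single fixed range (here the first ten characters), so on its own it won't produce the growing list of prefixes from the question; for that you would still need something like the UDF or streaming approaches above.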