What is the process/code to remove "string expressions" from a file using Apache Pig?

Posted on 2024-12-26 14:25:01

A = load '/home/wrdtest.txt';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = filter B by word != 'the';

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';

I just want to filter the text. The third step filters the tokens and removes 'the' from the text file. But I want to remove a set of 499 words (stop words) from the text. I tried to use '|' (as OR), like this:

C = filter B by word != 'the|and|or'... but it didn't work.

Can you please advise on this? May I include a text file (such as stopwords.txt) in order to remove the stop words?

I am a novice Pig user.


Comments (2)

我不吻晚风 2025-01-02 14:25:01

Something like removing stop words is specialized enough that it is not going to be in the built-in functions. You'll need to write a user-defined function (UDF), which is quite simple to do.

-- load the data line by line
lines = LOAD 'datafile.txt' USING TextLoader() AS (line:chararray);

-- apply some sort of UDF that returns the exact line without the stop words
nostop = FOREACH lines GENERATE myudfs.removestop(line);

-- store the data out
STORE nostop INTO 'datafile_nostop.txt';

Pushing your list out to the tasks is another matter. If the list is relatively small, on the order of thousands of words, you can bake the stop words into your code (i.e., hardcode the list) so that it is always available. Otherwise, you could use the Distributed Cache to push the file out.
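
The hardcoded-list variant of such a UDF could be sketched in Python, which Pig can load through Jython. This is only a sketch under assumptions: the file name myudfs.py and the three-word list are placeholders, matching is made case-insensitive (the original filter was case-sensitive), and runs of whitespace are collapsed, so the output is not byte-for-byte the input line. The script would be registered with something like REGISTER 'myudfs.py' USING jython AS myudfs; before calling myudfs.removestop(line).

```python
# myudfs.py -- hypothetical file name, chosen to match the myudfs.removestop call.

# Pig's Jython engine is expected to supply the outputSchema decorator.
# The fallback exists only so the file can also be run with plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Placeholder list for illustration; the full stop-word list would go here.
STOP_WORDS = frozenset(["the", "and", "or"])


@outputSchema("line:chararray")
def removestop(line):
    """Return the line with every stop word dropped (case-insensitive)."""
    if line is None:
        return None
    # Split on whitespace, keep only the tokens that are not stop words.
    kept = [w for w in line.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)
```

Keeping the list in a frozenset makes each lookup constant-time, which matters little for 499 words but keeps the function cheap per line.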


With the additional information you provided, I can suggest an alternative approach. My UDF approach above is still valid, though.

This new approach involves loading your stop-word file and then effectively doing an anti-join to remove the words that match the list. You need to make sure stopwords.txt has one word per line for this to work. To do the anti-join (i.e., keep the items in one list that do not match the other list), I'll do a left outer join (using replicated), then keep only the rows where the stop word column is null (i.e., the word did not have a matching stop word).

A = load '/home/wrdtest.txt';

-- load the stop words list
SW = load '/home/stopwords.txt' as (stopword:chararray);    

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

-- join the data with a left outer join
-- using replicated is appropriate when the right relation (SW) is small
SW2 = join B by word LEFT OUTER, SW by stopword USING 'replicated';

-- keep rows where the stopword is null, meaning the word is not in the stopword list
C = filter SW2 by stopword IS NULL;

-- remove the stopword column that we don't need.
C = foreach C generate word;

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';
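
To make the anti-join logic concrete, here is a small plain-Python sketch of the same steps on in-memory data. It is an illustration only, not Pig code: the function name count_non_stop_words is made up, whitespace splitting stands in for TOKENIZE (which uses its own delimiter set), and ties are broken alphabetically, which the Pig script does not guarantee.

```python
from collections import Counter


def count_non_stop_words(lines, stop_words):
    """Mimic the script: tokenize, anti-join against stop words, count, sort."""
    stop = set(stop_words)
    # B: one token per row
    tokens = [word for line in lines for word in line.split()]
    # SW2: left outer join; the stopword column is None when nothing matches
    joined = [(word, word if word in stop else None) for word in tokens]
    # C: keep only the rows whose stopword column is null
    kept = [word for word, stopword in joined if stopword is None]
    # D, E: group by word and count
    counts = Counter(kept)
    # F: order by count descending (alphabetical tie-break is an assumption)
    pairs = [(n, w) for w, n in counts.items()]
    return sorted(pairs, key=lambda p: (-p[0], p[1]))
```

For example, count_non_stop_words(["the cat and the dog", "the cat"], ["the", "and", "or"]) yields the pairs (2, 'cat') and (1, 'dog'), in that order.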
旧伤还要旧人安 2025-01-02 14:25:01

I used the above solution by Donald Miner.

I modified the JOIN statement as follows:

SW2 = join B by word LEFT, SW by stopword;

and it works for me.
