What is the procedure/code to remove "string expressions" fetched from a file, using Apache Pig?
A = load '/home/wrdtest.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word != 'the';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/sample_data20';
I just want to filter the text. The 3rd step filters the text and removes 'the' from the text file. But I want to remove a set of 499 words (stop words) from the text. I tried to use '|' (as OR), like:
C = filter B by word != 'the|and|or';
but it didn't work.
Can you please suggest how I could include a text file (e.g., stopwords.txt) in order to remove the stop words?
I am a novice user of Pig.
Something like removing stop words is complicated enough that it is not going to be in the built-in functions. You'll need to write a user-defined function, which is quite simple to do.
Pushing that list of yours out to the tasks is another story. If the list is relatively small, on the order of thousands, you can bake the stop words into your code (i.e., hardcode the list) so that it is available. Otherwise, you could use the Distributed Cache to push the file out.
Given the additional information you provided, I can suggest an alternative approach. My approach above of using a UDF is still valid, though.
This new approach will involve loading your other file, then effectively doing an anti-join to remove things that match the list. You need to make sure stopwords.txt has one word per line in order for this to work. To do the anti-join (i.e., keep the things in one list that do not match the other list), I'll do a LEFT OUTER JOIN (using 'replicated'), then filter out the rows where the stop-word column is null (i.e., there was no matching stop word).
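A sketch of this anti-join approach, assuming the stop-word list lives at /home/stopwords.txt (a hypothetical path) with one word per line:

```
-- load the text and split it into words, as in the question
A = load '/home/wrdtest.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

-- load the stop-word list, one word per line
S = load '/home/stopwords.txt' as (stopword:chararray);

-- left outer join; 'replicated' keeps the small stop-word list in memory
J = join B by word left outer, S by stopword using 'replicated';

-- keep only the rows with no matching stop word
C = filter J by S::stopword is null;

-- the rest is unchanged from the question
D = group C by B::word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/sample_data20';
```

Note that after the join, fields must be disambiguated with the relation prefix (B::word, S::stopword).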
I used the above solution by Donald Miner. I modified the relation in the JOIN part as follows, and it works for me.
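(The original comment did not include the modified line. A plausible form, assuming the relation names B for the tokenized words and S for the stop-word list as in the earlier answer, would be:)

```
-- hypothetical reconstruction; relation and field names are assumed
J = join B by word left outer, S by stopword using 'replicated';
```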