Removing stop words in Java
I have a list of about 30 stop words and a set of articles.
I want to parse each article and remove those stop words from it.
I am not sure what the most efficient way to do this is.
For instance, I could loop through the stop list and replace each occurrence of a stop word in the article with whitespace, but that does not seem like a good approach.
Thanks
Comments (4)
java.util.Set
Replacing the words will be inefficient. Your best bet is probably to parse the article word by word and copy each word to a new StringBuffer, unless it is a stop word, in which case you copy whatever you want in its place. StringBuffer is much more efficient than String here, because a String is immutable and every replacement creates a new copy of the text.
How you store the stop words is probably unimportant if there are only thirty or so. A Set is probably a good bet.
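The word-by-word copy described above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name StopWordFilter, the sample stop words, and the choice to split on whitespace and compare in lower case are all hypothetical, and StringBuilder is used as the unsynchronized counterpart of StringBuffer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical sample list; the real one would hold the ~30 stop words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    static String removeStopWords(String article) {
        StringBuilder out = new StringBuilder(article.length());
        // Assumption: words are separated by whitespace; punctuation stays
        // attached to its word, so "the," would not count as a stop word.
        for (String word : article.trim().split("\\s+")) {
            if (STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // drop the stop word (or append a replacement here)
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "cat animal house"
        System.out.println(removeStopWords("The cat is an animal of the house"));
    }
}
```

Each word costs one hash lookup and the article is scanned once, regardless of how many stop words there are. Original spacing is not preserved in this sketch: runs of whitespace collapse to a single space.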
According to the Sun Java Tutorials, you can use the Perl-compatible \b word-boundary delimiter in your regular expressions. If you surround the word with \b on both sides, it will match only that whole word, whether it is preceded or followed by punctuation or whitespace.
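As a rough sketch of the \b word-boundary approach, one pattern can cover the whole stop list. The class name RegexStopWordFilter, the sample stop words, the case-insensitive flag, and the whitespace cleanup afterwards are assumptions made for illustration, not something the answer specifies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopWordFilter {
    // Hypothetical sample list standing in for the real stop words.
    static final List<String> STOP_WORDS = Arrays.asList("a", "an", "the", "is", "of");

    // Builds one alternation of the quoted stop words, wrapped in \b on both
    // sides so that only whole words match ("the" matches, "theme" does not).
    static final Pattern STOP_WORD_PATTERN = Pattern.compile(
            "\\b(?:"
                    + STOP_WORDS.stream().map(Pattern::quote).collect(Collectors.joining("|"))
                    + ")\\b",
            Pattern.CASE_INSENSITIVE);

    static String removeStopWords(String article) {
        String stripped = STOP_WORD_PATTERN.matcher(article).replaceAll("");
        // Collapse the extra whitespace left where stop words were removed.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "cat, animal."
        System.out.println(removeStopWords("The cat, is an animal."));
    }
}
```

Compiling the pattern once and reusing it for every article avoids rebuilding it per call. Punctuation next to a removed word is left in place, so some inputs may need further cleanup.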
Read a word from the input and copy it to your StringBuilder (or wherever you're putting the result) if and only if it's not in the list of stop words. You'll be able to search for them faster if you put the stop words into something like a HashTable.
Edit: Oops, I don't know what I was thinking, but you want a set, not a HashTable (or any other Dictionary).