Stemming English words with Lucene

Posted 2024-10-25 09:30:55 · 513 characters · 11 views · 0 comments

I'm processing some English texts in a Java application, and I need to stem them.
For example, from the text "amenities/amenity" I need to get "amenit".

The function looks like:

String stemTerm(String term){
   ...
}

I've found the Lucene Analyzer, but it looks way too complicated for what I need.
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/PorterStemFilter.html

Is there a way to use it to stem words without building an Analyzer? I don't understand all the Analyzer business...

EDIT: I actually need stemming + lemmatization. Can Lucene do this?

Comments (7)

打小就很酷 2024-11-01 09:30:55

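SnowballAnalyzer is deprecated; you can use the Porter stemmer bundled with Lucene instead:

 // Snowball-generated stemmer that ships with Lucene's analyzers
 import org.tartarus.snowball.ext.PorterStemmer;

 PorterStemmer stem = new PorterStemmer();
 stem.setCurrent(word);   // the word to stem
 stem.stem();             // stems the current word in place
 String result = stem.getCurrent();

Hope this helps!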

数理化全能战士 2024-11-01 09:30:55
import org.apache.lucene.analysis.PorterStemmer;
...
String stemTerm (String term) {
    PorterStemmer stemmer = new PorterStemmer();
    return stemmer.stem(term);
}

See here for more details. If stemming is all you want to do, then you should use this instead of Lucene.

Edit: You should lowercase term before passing it to stem().
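
The lowercasing advice can be sketched as a small wrapper. This is only an illustration under stated assumptions: the class name `LowercaseBeforeStem` and the toy rule `stripPluralIes` are hypothetical stand-ins, not Lucene APIs. The toy rule mimics one property of the Porter rules, namely that they only match lowercase suffixes, so an upper-case term passes through unchanged.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class LowercaseBeforeStem {

    // Hypothetical stand-in for a real stemmer: rewrites a trailing "ies" to "i".
    // Like the Porter rules, it only recognises lowercase suffixes.
    public static String stripPluralIes(String word) {
        return word.endsWith("ies") ? word.substring(0, word.length() - 2) : word;
    }

    // Lowercase with a fixed locale first, then hand the term to the stemmer.
    public static String stemTerm(String term, UnaryOperator<String> stemmer) {
        return stemmer.apply(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        UnaryOperator<String> toy = LowercaseBeforeStem::stripPluralIes;
        System.out.println(toy.apply("AMENITIES"));     // suffix not recognised
        System.out.println(stemTerm("AMENITIES", toy)); // lowercased, then stemmed
    }
}
```

`Locale.ROOT` keeps the result independent of the JVM's default locale (for example, Turkish locales lowercase `I` differently). In real code the `stemmer` argument would be whichever stemmer you end up using.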

盗梦空间 2024-11-01 09:30:55

Why aren't you using the "EnglishAnalyzer"? It's simple to use, and I think it would solve your problem:

EnglishAnalyzer en_an = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "your_field", en_an);
String str = "amenities";
System.out.println("result: " + parser.parse(str)); //amenit

Hope it helps you!

梦年海沫深 2024-11-01 09:30:55

The previous example applies stemming to a search query, so if you are interested in stemming a full text you can try the following:

import java.io.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.snowball.*;
import org.apache.lucene.util.*;
...
public class Stemmer{
    public static String Stem(String text, String language){
        StringBuffer result = new StringBuffer();
        if (text!=null && text.trim().length()>0){
            StringReader tReader = new StringReader(text);
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_35,language);
            TokenStream tStream = analyzer.tokenStream("contents", tReader);
            TermAttribute term = tStream.addAttribute(TermAttribute.class);

            try {
                while (tStream.incrementToken()){
                    result.append(term.term());
                    result.append(" ");
                }
            } catch (IOException ioe){
                System.out.println("Error: "+ioe.getMessage());
            }
        }

        // If, for some reason, the stemming did not happen, return the original text
        if (result.length()==0)
            result.append(text);
        return result.toString().trim();
    }

    public static void main (String[] args){
        // Print the stemmed text so the result is visible
        System.out.println(Stemmer.Stem("Michele Bachmann amenities pressed her allegations that the former head of her Iowa presidential bid was bribed by the campaign of rival Ron Paul to endorse him, even as one of her own aides denied the charge.", "English"));
    }
}

The TermAttribute class has been deprecated and will no longer be supported in Lucene 4, but the documentation is not clear on what to use in its place.
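
For what it's worth, the deprecation note on TermAttribute points to CharTermAttribute as the replacement. Below is a minimal sketch of the same loop written against it. It makes several assumptions: a Lucene release of 5.0 or later with the common analysis module on the classpath, EnglishAnalyzer swapped in for SnowballAnalyzer (it also drops English stop words, so the output differs from the code above), and a hypothetical class name `CharTermStemSketch`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CharTermStemSketch {

    // Runs the text through EnglishAnalyzer and joins the resulting terms with spaces.
    public static String stem(String text) {
        StringBuilder result = new StringBuilder();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("contents", text)) {
            // CharTermAttribute takes the place of the deprecated TermAttribute
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                       // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.append(term.toString()).append(' ');
            }
            stream.end();                         // signals the end of the stream
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(stem("amenities pressed her allegations"));
    }
}
```

The try-with-resources block closes the token stream and then the analyzer, which the older snippet leaves open.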

Also, in the first example, PorterStemmer is not available as a public class (it is hidden), so you cannot use it directly.

Hope this helps.

伤痕我心 2024-11-01 09:30:55

Here is how you can use the Snowball stemmer in Java:

import org.tartarus.snowball.ext.EnglishStemmer;

EnglishStemmer english = new EnglishStemmer();
// Split the input on whitespace to get the individual words
String[] words = "bank banker banking".split("\\s+");
for (int i = 0; i < words.length; i++) {
    english.setCurrent(words[i]);
    english.stem();
    System.out.println(english.getCurrent());
}

哭了丶谁疼 2024-11-01 09:30:55

LingPipe provides a number of tokenizers. They can be used for stemming and stop word removal. It's a simple and effective means of stemming.

怂人 2024-11-01 09:30:55

Since PorterStemmer is not public, we can't call its stem function.

Instead, we can use KStemmer/KStemFilter to stem words to their root form.

Below is a Scala code snippet that accepts a string and returns the stemmed string:

import java.io.StringReader

import org.apache.lucene.analysis.core.WhitespaceTokenizer
import org.apache.lucene.analysis.en.KStemFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

object Stemmer {
  def stem(input: String): String = {
    val stemmed_string = new StringBuilder()

    // KStemFilter expects terms that are already lowercased
    val inputReader = new StringReader(input.toLowerCase)

    val whitespaceTokenizer = new WhitespaceTokenizer()
    whitespaceTokenizer.setReader(inputReader)

    val kStemmedTokenStream = new KStemFilter(whitespaceTokenizer)
    val charTermAttribute = kStemmedTokenStream.addAttribute(classOf[CharTermAttribute])

    kStemmedTokenStream.reset()
    while (kStemmedTokenStream.incrementToken()) {
      val term = charTermAttribute.toString
      stemmed_string.append(term + " ")
    }
    // Finish and release the stream once all tokens are consumed
    kStemmedTokenStream.end()
    kStemmedTokenStream.close()

    // Note: the stemmed tokens are returned in upper case
    stemmed_string.toString().trim.toUpperCase
  }
}
