What is the best practice for combining Analyzers in Lucene?

Posted 2024-10-20 17:52:48


I have a situation where I'm using a StandardAnalyzer in Lucene to index text strings as follows:

public void indexText(String suffix, boolean includeStopWords)  {        
    StandardAnalyzer analyzer = null;


    if (includeStopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
    }
    else {

        // Get Stop_Words to exclude them.
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();      
        analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
    }

    try {

        // Index text.
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);            
        this.addTextToIndex(w, this.getTextToIndex());
        w.close();

        // Read index.
        IndexReader ir = IndexReader.open(index);
        Text_TermVectorMapper ttvm = new Text_TermVectorMapper();

        int docId = 0;

        ir.getTermFreqVector(docId, PropertiesFile.getProperty(text), ttvm);

        // Set output.
        this.setWordFrequencies(ttvm.getWordFrequencies());
        ir.close();
    }
    catch(Exception ex) {
        logger.error("Error message\n", ex);
    }
}

private void addTextToIndex(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field(text, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}

This works perfectly well, but I would also like to combine it with stemming, using a SnowballAnalyzer.

The class also has two instance variables, initialized in the constructor below:

public Text_Indexer(String textToIndex) {
    this.textToIndex = textToIndex;
    this.wordFrequencies = new HashMap<String, Integer>();
}

Can anyone tell me how best to achieve this with the code above?

Thanks

Mr Morgan.

3 Answers

情绪 2024-10-27 17:52:48


Lucene provides the org.apache.lucene.analysis.Analyzer base class, which you can extend if you want to write your own Analyzer.
For reference, check out the org.apache.lucene.analysis.standard.StandardAnalyzer class, which extends Analyzer.

Then, in YourAnalyzer, you'll chain StandardAnalyzer and SnowballAnalyzer by using the filters those analyzers use, like this:

TokenStream result = new StandardFilter(tokenStream);
result = new SnowballFilter(result, stopSet);

Then, in your existing code, you'll be able to construct IndexWriter with your own Analyzer implementation that chains Standard and Snowball filters.
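As a concrete illustration, such a combined analyzer might look like the following sketch. This assumes the Lucene 3.0-era contrib API the question uses; the class name and the choice of the "English" Snowball stemmer are assumptions, not from the original post:

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer chaining standard tokenization, lower-casing,
// stop-word removal, and Snowball stemming (Lucene 3.0-era API).
public final class StandardSnowballAnalyzer extends Analyzer {

    private final Set<?> stopWords;

    public StandardSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(true, result, stopWords);
        result = new SnowballFilter(result, "English");
        return result;
    }
}
```

You could then pass an instance of this class to the IndexWriter constructor in place of the StandardAnalyzer.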

Totally off-topic:
I suppose you'll eventually need to set up a custom way of handling requests. That is already implemented inside Solr.

First write your own Search Component by extending SearchComponent and defining it in SolrConfig.xml, like this:

<searchComponent name="yourQueryComponent" class="org.apache.solr.handler.component.YourQueryComponent"/>

Then write your Search Handler (request handler) by extending SearchHandler, and define it in SolrConfig.xml:

<requestHandler name="YourRequestHandlerName" class="org.apache.solr.handler.component.YourRequestHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>

  <arr name="components">
    <str>yourQueryComponent</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>debug</str>
  </arr>
</requestHandler>

Then, when you send a URL query to Solr, simply include the additional parameter qt=YourRequestHandlerName, and your request handler will be used for that request.

More about SearchComponents.
More about RequestHandlers.

逆夏时光 2024-10-27 17:52:48


The SnowballAnalyzer provided by Lucene already uses the StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and SnowballFilter. So it sounds like it does exactly what you want (everything StandardAnalyzer does, plus the snowball stemming).

If it didn't, you could build your own analyzer pretty easily by combining whatever tokenizers and TokenStreams you wish.
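If SnowballAnalyzer does fit, a minimal sketch of dropping it into the question's indexText might look like this (Lucene 3.0-era contrib API; the "English" stemmer name and mirroring the question's stop-word branching are assumptions):

```java
// Sketch: replace StandardAnalyzer with SnowballAnalyzer in indexText,
// keeping the question's branching on includeStopWords (Lucene 3.0-era API).
Analyzer analyzer;
if (includeStopWords) {
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
}
else {
    // Get stop words to exclude them, as in the original code.
    Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
}
// ... then construct the IndexWriter with this analyzer as before.
```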

半透明的墙 2024-10-27 17:52:48


In the end I rearranged the program code to call the SnowballAnalyzer as an option. The output is then indexed via the StandardAnalyzer.

It works and is fast, but if I can do everything with just one analyzer, I'll revisit my code.
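Collapsing that two-pass arrangement into a single analyzer choice could look something like this hypothetical helper (the method name and flag are illustrative, not from the original post):

```java
// Hypothetical helper: pick one analyzer up front instead of running
// two analysis passes (Lucene 3.0-era API; names are assumptions).
private Analyzer chooseAnalyzer(boolean stem, Set<String> stopWords) {
    if (stem) {
        // Snowball stemming plus stop-word removal in a single pass
        return new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
    }
    return new StandardAnalyzer(Version.LUCENE_30, stopWords);
}
```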

Thanks to mbonaci and Avi.
