Lucene.NET 4.8 can't search for words with accents
Based on some help here on Stack Overflow, I managed to create a custom analyzer, but I still can't get a search to work where a word has an accent.
public class CustomAnalyzer : Analyzer
{
    LuceneVersion matchVersion;

    public CustomAnalyzer(LuceneVersion p_matchVersion) : base()
    {
        matchVersion = p_matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        TokenStream result = new StopFilter(matchVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new LowerCaseFilter(matchVersion, result);
        result = new StandardFilter(matchVersion, result);
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(tokenizer, result);
    }
}
The idea is to be able to search for "perez" and also find "Pérez". Using that analyzer I recreated the index and searched, but there are still no results for words with accents.
As the LuceneVersion I'm using LuceneVersion.LUCENE_48.
Any help would be greatly appreciated.
Thanks!
Nope, it isn't valid to use multiple tokenizers in the same Analyzer, as there are strict consuming rules to adhere to.

It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing, such as the rule that ensures TokenStream classes are sealed or use a sealed IncrementToken() method (contributions welcome). It is not likely we will add any additional code analyzers prior to the 4.8.0 release unless they are contributed by the community, though, as these are not blocking the release. For the time being, the best way to ensure custom analyzers adhere to the rules is to test them with Lucene.Net.TestFramework, which also hits them with multiple threads, random cultures, and random strings of text to ensure they are robust.

I built a demo showing how to set up testing on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a WhitespaceTokenizer and ICUFoldingFilter. Of course, you may wish to add additional test conditions to ensure your custom analyzer meets your expectations, and then you can experiment with different tokenizers and with adding or rearranging filters until you find a solution that meets all of your requirements (as well as plays by Lucene's rules). And of course, you can later add additional conditions as you discover issues.

For other ideas about what test conditions you may use, I suggest having a look at Lucene.Net's extensive analyzer tests, including the ICU tests. You may also refer to the tests to see if you can find a similar use case to yours for building queries (although do note that the tests don't show .NET best practices for disposing objects).
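Based on that description, a minimal sketch of such an analyzer might look like the following. This is an illustration, not the demo repo's exact code: the class name AccentInsensitiveAnalyzer is my own, and it assumes the Lucene.Net.Analysis.Common and Lucene.Net.ICU packages (which provide WhitespaceTokenizer and ICUFoldingFilter) are referenced.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;   // WhitespaceTokenizer (Lucene.Net.Analysis.Common package)
using Lucene.Net.Analysis.Icu;    // ICUFoldingFilter (Lucene.Net.ICU package)
using Lucene.Net.Util;

// Hypothetical name for illustration only.
public class AccentInsensitiveAnalyzer : Analyzer
{
    private readonly LuceneVersion matchVersion;

    public AccentInsensitiveAnalyzer(LuceneVersion matchVersion)
    {
        this.matchVersion = matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Unlike KeywordTokenizer (which emits the entire input as one token),
        // WhitespaceTokenizer splits "Juan Pérez" into two tokens.
        Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);

        // ICUFoldingFilter applies Unicode case folding and strips diacritics
        // in one step, so "Pérez" is indexed as "perez".
        TokenStream result = new ICUFoldingFilter(tokenizer);

        return new TokenStreamComponents(tokenizer, result);
    }
}
```

The key point is that the same analyzer must be used both when building the index and when parsing queries, so that "perez" typed by the user folds to the same token that "Pérez" produced at index time.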