Cannot get CJKAnalyzer/Tokenizer to recognise Japanese text
I'm working with Lucene.NET, and it's great. I then worked on getting it to search Asian languages, and so moved from the StandardAnalyzer to the CJKAnalyzer.
This works fine for Korean (although the StandardAnalyzer worked OK for Korean!) and for Chinese (which it did not), but I still cannot get the program to recognise Japanese text.
As a very small example, I write a tiny database (using the CJKAnalyzer) with a few words in it, then try to read from the database:
public void Write(string text, AnalyzerType type)
{
    Document document = new Document();

    // Store the raw text and analyze it so it can be searched.
    document.Add(new Field(
        "text",
        text,
        Field.Store.YES,
        Field.Index.ANALYZED));

    IndexWriter correct = this.chineseWriter;
    correct.AddDocument(document);
}
That's for the writing. And for the reading:
public Document[] ReadMultipleFields(string text, int maxResults, AnalyzerType type)
{
    Analyzer analyzer = this.chineseAnalyzer;
    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
    var query = parser.Parse(text);

    // Collect up to maxResults hits, unsorted.
    TopFieldCollector collector = TopFieldCollector.create(
        new Sort(),
        maxResults,
        false,
        true,
        true,
        false);

    // Then use the searcher.
    this.searcher.Search(
        query,
        null,
        collector);

    // Holds the results.
    List<Document> documents = new List<Document>();

    // Get the top documents.
    foreach (var scoreDoc in collector.TopDocs().scoreDocs)
    {
        var doc = this.searcher.Doc(scoreDoc.doc);
        documents.Add(doc);
    }

    // Send the list of docs back.
    return documents.ToArray();
}
where chineseWriter is just an IndexWriter with the CJKAnalyzer passed in, and chineseAnalyzer is just the CJKAnalyzer.
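The setup is assumed to look roughly like this (a minimal sketch: the index path, the create flag, and the parameterless contrib CJKAnalyzer constructor are assumptions, not taken from the question):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;   // contrib analyzers assembly
using Lucene.Net.Index;
using Lucene.Net.Store;

public class IndexSetup
{
    // Hypothetical wiring: one CJKAnalyzer shared by the writer
    // and, later, by the QueryParser on the read side.
    public static IndexWriter CreateChineseWriter(string indexPath)
    {
        Analyzer chineseAnalyzer = new CJKAnalyzer();
        return new IndexWriter(
            FSDirectory.Open(new DirectoryInfo(indexPath)),
            chineseAnalyzer,
            true,                                  // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);
    }
}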
Any advice on why Japanese isn't working? The input I send seems fair enough:
プーケット
is what I store, but I cannot read it back. :(
EDIT: I was wrong... Chinese doesn't really work either: if the search term is longer than 2 characters, it stops working. Same as Japanese.
EDIT PART 2: I've now seen that the problem is the prefix search. If I search for the first 2 characters and use an asterisk, it works; as soon as I go over 2 characters, it stops working. I guess this is because of the way the word is tokenized? If I search for the full term, it does find it. Is there any way to use prefix search in Lucene.NET for CJK? プ* will work, but プーケ* will find nothing.
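That guess can be checked by dumping what the analyzer actually emits (a sketch against the Lucene.NET 2.9 attribute API; exact member names may vary slightly across 2.9.x point releases). CJKAnalyzer produces overlapping bigrams, so プーケット is indexed as プー, ーケ, ケッ, ット:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;
using Lucene.Net.Analysis.Tokenattributes;

public class TokenDump
{
    public static void Main()
    {
        // Print every term CJKAnalyzer emits for the stored string.
        Analyzer analyzer = new CJKAnalyzer();
        TokenStream stream = analyzer.TokenStream("text", new StringReader("プーケット"));
        TermAttribute term = (TermAttribute)stream.AddAttribute(typeof(TermAttribute));
        while (stream.IncrementToken())
        {
            Console.WriteLine(term.Term());   // expected: プー, ーケ, ケッ, ット
        }
    }
}

A prefix query only matches indexed terms that literally start with the given prefix, so プ* matches the bigram プー, but no bigram starts with プーケ, and that query returns nothing. Searching the full term works, presumably because the QueryParser runs it through the same analyzer and builds a phrase of bigrams that matches the indexed positions.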
Comments (1)
I use the StandardTokenizer. At least for Japanese and Korean text, it is able to tokenize words of 3 or 4 characters. The only worry is Chinese: it does tokenize Chinese, but one character at a time.
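The difference can be made visible by dumping what each analyzer emits for the same input (a sketch; the PrintTokens helper is hypothetical, and the attribute-API member names may vary across Lucene.NET 2.9.x releases):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

public class AnalyzerComparison
{
    // Hypothetical helper: print every term an analyzer produces for a string.
    private static void PrintTokens(Analyzer analyzer, string text)
    {
        TokenStream stream = analyzer.TokenStream("text", new StringReader(text));
        TermAttribute term = (TermAttribute)stream.AddAttribute(typeof(TermAttribute));
        while (stream.IncrementToken())
        {
            Console.Write(term.Term() + " ");
        }
        Console.WriteLine();
    }

    public static void Main()
    {
        // Per the comment above, StandardAnalyzer is said to keep Japanese and
        // Korean words together while splitting Chinese one character at a time;
        // CJKAnalyzer emits overlapping bigrams for all of them.
        PrintTokens(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), "プーケット");
        PrintTokens(new CJKAnalyzer(), "プーケット");
    }
}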