Can't get CJKAnalyzer/Tokenizer to recognise Japanese text

Posted 2024-12-25 07:58:21


I'm working with Lucene.NET and it's great. I then looked at how to get it to search Asian languages, and so moved from the StandardAnalyzer to the CJKAnalyzer.

This works fine for Korean (although the StandardAnalyzer worked OK for Korean!) and for Chinese (which it did not), but I still cannot get the program to recognise Japanese text.

As a very small example, I write a tiny database (using the CJKAnalyzer) with a few words in it, then try to read from it:
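For reference, CJKAnalyzer does not do dictionary-based word segmentation: it emits overlapping two-character (bigram) tokens for CJK runs. A rough Python sketch of that token shape (an approximation of the scheme, not Lucene's actual code):

```python
def cjk_bigrams(text):
    """Approximate CJKAnalyzer's scheme: overlapping 2-character tokens."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

# The Japanese example from this post is indexed as four bigrams,
# not as one five-character word:
print(cjk_bigrams("プーケット"))  # ['プー', 'ーケ', 'ケッ', 'ット']
```

A match therefore only happens when the query side produces the same bigrams, which is why the same analyzer has to be used at both index and query time.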

public void Write(string text, AnalyzerType type)
{
    Document document = new Document();

    // Store the raw text and analyze it at index time.
    document.Add(new Field(
        "text",
        text,
        Field.Store.YES,
        Field.Index.ANALYZED));

    IndexWriter correct = this.chineseWriter;
    correct.AddDocument(document);
}

That's the writing side. And for the reading:

public Document[] ReadMultipleFields(string text, int maxResults, AnalyzerType type)
{
    Analyzer analyzer = this.chineseAnalyzer;

    // Parse the query with the same analyzer used at index time.
    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
    var query = parser.Parse(text);

    // Collect up to maxResults hits.
    TopFieldCollector collector = TopFieldCollector.create(
        new Sort(),
        maxResults,
        false,  // fillFields
        true,   // trackDocScores
        true,   // trackMaxScore
        false); // docsScoredInOrder

    // Then use the searcher.
    this.searcher.Search(
        query,
        null,
        collector);

    // Holds the results.
    List<Document> documents = new List<Document>();

    // Get the top documents.
    foreach (var scoreDoc in collector.TopDocs().scoreDocs)
    {
        var doc = this.searcher.Doc(scoreDoc.doc);
        documents.Add(doc);
    }

    // Send the list of docs back.
    return documents.ToArray();
}

Here, chineseWriter is just an IndexWriter with the CJKAnalyzer passed in, and chineseAnalyzer is just the CJKAnalyzer.

Any advice on why Japanese isn't working? The input I send seems fair enough:

プーケット

is what I store, but I cannot read it back. :(

EDIT: I was wrong... Chinese doesn't really work either: if the search term is longer than 2 characters, it stops working. Same as Japanese.

EDIT PART 2: I've now seen that the problem is the prefix search. If I search for the first 2 characters and add an asterisk, it works; as soon as I go over 2 characters, it stops working. I guess this is because of the way the word is tokenized? If I search for the full term, it does find it. Is there any way to use prefix search in Lucene.NET for CJK? プ* will work, but プーケ* will find nothing.
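The behaviour in EDIT PART 2 is consistent with bigram indexing: a prefix query matches whole indexed terms that start with the given prefix, and every stored term is only two characters long, so nothing can ever start with a three-character prefix. A Python sketch of the failure, plus a hypothetical workaround (bigram the query prefix yourself and require all of its bigrams, e.g. as a phrase query); the function names here are mine, not Lucene API:

```python
def cjk_bigrams(text):
    """Approximate the bigram tokens CJKAnalyzer stores in the index."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def prefix_matches(prefix, indexed_terms):
    """A prefix query matches whole indexed terms starting with the prefix."""
    return [t for t in indexed_terms if t.startswith(prefix)]

terms = cjk_bigrams("プーケット")        # what the index actually holds

print(prefix_matches("プ", terms))       # ['プー']  -> プ* finds the doc
print(prefix_matches("プーケ", terms))   # []        -> プーケ* finds nothing

def bigram_prefix_query(prefix):
    """Hypothetical rewrite: turn a long CJK prefix into a conjunction
    of its bigrams, which *are* terms the index can match."""
    return " AND ".join('"%s"' % b for b in cjk_bigrams(prefix))

print(bigram_prefix_query("プーケ"))     # "プー" AND "ーケ"
```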


Comments (1)

温柔女人霸气范 2025-01-01 07:58:21


I use the StandardTokenizer. At least for Japanese and Korean text it is able to tokenize words of 3 or 4 characters. The only worry is Chinese: it does tokenize Chinese text, but one character at a time.
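A rough Python approximation of the behaviour described above (not the real StandardTokenizer): non-ideograph runs such as kana or hangul stay together, while each Chinese ideograph becomes its own single-character token.

```python
def standard_style_tokens(text):
    """Approximate the split described above: keep non-ideograph runs
    together, emit each Chinese ideograph as a one-character token."""
    tokens, run = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":   # CJK Unified Ideographs block
            if run:
                tokens.append(run)
                run = ""
            tokens.append(ch)            # ideograph -> unigram token
        else:
            run += ch
    if run:
        tokens.append(run)
    return tokens

print(standard_style_tokens("プーケット"))  # ['プーケット'] - one token
print(standard_style_tokens("中文搜索"))    # ['中', '文', '搜', '索']
```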
