Cannot get CJKAnalyzer/Tokenizer to recognise Japanese text
I'm working with Lucene.NET, and it's great. I then worked on getting it to search Asian languages, and so moved from the StandardAnalyzer to the CJKAnalyzer.
This works fine for Korean (although the StandardAnalyzer worked OK for Korean!) and for Chinese (which it did not), but I still cannot get the program to recognise Japanese text.
As a very small example, I write a tiny database (using the CJKAnalyzer) with a few words in it, then try to read from the database:
public void Write(string text, AnalyzerType type)
{
    Document document = new Document();

    // Store the raw text and analyze it so it can be searched.
    document.Add(new Field(
        "text",
        text,
        Field.Store.YES,
        Field.Index.ANALYZED));

    IndexWriter correct = this.chineseWriter;
    correct.AddDocument(document);
}
That's for the writing. And for the reading:
public Document[] ReadMultipleFields(string text, int maxResults, AnalyzerType type)
{
    Analyzer analyzer = this.chineseAnalyzer;
    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
    var query = parser.Parse(text);

    // Collect up to maxResults hits, unsorted.
    TopFieldCollector collector = TopFieldCollector.create(
        new Sort(),
        maxResults,
        false,
        true,
        true,
        false);

    // Then use the searcher.
    this.searcher.Search(
        query,
        null,
        collector);

    // Holds the results.
    List<Document> documents = new List<Document>();

    // Get the top documents.
    foreach (var scoreDoc in collector.TopDocs().scoreDocs)
    {
        var doc = this.searcher.Doc(scoreDoc.doc);
        documents.Add(doc);
    }

    // Send the list of docs back.
    return documents.ToArray();
}
where chineseWriter is just an IndexWriter with the CJKAnalyzer passed in, and chineseAnalyzer is just the CJKAnalyzer.
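The setup is assumed to look roughly like this (a minimal sketch: the index path, the create flag, and the parameterless contrib CJKAnalyzer constructor are assumptions, not taken from the question):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;   // contrib analyzers assembly
using Lucene.Net.Index;
using Lucene.Net.Store;

public class IndexSetup
{
    // Hypothetical wiring: one CJKAnalyzer shared by the writer
    // and, later, by the QueryParser on the read side.
    public static IndexWriter CreateChineseWriter(string indexPath)
    {
        Analyzer chineseAnalyzer = new CJKAnalyzer();
        return new IndexWriter(
            FSDirectory.Open(new DirectoryInfo(indexPath)),
            chineseAnalyzer,
            true,                                  // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);
    }
}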
Any advice on why Japanese isn't working? The input I send seems fair enough:
プーケット
is what I store, but I cannot read it back. :(
EDIT: I was wrong... Chinese doesn't really work either: if the search term is longer than 2 characters, it stops working. Same as Japanese.
EDIT PART 2: I've now seen that the problem is the prefix search. If I search for the first 2 characters and use an asterisk, it works; as soon as I go over 2 characters, it stops working. I guess this is because of the way the word is tokenized? If I search for the full term, it does find it. Is there any way to use prefix search in Lucene.NET for CJK? プ* will work, but プーケ* will find nothing.
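That guess can be checked by dumping what the analyzer actually emits (a sketch against the Lucene.NET 2.9 attribute API; exact member names may vary slightly across 2.9.x point releases). CJKAnalyzer produces overlapping bigrams, so プーケット is indexed as プー, ーケ, ケッ, ット:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;
using Lucene.Net.Analysis.Tokenattributes;

public class TokenDump
{
    public static void Main()
    {
        // Print every term CJKAnalyzer emits for the stored string.
        Analyzer analyzer = new CJKAnalyzer();
        TokenStream stream = analyzer.TokenStream("text", new StringReader("プーケット"));
        TermAttribute term = (TermAttribute)stream.AddAttribute(typeof(TermAttribute));
        while (stream.IncrementToken())
        {
            Console.WriteLine(term.Term());   // expected: プー, ーケ, ケッ, ット
        }
    }
}

A prefix query only matches indexed terms that literally start with the given prefix, so プ* matches the bigram プー, but no bigram starts with プーケ, and that query returns nothing. Searching the full term works, presumably because the QueryParser runs it through the same analyzer and builds a phrase of bigrams that matches the indexed positions.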
Comments (1)
I use the StandardTokenizer. At least for Japanese and Korean text, it is able to tokenize words of 3 or 4 characters. The only worry is Chinese: it does tokenize Chinese, but one character at a time.
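The difference can be made visible by dumping what each analyzer emits for the same input (a sketch; the PrintTokens helper is hypothetical, and the attribute-API member names may vary across Lucene.NET 2.9.x releases):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CJK;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

public class AnalyzerComparison
{
    // Hypothetical helper: print every term an analyzer produces for a string.
    private static void PrintTokens(Analyzer analyzer, string text)
    {
        TokenStream stream = analyzer.TokenStream("text", new StringReader(text));
        TermAttribute term = (TermAttribute)stream.AddAttribute(typeof(TermAttribute));
        while (stream.IncrementToken())
        {
            Console.Write(term.Term() + " ");
        }
        Console.WriteLine();
    }

    public static void Main()
    {
        // Per the comment above, StandardAnalyzer is said to keep Japanese and
        // Korean words together while splitting Chinese one character at a time;
        // CJKAnalyzer emits overlapping bigrams for all of them.
        PrintTokens(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), "プーケット");
        PrintTokens(new CJKAnalyzer(), "プーケット");
    }
}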