Apache Lucene and text meaning
I have a question about the search process in Lucene.
I use this code to search:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString); // searchString: the user's query text
In one document I've set a field to "I want to go shopping", and in another document I've set it to "I wanna go shopping".
Both sentences mean the same thing!
Is there a good way for Lucene to understand the meaning of sentences, or to normalize them somehow? For example, I could store the field as "I wanna /want to/ go shopping" and strip the annotation from the results with a regexp.
Comments (1)
Lucene provides token filters that normalize words and can even map similar words to one another.
PorterStemFilter -
Stemming reduces words to their roots.
E.g. "wanted" and "wants" would both be reduced to the root "want", so a search for any of those words would match the document.
However, "wanna" does not reduce to the root "want", so stemming alone may not work in this case.
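To see why stemming does not help with "wanna", here is a minimal plain-Java sketch. It is a toy suffix stripper, not the Porter algorithm and not Lucene code; the class name ToyStemmerSketch and its two-entry suffix list are made up for illustration. The point is only that a stemmer removes inflectional endings, and "wanna" has none to remove.

```java
import java.util.List;

// Toy suffix stripper for illustration; this is NOT the Porter algorithm.
class ToyStemmerSketch {

    // Hypothetical suffix list, far smaller than any real stemmer's rule set.
    static final List<String> SUFFIXES = List.of("ed", "s");

    // Strips the first matching suffix, or returns the word unchanged.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words are left alone.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String word : List.of("wanted", "wants", "want", "wanna")) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

In this sketch "wanted" and "wants" collapse to the same term as "want", while "wanna" passes through unchanged, so the two sentences from the question would still be indexed with different terms.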
SynonymFilter -
This lets you map similar words to each other in a configuration file.
So "wanna" can be mapped to "want", and a search for either of them should then match the document.
You would need to add these filters to your analysis chain.
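As a rough illustration of what such an analysis chain does, the plain-Java sketch below tokenizes, lowercases, drops stop words, and maps synonyms, then applies AND matching as in the question's query parser setup. It does not use the Lucene API: SynonymChainSketch, its stop-word set, and its synonym table are all assumptions made up for this example, and the same chain is assumed to run at both index time and query time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only: none of these names are Lucene classes.
class SynonymChainSketch {

    // Hypothetical stop-word list; a real analyzer's list depends on its configuration.
    static final Set<String> STOP_WORDS = Set.of("to", "a", "the");

    // Hypothetical synonym table, standing in for a synonym configuration file.
    static final Map<String, String> SYNONYMS = Map.of("wanna", "want");

    // Tokenize -> lowercase -> drop stop words -> map synonyms.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(SYNONYMS.getOrDefault(token, token));
        }
        return terms;
    }

    // AND semantics: every analyzed query term must occur in the analyzed document.
    static boolean matchesAll(String query, String document) {
        return analyze(document).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        String[] docs = {"I want to go shopping", "I wanna go shopping"};
        for (String doc : docs) {
            System.out.println(doc + " -> " + analyze(doc)
                    + " matches: " + matchesAll("I wanna go shopping", doc));
        }
    }
}
```

One design point worth noting: in this sketch the word "to" is removed as a stop word. Without that step, an AND query for "I want to go shopping" would still miss the "wanna" document, because mapping "wanna" to "want" does not supply the term "to".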