Culture-independent stemmer/analyzer for Lucene.NET
We're currently developing an app with full-text search, and Lucene.NET is our weapon of choice. The app is expected to be used by people from different countries, so Lucene.NET has to be able to search Russian, English and other texts equally well.
Are there any universal and culture-independent stemmers and analyzers to suit our needs? I understand that eventually we'd have to use culture-specific ones, but we want to get up and running with this potentially quick and dirty approach.
Comments (2)
Given that the spelling, grammar and character sets of English and Russian are significantly different, any stemmer that tried to handle both would either be massively large or perform poorly (most likely both).
It would probably be much better to use a stemmer for each language, and pick which one to use based on either UI clues (what language is being used to query) or by explicit selection.
Having said that, it's unlikely that any Russian text will match an English search term correctly, or vice versa.
This sounds like a case where a little more business analysis would help more than code.
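The per-language selection described above could be sketched roughly as follows. This is a library-free Python illustration of the dispatch idea only, not Lucene.NET code: the names `pick_analyzer`, `ANALYZERS`, and the suffix lists are hypothetical, and the suffix stripping is a toy stand-in for a real per-language stemmer.

```python
import re
from typing import Callable, Dict, List, Optional

Analyzer = Callable[[str], List[str]]

TOKEN = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    return TOKEN.findall(text.lower())


def strip_suffix(token: str, suffixes: List[str]) -> str:
    """Toy 'stemmer': drop the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Placeholder suffix lists (assumption): NOT real stemming rules for either language.
EN_SUFFIXES = ["ing", "es", "s"]
RU_SUFFIXES = ["ами", "ов", "ы", "а"]


def analyze_english(text: str) -> List[str]:
    return [strip_suffix(t, EN_SUFFIXES) for t in tokenize(text)]


def analyze_russian(text: str) -> List[str]:
    return [strip_suffix(t, RU_SUFFIXES) for t in tokenize(text)]


def analyze_plain(text: str) -> List[str]:
    """Fallback for languages without a dedicated analyzer: tokenize, no stemming."""
    return tokenize(text)


# One analyzer per supported language, keyed by two-letter language code.
ANALYZERS: Dict[str, Analyzer] = {"en": analyze_english, "ru": analyze_russian}


def pick_analyzer(explicit_language: Optional[str], ui_culture: Optional[str]) -> Analyzer:
    """Prefer the user's explicit choice, then the UI culture (e.g. 'ru-RU')."""
    for candidate in (explicit_language, ui_culture):
        if candidate:
            code = candidate.split("-")[0].lower()
            if code in ANALYZERS:
                return ANALYZERS[code]
    return analyze_plain
```

For example, `pick_analyzer(None, "ru-RU")` would select the Russian analyzer from the UI culture, while `pick_analyzer("en", "ru-RU")` lets an explicit selection override it.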
There is no such thing as a language-independent stemmer. In fact, whether stemming improves retrieval performance varies per language. The best you can do is language guessing on the documents and queries, then dispatch to the appropriate analyzer/stemmer.
Language guessing on short queries is hard, though (as in state-of-the-art, not quick 'n' dirty). If your queries are short, you might want to use a simple whitespace analyzer on the queries and not stem anything.
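A very rough sketch of that split, in Python rather than Lucene.NET: documents are routed by a crude guess, and queries are only split on whitespace. The guess here looks at the Unicode script of the letters, which is an assumption made for illustration and much weaker than real language identification (Cyrillic text is not necessarily Russian, and Latin text is not necessarily English). The names `guess_script`, `route_document`, and `analyze_query` are hypothetical.

```python
import unicodedata
from typing import List


def guess_script(text: str) -> str:
    """Return 'cyrillic', 'latin', or 'unknown' based on which letters dominate."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["cyrillic"] == counts["latin"]:
        return "unknown"  # tie, or no letters from either script
    return max(counts, key=counts.get)


def route_document(text: str) -> str:
    """Name of the analyzer a document would be dispatched to (labels are placeholders)."""
    by_script = {"cyrillic": "russian", "latin": "english"}
    return by_script.get(guess_script(text), "whitespace")


def analyze_query(query: str) -> List[str]:
    """Short queries: split on whitespace only, with no stemming and no guessing."""
    return query.split()
```

Note that unstemmed query terms can only match terms that were indexed unstemmed, so this approach presumably needs the documents indexed without stemming as well, for instance in a separate field.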