Lucene standard analyzer vs. Snowball

Posted on 2024-09-26 01:31:24

Just getting started with Lucene.Net. I indexed 100,000 rows using the standard analyzer, ran some test queries, and noticed that plural queries don't return results if the original term was singular. I understand the Snowball analyzer adds stemming support, which sounds nice. However, I'm wondering if there are any drawbacks to going with Snowball over standard. Am I losing anything by going with it? Are there any other analyzers out there to consider?

Comments (3)

栀梦 2024-10-03 01:31:24

Yes, by using a stemmer such as Snowball, you are losing information about the original form of your text. Sometimes this will be useful, sometimes not.

For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.

Whether or not this is appropriate to you depends on your content, and on the type of queries you are supporting (for example, are the searches very basic, or are users very sophisticated and using your search to accurately filter down the results). You may also want to look into less aggressive stemmers, such as KStem.
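To make the trade-off in this answer concrete, here is a minimal sketch in plain Python rather than Lucene.Net. The suffix rules, the `toy_stem` function, and the three sample documents are all hypothetical; this is not the Snowball algorithm, only a stand-in that happens to conflate the same pair of words. It shows a plural query missing without stemming, matching with stemming, and "organization" pulling in an "organ" document once both collapse to the same term.

```python
# Toy illustration only: these suffix rules are hypothetical and are NOT the
# Snowball algorithm. They are chosen so the effect is easy to see.
SUFFIX_RULES = [("ization", ""), ("ies", "y"), ("s", "")]


def toy_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 chars."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def identity(token):
    """Stand-in for an analyzer that does no stemming."""
    return token


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def search(index, query, normalize):
    """Return the sorted ids of documents matching a one-word query."""
    return sorted(index.get(normalize(query.lower()), set()))


# Hypothetical sample documents.
docs = {1: "red car", 2: "organ donor", 3: "charity organization"}

exact_index = build_index(docs, identity)
stemmed_index = build_index(docs, toy_stem)

print(search(exact_index, "cars", identity))            # [] (plural misses)
print(search(stemmed_index, "cars", toy_stem))          # [1]
print(search(stemmed_index, "organization", toy_stem))  # [2, 3]
```

The last query is the case to watch: once both words are reduced to the same term, the index can no longer tell them apart, so document 2 matches as strongly as document 3.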

很糊涂小朋友 2024-10-03 01:31:24

The Snowball analyzer will increase your recall, because it is much more aggressive than the standard analyzer, usually at some cost in precision. So you need to evaluate your search results to see whether, for your data, you need higher recall or higher precision.
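One way to run that evaluation is to hand-label which documents are relevant for a few test queries and compare what each analyzer returns. The sketch below assumes such labels exist; the document ids and result sets are made up for illustration and do not come from any real index.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant hits / everything retrieved
    recall    = relevant hits / everything that is relevant
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for a single query: documents 1, 3 and 4 are relevant.
relevant = {1, 3, 4}

# Hypothetical result sets from two differently analyzed indexes.
results = {
    "standard": {1},          # exact terms only: misses inflected forms
    "stemmed": {1, 2, 3, 4},  # finds all relevant docs plus one false match
}

for name, retrieved in results.items():
    p, r = precision_recall(retrieved, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# standard: precision=1.00 recall=0.33
# stemmed: precision=0.75 recall=1.00
```

Averaging these two numbers over a representative set of queries gives a rough basis for deciding which analyzer suits the data.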

风追烟花雨 2024-10-03 01:31:24

I just finished an analyzer that performs lemmatization. That's similar to stemming, except that it uses context to determine a word's type (noun, verb, etc.) and uses that information to derive the stem. It also keeps the original form of the word in the index. Maybe my library can be of use to you. It requires Lucene Java, though, and I'm not aware of any C#/.NET lemmatizers.
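As a rough illustration of the idea described here (not the author's library, whose API is not shown in this thread), the sketch below uses a hypothetical hand-written lemma table keyed by word and part of speech, and emits both the original form and the lemma for each token position. The part-of-speech tags are supplied by hand; a real lemmatizer would infer them from context.

```python
# Hypothetical lemma table keyed by (surface form, part of speech).
# A real lemmatizer would derive the part of speech from sentence context.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meetings", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def lemmatize(word, pos):
    """Look up the lemma for a tagged word, falling back to the word itself."""
    return LEMMAS.get((word, pos), word)


def index_terms(tagged_tokens):
    """For each token position, return the terms to index: original + lemma."""
    return [{word, lemmatize(word, pos)} for word, pos in tagged_tokens]


# Hand-tagged sample sentence: "she saw the saw".
tagged = [("she", "PRON"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")]

for position, terms in enumerate(index_terms(tagged)):
    print(position, sorted(terms))
# 0 ['she']
# 1 ['saw', 'see']
# 2 ['the']
# 3 ['saw']
```

The same surface form "saw" yields different index terms depending on its tag, which a suffix-stripping stemmer cannot do, and keeping the original form at each position leaves room for exact matches to be treated differently from lemma matches.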
