How do I configure Solr / Lucene to perform Levenshtein edit-distance searches?

Published 2024-09-16 12:58:20


I have a long list of words that I put into a very simple SOLR / Lucene database. My goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (Damerau-)Levenshtein edit distance. I understand SOLR provides such a distance for spelling suggestions.

In my SOLR schema.xml, I have configured a field type string:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

which I use to define a field:

<field name='term' type='string' indexed='true' stored='true' required='true'/>

I want to search this field and have results returned according to their Levenshtein edit distance. However, when I run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:

"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)

Clearly, for my application, term frequencies, idfs and so on are meaningless, as each document contains only a single term. I tried the spelling-suggestions component, but didn't manage to make it return the actual similarity scores.

Can anybody provide hints on how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram searches with scores returned, and without extra factors like tf, idf, and boost being included? Is there a bare-bones configuration sample for SOLR somewhere? I find the number of options truly daunting.
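For reference, the (Damerau-)Levenshtein distance the question has in mind counts insertions, deletions, substitutions, and transpositions of adjacent characters. A minimal sketch of the restricted (optimal-string-alignment) variant, purely for illustration and not part of any Solr API:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("webspace", "webpage"))  # 2: drop 's', then 'c' -> 'g'
```

This is the distance the debug output above is obscuring: "webspace" and "webpage" are two edits apart regardless of tf, idf, or boost.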

Comments (3)

倾`听者〃 2024-09-23 12:58:20


If you're using a nightly build, you can sort results by Levenshtein distance using the strdist function:

q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc

More details here and here
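When sending this over HTTP, the sort parameter needs URL-encoding. A small sketch of building the request; the host, port, and core name (mycore) are placeholders for your own setup:

```python
from urllib.parse import urlencode

params = {
    "q": "term:webspace~0.1",
    "sort": 'strdist("webspace", term, edit) desc',  # edit = Levenshtein
    "fl": "term,score",  # return the stored term and the score
}
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(url)
```

Note that this sorts by edit distance but leaves the score itself untouched; the third answer below addresses replacing the score.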

青衫负雪 2024-09-23 12:58:20


Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, incl. Jaro-Winkler, Levenshtein, etc.

独守阴晴ぅ圆缺 2024-09-23 12:58:20


how to configure SOLR to perform Levenshtein / Jaro-Winkler / n-gram
searches with scores returned and without doing additional stuff like
tf, idf, boost and so on included?

You've got some solutions for how to obtain the desired results, but none of them actually answers your question.

q={!func}strdist("webspace",term,edit) will override the default document scoring with the Levenshtein distance, and q={!func}strdist("webspace",term,jw) does the same for Jaro-Winkler.

The sorting suggested above will work fine in most cases, but it doesn't change the scoring function; it just sorts results obtained with the scoring method you want to avoid. This can lead to different results, and the order within groups might not be the same.

To see which one fits best, &debugQuery=true might be enough.
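The {!func} form can be sent the same way as the sort variant; here the distance itself becomes the document score. A sketch, again with a placeholder host and core name:

```python
from urllib.parse import urlencode

# {!func} turns the whole query into a function query, so each
# document's score is the strdist similarity itself (1.0 = identical),
# with no tf, idf, or boost mixed in.
params = {
    "q": '{!func}strdist("webspace", term, edit)',
    "fl": "term,score",
    "debugQuery": "true",  # inspect the scoring, as suggested above
}
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(url)
```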
