How to configure Solr / Lucene to perform Levenshtein edit distance searches?

Posted on 2024-09-16 12:58:20 · 1378 characters · 10 views · 0 comments

I have a long list of words that I put into a very simple Solr / Lucene database. My goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (Damerau-)Levenshtein edit distance. I understand Solr provides such a distance for spelling suggestions.

In my Solr schema.xml, I have configured a field type string:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

which I use to define a field:

<field name='term' type='string' indexed='true' stored='true' required='true'/>

I want to search this field and have results returned according to their Levenshtein edit distance. However, when I run a query like webspace~0.1 against Solr with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:

"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)

Clearly, for my application, term frequencies, idfs and so on are meaningless, as each document contains only a single term. I tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.

Can anybody provide hints on how to configure Solr to perform Levenshtein / Jaro-Winkler / n-gram searches with scores returned, and without additional factors such as tf, idf, and boost being included? Is there a bare-bones configuration sample for Solr somewhere? I find the number of options truly daunting.

Comments (3)

倾`听者〃 2024-09-23 12:58:20

If you're using a nightly build, you can sort results by Levenshtein distance using the strdist function:

q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc

More details here and here
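As a rough illustration of how the sort parameter above might be sent from a client, here is a minimal Python sketch that only builds the request URL (it does not contact a server). The host, port, and path in SOLR_SELECT_URL, the helper name build_sorted_fuzzy_url, and the rows and fl parameters are assumptions for illustration; only the q and sort values come from the answer. It also assumes the query word is plain alphanumeric text, since no query-syntax escaping is done.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and path are assumptions, adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_sorted_fuzzy_url(word, min_similarity=0.1, rows=10):
    """Build a select URL that fuzzy-matches `word` on the `term` field
    and sorts the hits by strdist(..., edit) in descending order."""
    params = {
        # Fuzzy query from the answer, e.g. term:webspace~0.1
        "q": f"term:{word}~{min_similarity}",
        # Sort expression from the answer
        "sort": f'strdist("{word}", term, edit) desc',
        # Assumed extras: limit the result count and return term plus score
        "rows": rows,
        "fl": "term,score",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_sorted_fuzzy_url("webspace"))
```

Note that the score returned with such a request is still the default relevance score; only the ordering is driven by strdist.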

青衫负雪 2024-09-23 12:58:20

Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, including Jaro-Winkler, Levenshtein, etc.
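To give a feel for what computing the distance outside Solr involves, here is a small pure-Python sketch. It is a stand-in written for illustration, not SimMetrics (which is a Java library) and not its API; the function names are hypothetical. It implements plain Levenshtein distance (no Damerau transpositions) and a normalized similarity of 1 - distance / max(length), which is one common convention and an assumption here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def most_similar(query, words, limit=5):
    """Return up to `limit` (similarity, word) pairs, most similar first."""
    scored = [(similarity(query, word), word) for word in words]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


if __name__ == "__main__":
    for score, word in most_similar("webspace", ["webpage", "website", "banana"]):
        print(f"{score:.3f}  {word}")
```

This scans the whole list for every query, so for a list of several hundred thousand words a dedicated library or an index-backed approach may be preferable.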

独守阴晴ぅ圆缺 2024-09-23 12:58:20

How to configure Solr to perform Levenshtein / Jaro-Winkler / n-gram
searches with scores returned, and without additional factors such as
tf, idf, and boost being included?

You've got some solutions for obtaining the desired results, but none actually answers your question.

q={!func}strdist("webspace",term,edit) will replace the default document score with the Levenshtein-based strdist value, and q={!func}strdist("webspace",term,jw) does the same for Jaro-Winkler.

The sorting suggested above will work fine in most cases, but it doesn't change the scoring function; it only sorts the results obtained with the scoring method you want to avoid. This might lead to different results, and the order of the groups might not be the same.

To see which approach fits best, adding &debugQuery=true might be enough.
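The function-query form from this answer could be assembled on the client roughly as follows. This is a sketch under assumptions: the helper name build_strdist_score_params is hypothetical, the fl parameter is added for illustration, and the optional cutoff uses Solr's {!frange} parser as a filter, which the answer does not mention. The sketch only builds the parameters and does not send a request.

```python
from urllib.parse import urlencode

# Measure names accepted by strdist in this sketch: edit distance, Jaro-Winkler, n-gram.
SUPPORTED_MEASURES = {"edit", "jw", "ngram"}


def build_strdist_score_params(word, measure="edit", min_score=None, debug=False):
    """Build request parameters that make the strdist value the document score."""
    if measure not in SUPPORTED_MEASURES:
        raise ValueError(f"unsupported measure: {measure!r}")
    func = f'strdist("{word}",term,{measure})'
    params = {
        "q": "{!func}" + func,  # function query from the answer
        "fl": "term,score",     # assumed: return the term and its score
    }
    if min_score is not None:
        # Assumed addition: keep only documents whose strdist value is >= min_score.
        params["fq"] = "{!frange l=%s}%s" % (min_score, func)
    if debug:
        params["debugQuery"] = "true"  # ask Solr to explain how scores were computed
    return params


if __name__ == "__main__":
    params = build_strdist_score_params("webspace", measure="jw", min_score=0.7, debug=True)
    print(urlencode(params))
```

A function query on its own scores every document in the index, so a cutoff such as the frange filter (or the fuzzy query used earlier) is one way to keep the result set small.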
