Sphinx and a "did you mean ... ?" suggestion idea. Will it work?

Posted 2024-10-17 12:20:57 · 988 words · 6 views · 0 comments

I'm trying to come up with the fastest way to make search suggestions. At first I thought a Levenshtein UDF combined with a MySQL table would do the job. But with Levenshtein, MySQL would have to scan every row in the table (tons of words), which would make the query really slow.

I recently installed and started using Sphinx (http://sphinxsearch.com/) for full-text searching, mainly because of its performance and its tight MySQL integration through SphinxSE.

So I asked myself whether I could implement a "did you mean" algorithm using Sphinx to boost performance somehow, and I think I found a simple one.
Basically, I take all the keywords I want to correct, put a space between each letter, and put the result in the Sphinx index. If the word is 'keyword', it becomes 'k e y w o r d'. When the user enters a word, I split it into letters and search the Sphinx index for a record (I just need one) that matches any of the letters provided. The best part is that Sphinx is very good at calculating the relevance (weight) of the matched rows, so the best match will always have the biggest weight (I think). It also accounts for the positions of words (letters, in my case), so the best match will have its letters in that order.
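The letter-splitting step described above could be sketched roughly as follows. This is only an illustration: the function names `to_letters` and `any_letter_query` are made up here, and the query string assumes Sphinx's extended query syntax, where `|` is the OR operator. Lowercasing and the absence of special characters that would need escaping are also assumptions.

```python
def to_letters(word: str) -> str:
    """Return the word with a space between each character.

    This is the form that would be stored in the index,
    e.g. 'keyword' becomes 'k e y w o r d'.
    """
    return " ".join(word.strip().lower())


def any_letter_query(word: str) -> str:
    """Build a query string that matches records containing ANY of the letters.

    Assumes the extended query syntax, where '|' means OR.
    Characters with special meaning in that syntax are not escaped here.
    """
    letters = to_letters(word).split()
    return " | ".join(letters)


if __name__ == "__main__":
    print(to_letters("keyword"))        # text to index
    print(any_letter_query("kewyord"))  # query for a misspelled input
```

Ranking is left entirely to the search engine in this sketch; only the top-weighted row would be fetched and passed on to the distance check.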

With the Sphinx query I get the most similar word in my keyword list. Then I check it in PHP using the extended Levenshtein distance, which also accounts for transposed letters: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance . If the string distance is lower than 2 (and != 0), I suggest the word. Otherwise I don't suggest anything.
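For illustration, the final check could look something like the sketch below. It is written in Python rather than PHP, and it uses the optimal string alignment variant of the Damerau-Levenshtein distance, which is an assumption about which variant is meant. The names `osa_distance` and `suggest` are hypothetical.

```python
from typing import Optional


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions
    of two adjacent characters, each with cost 1.
    """
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def suggest(user_word: str, candidate: str) -> Optional[str]:
    """Return the candidate only if the distance is lower than 2 and not 0."""
    distance = osa_distance(user_word, candidate)
    return candidate if 0 < distance < 2 else None
```

With this threshold, only words exactly one edit away are suggested: `suggest("kewyord", "keyword")` returns the candidate because a single transposition separates the two, while an exact match or a more distant word yields `None`.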

Is there a problem with my idea? Something I didn't think of? Are there any expected glitches with the Sphinx query, or quirks in the Sphinx relevance calculation that would keep it from returning the best match? Please correct me if I'm mistaken somewhere.

Comments (3)

稳稳的幸福 2024-10-24 12:20:57

I can't see a problem with your idea. Go for it. Just note that your method is only relevant if you want to override the built-in behaviour, which is already very similar to LD (Levenshtein distance).

For example, with Sphinx 1.10-beta, you can specify min_infix_len and expand_keywords and use Sphinx's built-in weighting methods (BM25 and some proprietary code) for good results. http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/

Don't forget to memcache these queries, and create a warm-up script.

┾廆蒐ゝ 2024-10-24 12:20:57

You could just log every search query that's entered, along with the next search query that the user enters.

Let's assume that lots of users search for rhinosorous but actually mean rhinoceros. Because users will correct their query, there will be a lot of rhinosorous queries with rhinoceros as the next query.

You can select suggestions like this:

SELECT query, next_query, COUNT(*) AS count FROM queries GROUP BY query, next_query ORDER BY count DESC

If the top result has a count that's a high percentage of all queries for that keyword, display a message.

I haven't tested this, it's just an idea.
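As a rough sketch of how the "high percentage" check might work, here is a small Python example using an in-memory SQLite database. The `queries` table layout, the `top_correction` helper, the 50% threshold, and the sample rows are all assumptions made for illustration, not measured data.

```python
import sqlite3


def top_correction(conn, query, min_share=0.5):
    """Return the most common follow-up query if it dominates the log.

    min_share is the fraction of all logged follow-ups for `query`
    that the top follow-up must reach; 0.5 is an arbitrary choice.
    """
    rows = conn.execute(
        "SELECT next_query, COUNT(*) AS n FROM queries "
        "WHERE query = ? GROUP BY next_query ORDER BY n DESC, next_query",
        (query,),
    ).fetchall()
    total = sum(n for _, n in rows)
    if total == 0:
        return None  # nothing logged for this query
    best, best_n = rows[0]
    if best != query and best_n / total >= min_share:
        return best
    return None


# Made-up sample log: three users corrected to 'rhinoceros', one to 'rhino'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queries (id INTEGER PRIMARY KEY, query TEXT, next_query TEXT)"
)
conn.executemany(
    "INSERT INTO queries (query, next_query) VALUES (?, ?)",
    [("rhinosorous", "rhinoceros")] * 3 + [("rhinosorous", "rhino")],
)

if __name__ == "__main__":
    print(top_correction(conn, "rhinosorous"))
```

In the sample log, 'rhinoceros' accounts for three of the four follow-ups, so it clears the threshold and would be shown as the suggestion.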

酒绊 2024-10-24 12:20:57

I think you will find it interesting to read what Andrew Aksyonoff (the author of Sphinx) thinks about implementing this task with Sphinx: http://habrahabr.ru/blogs/sphinx/61807/ (it is in Russian, so use a translator)
