Ways to implement "related searches" functionality

Posted on 2024-07-06 04:04:23

I've seen a few sites that list related searches when you perform a search; that is, they suggest other search queries you may be interested in.

I'm wondering what the best way is to model this on a medium-sized site (one without enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query. Then, when a new search is performed, find all the historical searches that match some number of its top 10 results, but ideally not all of them (matching all of them suggests an equivalent search, which is not that useful as a suggestion).
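
As a rough sketch of that idea, assuming each past query's top results are kept as a set of document IDs: the function name `related_searches`, the `history` mapping, and the `min_overlap` threshold are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def related_searches(
    new_results: List[str],
    history: Dict[str, Set[str]],
    min_overlap: int = 3,
) -> List[Tuple[str, int]]:
    """Return past queries whose stored top results partially overlap new_results."""
    current = set(new_results)
    scored = []
    for query, past in history.items():
        shared = len(current & past)
        # Skip weak matches, and skip identical result sets, which
        # probably indicate an equivalent search.
        if shared < min_overlap or past == current:
            continue
        scored.append((query, shared))
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical stored history: query text -> IDs of its top results.
history = {
    "read csv": {"d1", "d2", "d3", "d7"},
    "csv reader": {"d1", "d2", "d3", "d4"},
    "csv tips": {"d1", "d2", "d3", "d4", "d5"},
    "json parsing": {"d8", "d9"},
}
suggestions = related_searches(["d1", "d2", "d3", "d4"], history)
```

The threshold would need tuning per site; with sparse history, a lower `min_overlap` may be the only way to get any suggestions at all.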

I imagine some people have implemented this functionality before and may be able to suggest different ways to do it. I'm not necessarily looking for one winning idea, since the solution will no doubt vary substantially depending on the size and nature of the site.

Comments (2)

转身以后 2024-07-13 04:04:23

Have you considered a matrix with keywords on one axis and documents on the other? Once you have the set of vectors representing the keywords, find the keywords that occur in your initial result set, then rank the other keywords by how many documents they reference or by how many times they intersect the initial result set.
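
A minimal sketch of this, assuming the matrix is stored sparsely as a mapping from keyword to the set of documents containing it, and that the initial result set is every document matching any query keyword. The names `build_matrix` and `rank_related_keywords` and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_matrix(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Sparse keyword-by-document matrix: keyword -> IDs of documents containing it."""
    matrix: Dict[str, Set[str]] = {}
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            matrix.setdefault(keyword, set()).add(doc_id)
    return matrix


def rank_related_keywords(
    matrix: Dict[str, Set[str]],
    query_keywords: List[str],
    top_n: int = 5,
) -> List[Tuple[str, int]]:
    """Rank non-query keywords by how often they intersect the initial result set."""
    query = set(query_keywords)
    # Initial result set: documents containing any of the query keywords.
    result_docs: Set[str] = set()
    for keyword in query:
        result_docs |= matrix.get(keyword, set())
    scored = []
    for keyword, doc_ids in matrix.items():
        if keyword in query:
            continue
        overlap = len(doc_ids & result_docs)
        if overlap > 0:
            scored.append((keyword, overlap))
    # Largest intersection first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Hypothetical documents with pre-extracted keywords.
docs = {
    "a": ["python", "csv", "pandas"],
    "b": ["python", "json"],
    "c": ["csv", "excel"],
    "d": ["java", "json"],
    "e": ["python", "pandas"],
}
matrix = build_matrix(docs)
related = rank_related_keywords(matrix, ["python"])
```

Ranking by total documents referenced instead (the other option mentioned above) would just replace `len(doc_ids & result_docs)` with `len(doc_ids)` as the score.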

隔纱相望 2024-07-13 04:04:23

I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.

Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).

Here are some techniques I've used in the past, and seen in the literature:

  1. Thesaurus-based approaches: Look up each term the user has used in a thesaurus, and then use some heuristic to filter the synonyms to show the user as possible search terms.
  2. Stem and search on that: Stem the search terms (e.g. with the Porter Stemming Algorithm), then use the stemmed terms instead of the initially provided query, and give the user the option of searching for exactly the terms they specified. Or do the opposite: search the exact terms first, and use stemming to find other terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexer finds them.
  3. Chaining: Parse the results found by the user's query and extract key terms from the top N results. (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
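
To make the chaining idea concrete, here is a small sketch that swaps in plain document-frequency counting for a real keyword extractor; it is not KEA and makes no claim to match its output. The function name `suggest_terms`, the tiny stopword list, and the sample texts are assumptions for illustration only.

```python
import re
from collections import Counter
from typing import List

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "in", "for", "with", "is", "on"}
)


def suggest_terms(query: str, top_results: List[str], max_terms: int = 5) -> List[str]:
    """Suggest terms that recur across the top results but are not in the query."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    counts: Counter = Counter()
    for text in top_results:
        # Count each term once per result so one long result cannot dominate.
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(terms - STOPWORDS - query_terms)
    # Most widespread terms first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]


# Hypothetical texts of the top N results for the query.
top_results = [
    "Reading CSV files with pandas in Python",
    "Python pandas tutorial for CSV",
    "The csv module in Python",
]
terms = suggest_terms("python csv", top_results, max_terms=3)
```

Each suggested term can then be appended to the original query to form a related search, e.g. "python csv pandas".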