Is there a better way to find set intersections for search engine code?

Posted 2025-01-03 22:55:49

I have been coding up a small search engine and need to find out if there is a faster way to find set intersections. Currently, I am using a sorted linked list as explained in most search engine algorithms, i.e. for every word I keep a sorted list of documents, and then I find the intersection among the lists.

The performance profiling for this case is here.
Any other ideas for a faster set intersection?
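
For reference, the baseline described above can be expressed as a pairwise merge of sorted docId lists. This is only an illustrative sketch, not the asker's actual code; the function names and list-of-lists layout are assumptions:

from typing import List

def intersect_two(a: List[int], b: List[int]) -> List[int]:
    # Merge-style intersection of two ascending docId lists in O(len(a) + len(b)).
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_postings(postings: List[List[int]]) -> List[int]:
    # Intersect all lists, starting from the shortest so intermediate results stay small.
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_two(result, plist)
        if not result:          # early exit: the intersection is already empty
            break
    return result

The zig-zag approach in the first comment below improves on this by skipping ahead in long lists instead of scanning them element by element.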

Comments (2)

-小熊_ 2025-01-10 22:55:49


An efficient way to do it is by "zig-zag":

Assume your terms are in a list T:

lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //if we have passed the last term:
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  docId <- T[currTerm].getFirstAfter(lastDoc - 1)
  if (docId != lastDoc):
     lastDoc <- docId
     currTerm <- 0
  else:
     currTerm <- currTerm + 1

This algorithm assumes an efficient getFirstAfter(), which gives you the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is none.

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.

The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer iterations.
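
For concreteness, here is a minimal runnable sketch of the same zig-zag idea in Python. It assumes each term's postings are an ascending list of integer docIds and implements getFirstAfter() with bisect; the names are illustrative, not from the pseudocode above:

import bisect

INF = float("inf")

def get_first_after(postings, doc_id):
    # First docId in the sorted postings list strictly greater than doc_id, or INF if none.
    # Plays the role of getFirstAfter() in the pseudocode above.
    i = bisect.bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(term_postings):
    # Document-at-a-time ("zig-zag") intersection of ascending integer docId lists.
    if not term_postings or any(not p for p in term_postings):
        return []
    T = sorted(term_postings, key=len)   # rarest (shortest) term first
    result = []
    last_doc = T[0][0]                   # first candidate from the rarest term
    curr_term = 0
    while last_doc != INF:
        if curr_term == len(T):          # every term matched last_doc
            result.append(last_doc)
            curr_term = 0
            last_doc = get_first_after(T[0], last_doc)
            continue
        doc_id = get_first_after(T[curr_term], last_doc - 1)   # first docId >= last_doc
        if doc_id != last_doc:
            last_doc = doc_id            # jump ahead and restart from the first term
            curr_term = 0
        else:
            curr_term += 1               # this term matches; check the next term
    return result

For example, zigzag_intersect([[1, 3, 5, 7], [3, 4, 5]]) returns [3, 5].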

More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].

枯寂 2025-01-10 22:55:49


Here's a research paper with a quantitative analysis comparing current algorithms.
