Is there a better way to find set intersections for search engine code?

Posted 2025-01-03 22:55:49

I have been coding up a small search engine and need to find out if there is a faster way to find set intersections. Currently, I am using a sorted linked list as explained in most search engine algorithms, i.e. for every word I keep a sorted list of documents, and then I find the intersection among the lists.

The performance profiling for this case is here.
Any other ideas for a faster set intersection?
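
For reference, the baseline described above can be expressed as a pairwise merge of sorted docId lists. This is only an illustrative sketch, not the asker's actual code; the function names and list-of-lists layout are assumptions:

from typing import List

def intersect_two(a: List[int], b: List[int]) -> List[int]:
    # Merge-style intersection of two ascending docId lists in O(len(a) + len(b)).
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_postings(postings: List[List[int]]) -> List[int]:
    # Intersect all lists, starting from the shortest so intermediate results stay small.
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_two(result, plist)
        if not result:          # early exit: the intersection is already empty
            break
    return result

The zig-zag approach in the first comment below improves on this by skipping ahead in long lists instead of scanning them element by element.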

Comments (2)

-小熊_ 2025-01-10 22:55:49


An efficient way to do it is by "zig-zag":

Assume your terms are in a list T:

lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //if we have passed the last term:
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  docId <- T[currTerm].getFirstAfter(lastDoc - 1)
  if (docId != lastDoc):
     lastDoc <- docId
     currTerm <- 0
  else:
     currTerm <- currTerm + 1

This algorithm assumes an efficient getFirstAfter(), which gives you the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is none.

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.

The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually take far fewer iterations.
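
For concreteness, here is a minimal runnable sketch of the same zig-zag idea in Python. It assumes each term's postings are an ascending list of integer docIds and implements getFirstAfter() with bisect; the names are illustrative, not from the pseudocode above:

import bisect

INF = float("inf")

def get_first_after(postings, doc_id):
    # First docId in the sorted postings list strictly greater than doc_id, or INF if none.
    # Plays the role of getFirstAfter() in the pseudocode above.
    i = bisect.bisect_right(postings, doc_id)
    return postings[i] if i < len(postings) else INF

def zigzag_intersect(term_postings):
    # Document-at-a-time ("zig-zag") intersection of ascending integer docId lists.
    if not term_postings or any(not p for p in term_postings):
        return []
    T = sorted(term_postings, key=len)   # rarest (shortest) term first
    result = []
    last_doc = T[0][0]                   # first candidate from the rarest term
    curr_term = 0
    while last_doc != INF:
        if curr_term == len(T):          # every term matched last_doc
            result.append(last_doc)
            curr_term = 0
            last_doc = get_first_after(T[0], last_doc)
            continue
        doc_id = get_first_after(T[curr_term], last_doc - 1)   # first docId >= last_doc
        if doc_id != last_doc:
            last_doc = doc_id            # jump ahead and restart from the first term
            curr_term = 0
        else:
            curr_term += 1               # this term matches; check the next term
    return result

For example, zigzag_intersect([[1, 3, 5, 7], [3, 4, 5]]) returns [3, 5].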

More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].

枯寂 2025-01-10 22:55:49


Here's a research paper with a quantitative analysis comparing current algorithms.
