Is there a better way to find set intersections for search engine code?
I have been coding up a small search engine and need to find out whether there is a faster way to compute set intersections. Currently, I am using a sorted linked list, as explained in most search engine algorithms; i.e., for every word I keep a sorted list of documents and then find the intersection among the lists.
The performance profiling of the case is here.
Any other ideas for a faster set intersection?
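For reference, the baseline described in the question (one sorted document list per word, intersected pairwise) might look roughly like the sketch below. It uses plain Python lists rather than linked lists, and the names intersect_sorted and intersect_all are illustrative only; they are not taken from the asker's code.

```python
from functools import reduce


def intersect_sorted(a, b):
    """Intersect two ascending lists of doc ids with a linear two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


def intersect_all(postings):
    """Intersect many posting lists, shortest first so intermediate results stay small."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    return reduce(intersect_sorted, ordered)


if __name__ == "__main__":
    print(intersect_all([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]]))
```

Each pairwise merge costs time proportional to the combined length of the two lists, which is the cost a faster method would need to beat.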
Answers (2)
An efficient way to do it is the "zig-zag" approach.
Assume your terms are a list T. The algorithm assumes an efficient getFirstAfter(), which returns the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is none. The algorithm is most efficient if the terms are sorted so that the rarest term comes first.
The algorithm guarantees at most #docs_matching_first_term * #terms iterations, but in practice it usually takes far fewer. More information can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
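A minimal sketch of the zig-zag idea follows. It is an illustration under stated assumptions, not the answer's original listing. The Term class is hypothetical, doc ids are assumed to be integers, and get_first_after() is implemented here with a binary search over a sorted Python list, which is only one possible way to make that lookup efficient.

```python
from bisect import bisect_right

INF = float("inf")


class Term:
    """Hypothetical posting list: ascending integer doc ids for one term."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(set(doc_ids))

    def get_first_after(self, doc_id):
        """Return the smallest doc id strictly greater than doc_id, or INF if none."""
        pos = bisect_right(self.doc_ids, doc_id)
        return self.doc_ids[pos] if pos < len(self.doc_ids) else INF


def zigzag_intersect(terms):
    """Return doc ids present in every term, probing the rarest term first."""
    if not terms:
        return []
    terms = sorted(terms, key=lambda t: len(t.doc_ids))  # rarest term first
    result = []
    candidate = terms[0].get_first_after(-INF)
    while candidate != INF:
        for t in terms[1:]:
            nxt = t.get_first_after(candidate - 1)  # first id >= candidate
            if nxt != candidate:
                # Mismatch: skip the first term ahead to the first id >= nxt.
                candidate = terms[0].get_first_after(nxt - 1)
                break
        else:
            # Every term contains the candidate.
            result.append(candidate)
            candidate = terms[0].get_first_after(candidate)
    return result


if __name__ == "__main__":
    lists = [[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]]
    print(zigzag_intersect([Term(ids) for ids in lists]))
```

Every pass through the outer loop moves the candidate forward in the first (rarest) term's list, which is where the bound of #docs_matching_first_term * #terms lookups comes from. A skip list or another index structure could replace the binary search if the postings are kept as linked lists.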
Here's a research paper with a quantitative analysis comparing current algorithms.