为什么一次文档评分允许我们在进行交集时跳过较长列表的部分内容?
这是一道信息检索题。当我们一次进行记录评分时,为什么 DAAT 允许我们跳过较长列表的部分内容。我正在阅读一篇名为“使用图形处理器进行高性能红外查询处理”的研究论文,其中他们只是提到了上述属性,没有任何解释。一个例子将不胜感激。谢谢
This is a information retrieval question. When we do document at a time scoring, why does DAAT allow us to skip over parts of longer lists. I am reading a research paper called Using Graphics Processors for High Perfomance IR Query Processing, in which they just mention the above property without any explanation. AN example will be appreciated. Thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
例如,考虑“AND”查询:
您可以想象
Gaga
比 CD 稀有得多。换句话说,Gaga
的发帖列表(或倒排列表)比CD
的发帖列表短得多。让我们假设这两个单词有两个小的发布列表(我只显示 doc_ids,这是这里感兴趣的对象):
在一次文档检索中,我们并行迭代发布列表,寻找匹配的文档查询(在 AND 查询中,仅出现在两个发布列表中的每个文档)。
在这种类型的检索中,我们可以通过了解最罕见的术语(
Gaga
)来跳过文档。这样我们就可以使用它的发布列表作为“枢轴”。要查找的第一个 doc_id 是
2
,然后是10
。请注意,我们可以跳过CD
发布列表中2
和10
之间的所有 doc_ids,因为我们知道它不会匹配任何内容。同样,下一个处理的 doc_id 是1023
。处理1023
时,我们可以跳过至少 10 个文档(从15
到564
之后的内容),因为我们知道它不会匹配任何内容。该算法(用于 AND 查询)基本上是一个数组交集。当你遇到交叉点时,你会处理它。否则你就会一直跳过。
更新:许多实现使用跳过列表来避免在处理倒排时进行比较列表。在上面的示例中,系统可以使用跳过列表“跳转”到 doc_id 接近 10 的
CD
倒排列表的下一个位置。这样就不需要与 <代码>6和<代码>8。Considering a "AND" query, for example:
You can imagine that
Gaga
is much more rare than CD. In other words, the posting list (or inverted list as you will) ofGaga
is much shorter than the one forCD
.Let's assume two small posting lists for the two words (I'm showing only the doc_ids, which are the objects of interest here):
In Document-at-a-time retrieval, we iterate through posting lists in paralell looking for docs that match the query (in a AND query, just every doc that occurs in both posting lists).
In this type of retrieval, we can skip documents by knowing the most rare term (
Gaga
). That way we can use its posting list as a "pivot".The first doc_id to look for is
2
, than is10
. Note that, we can skip all doc_ids between2
and10
inCD
posting lists, because we know it won't match anything. Similarly, the next doc_id processed is1023
. When processing1023
we can skip at least 10 documents (from15
to something after564
), because we know it won't match anything.The algorithm (for the AND query) is basically an array intersection. When you got a intersection you process it. Otherwise you keep skipping.
UPDATE: Many implementations use Skip Lists to avoid doing comparisons while processing inverted lists. In the example above, the system could use the skip list to "jump" to the next position of
CD
inverted list that has a doc_id close to 10. That way it won't need to compare with6
and8
.