为什么一次文档评分允许我们在进行交集时跳过较长列表的部分内容?

发布于 2024-11-06 05:13:39 字数 123 浏览 0 评论 0原文

这是一道信息检索题。当我们一次进行记录评分时,为什么 DAAT 允许我们跳过较长列表的部分内容。我正在阅读一篇名为“使用图形处理器进行高性能红外查询处理”的研究论文,其中他们只是提到了上述属性,没有任何解释。一个例子将不胜感激。谢谢

This is a information retrieval question. When we do document at a time scoring, why does DAAT allow us to skip over parts of longer lists. I am reading a research paper called Using Graphics Processors for High Perfomance IR Query Processing, in which they just mention the above property without any explanation. AN example will be appreciated. Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

怼怹恏 2024-11-13 05:13:39

例如,考虑“AND”查询:

"Gaga AND CD"

您可以想象 Gaga 比 CD 稀有得多。换句话说,Gaga 的发帖列表(或倒排列表)比 CD 的发帖列表短得多。

让我们假设这两个单词有两个小的发布列表(我只显示 doc_ids,这是这里感兴趣的对象):

Gaga -> [2, 10, 1023, 2030]
CD   -> [1, 2, 6, 8, 15, 32, 43, 52, 92, 115, 326, 401, 560, 564, ... , 1924, 2030, ...]

在一次文档检索中,我们并行迭代发布列表,寻找匹配的文档查询(在 AND 查询中,仅出现在两个发布列表中的每个文档)。

在这种类型的检索中,我们可以通过了解最罕见的术语(Gaga)来跳过文档。这样我们就可以使用它的发布列表作为“枢轴”。
要查找的第一个 doc_id 是 2,然后是 10。请注意,我们可以跳过 CD 发布列表中 210 之间的所有 doc_ids,因为我们知道它不会匹配任何内容。同样,下一个处理的 doc_id 是 1023。处理 1023 时,我们可以跳过至少 10 个文档(从 15564 之后的内容),因为我们知道它不会匹配任何内容。

该算法(用于 AND 查询)基本上是一个数组交集。当你遇到交叉点时,你会处理它。否则你就会一直跳过。

更新:许多实现使用跳过列表来避免在处理倒排时进行比较列表。在上面的示例中,系统可以使用跳过列表“跳转”到 doc_id 接近 10 的 CD 倒排列表的下一个位置。这样就不需要与 <代码>6和<代码>8。

Considering a "AND" query, for example:

"Gaga AND CD"

You can imagine that Gaga is much more rare than CD. In other words, the posting list (or inverted list as you will) of Gaga is much shorter than the one for CD.

Let's assume two small posting lists for the two words (I'm showing only the doc_ids, which are the objects of interest here):

Gaga -> [2, 10, 1023, 2030]
CD   -> [1, 2, 6, 8, 15, 32, 43, 52, 92, 115, 326, 401, 560, 564, ... , 1924, 2030, ...]

In Document-at-a-time retrieval, we iterate through posting lists in paralell looking for docs that match the query (in a AND query, just every doc that occurs in both posting lists).

In this type of retrieval, we can skip documents by knowing the most rare term (Gaga). That way we can use its posting list as a "pivot".
The first doc_id to look for is 2, than is 10. Note that, we can skip all doc_ids between 2 and 10 in CD posting lists, because we know it won't match anything. Similarly, the next doc_id processed is 1023. When processing 1023 we can skip at least 10 documents (from 15 to something after 564), because we know it won't match anything.

The algorithm (for the AND query) is basically an array intersection. When you got a intersection you process it. Otherwise you keep skipping.

UPDATE: Many implementations use Skip Lists to avoid doing comparisons while processing inverted lists. In the example above, the system could use the skip list to "jump" to the next position of CD inverted list that has a doc_id close to 10. That way it won't need to compare with 6 and 8.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文