How do I search for a phrase query in an inverted index structure?

Posted on 2024-08-29 02:42:35

If we want to search for a phrase query such as "t1 t2 t3" (t1, t2, and t3 must appear consecutively, in that order) in an inverted index structure, which approach should we take?

1. First we look up the term "t1" and find all documents that contain it, then do the same for "t2" and "t3". Then we find the documents in which the positions of "t1", "t2", and "t3" are next to each other.

2. First we look up the term "t1" and find all documents that contain it. Then, within those documents, we search for "t2", and finally, within that result, we find the documents that contain "t3".

I have a full inverted index. I want to know which of the two approaches above is more efficient: (1) or (2)?

Thanks a lot.

溺深海 2024-09-05 02:42:35

As the Wikipedia entry explains well:

"There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created."
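
To make the two variants concrete, here is a minimal sketch that builds both from a tiny corpus. The function names, the whitespace tokenizer, and the sample documents are illustrative assumptions, not part of any particular search library.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    return text.lower().split()


def build_record_level(docs):
    """Record-level index ("inverted file"): term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def build_word_level(docs):
    """Word-level index ("inverted list"): term -> {doc id -> positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


# Hypothetical sample corpus.
docs = {1: "t1 t2 t3 t4", 2: "t3 t2 t1", 3: "t1 t5 t2 t3"}
record_index = build_record_level(docs)
word_index = build_word_level(docs)
```

The record-level index only says which documents contain a term, while the word-level index also records where, which is the extra information a phrase search needs.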

Since you don't tell us which variant you have, we can't answer your question precisely, but thinking through each possibility will help.

Opening and scanning documents is typically a costly operation unless your documents are unusually small, so you want to minimize it, and option (2) doesn't really do that. If you have an inverted list (a word-level index), option (1) doesn't require opening any document at all, because the positions are already in the index. If you only have an inverted file (a record-level index), you will inevitably need to open documents and scan them, since the index alone lacks the information to confirm word adjacency. But at least option (1) minimizes the number of documents you have to open and scan: only those in the intersection of the lists of documents containing each word.
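
As a rough sketch of option (1) on a word-level index, the function below intersects the per-term document lists and then checks adjacency using only the stored positions, so no document is opened. The name `phrase_search` and the index layout (term -> {doc id -> list of positions}) are assumptions made for illustration.

```python
def phrase_search(index, terms):
    """Return sorted doc ids in which `terms` occur consecutively, in order.

    `index` is assumed to map term -> {doc_id: [positions]}.
    """
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]

    # Step 1: keep only documents that contain every term.
    # Starting from the shortest postings list keeps the candidate set small.
    candidates = set(min(postings, key=len))
    for posting in postings:
        candidates.intersection_update(posting)

    # Step 2: for each candidate, look for a start position p such that
    # term k occurs at position p + k for every k.
    hits = []
    for doc_id in sorted(candidates):
        later = [set(posting[doc_id]) for posting in postings[1:]]
        for start in postings[0][doc_id]:
            if all(start + k + 1 in positions
                   for k, positions in enumerate(later)):
                hits.append(doc_id)
                break
    return hits


# Hypothetical word-level index for three small documents:
# 1: "t1 t2 t3 t4", 2: "t3 t2 t1", 3: "t1 t5 t2 t3"
index = {
    "t1": {1: [0], 2: [2], 3: [0]},
    "t2": {1: [1], 2: [1], 3: [2]},
    "t3": {1: [2], 2: [0], 3: [3]},
    "t4": {1: [3]},
    "t5": {3: [1]},
}
result = phrase_search(index, ["t1", "t2", "t3"])
```

All three documents contain every term, but only document 1 has them adjacent and in order, so only its id is returned.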

So, in either case, option (1) is more promising (unless your documents are unusually small).
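
For the record-level case, a sketch of the same idea might look like the following: intersect the document lists first, then open and scan only the surviving candidates. The `docs` dictionary stands in for whatever document store is actually used, and every name here is hypothetical.

```python
def phrase_search_record_level(index, docs, terms):
    """Return sorted doc ids containing `terms` as a consecutive phrase.

    `index` is assumed to map term -> set of doc ids (no positions),
    and `docs` to map doc id -> document text.
    """
    if not terms:
        return []

    # Step 1: intersect the doc-id sets so that as few documents as
    # possible have to be opened.
    candidates = set(index.get(terms[0], set()))
    for term in terms[1:]:
        candidates &= index.get(term, set())

    # Step 2: scan each remaining document for the exact token sequence.
    n = len(terms)
    hits = []
    for doc_id in sorted(candidates):
        tokens = docs[doc_id].lower().split()  # "opening" the document
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits


# Hypothetical corpus and its record-level index.
docs = {1: "t1 t2 t3 t4", 2: "t3 t2 t1", 3: "t1 t5 t2 t3", 4: "t1 t2"}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

result = phrase_search_record_level(index, docs, ["t1", "t2", "t3"])
```

Document 4 is never scanned because it lacks "t3", which is the saving the intersection step is meant to provide.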
