How do I search for a phrase query in an inverted index structure?
Suppose we want to run a phrase query such as "t1 t2 t3" (t1, t2 and t3 must appear consecutively, in that order) against an inverted index structure. Which of these approaches should we use?
1- First look up the term "t1" and find all documents that contain it, then do the same for "t2" and for "t3". Then find the documents in which the positions of "t1", "t2" and "t3" are next to each other.
2- First look up the term "t1" and find all documents that contain it. Then search for "t2" within those documents and, within that result, find the documents that contain "t3".
I have a full inverted index. I want to know which of the approaches above is more efficient, (1) or (2)?
Thanks a lot.
Comments (1)
As the Wikipedia entry on inverted indexes explains well, there are two main variants: a record-level inverted index (an inverted file), which lists for each word the documents that contain it, and a word-level inverted index (an inverted list), which additionally stores the positions of each word within each document.
Since you don't tell us which variant you have, we can't answer your question precisely, but thinking through each possibility will help.
Opening and searching documents is typically a costly operation, unless your documents are unusually small, so you want to minimize it, and option (2) doesn't really do that. If you have an inverted list, option (1) doesn't require opening any document at all, because the stored positions are enough to confirm adjacency. If you only have an inverted file, you will inevitably need to open documents and scan them (since you otherwise lack the information to confirm word adjacency), but at least option (1) minimizes the number of documents you have to open and scan: only those in the intersection of the lists of documents containing each word.
So, in either case, option (1) is more promising (unless your documents are peculiarly small).
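To make option (1) concrete, here is a minimal Python sketch for the inverted-list (word-level) case. The names `build_positional_index` and `phrase_search`, the whitespace tokenization, and the sample documents are all assumptions for illustration, not part of the question; a real index would be stored on disk and use sorted posting lists rather than in-memory dicts.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build a word-level index: term -> {doc_id: [positions]}.

    Hypothetical helper; tokenization is a plain lowercase whitespace split.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase terms consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]

    # Step 1: intersect the document lists of all terms.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)

    # Step 2: confirm adjacency from stored positions only (no document is opened).
    hits = []
    for doc_id in candidates:
        later = [set(p[doc_id]) for p in postings[1:]]
        if any(all(start + k + 1 in s for k, s in enumerate(later))
               for start in postings[0][doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Made-up sample documents for illustration.
docs = {
    1: "t1 t2 t3 appear together here",
    2: "t1 then t3 and t2 in another order",
    3: "noise t2 t1 t2 t3",
    4: "t1 t2 only",
}
index = build_positional_index(docs)
print(phrase_search(index, "t1 t2 t3"))  # → [1, 3]
```

With only an inverted file (no positions), step 1 stays the same and step 2 would have to open and scan each candidate document instead, which is why keeping the candidate set small matters.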