Query term elimination
In the Boolean retrieval model a query consists of terms combined using different operators. Conjunction is the most obvious choice at first glance, but as query length grows, bad things happen: recall drops significantly with conjunction, and precision drops with disjunction (for example, stanford OR university).
Right now we use conjunction in our search system (and the Boolean retrieval model), and we run into trouble when a user enters a very rare word or a long sequence of words. For example, if a user enters toyota corolla 4wd automatic 1995, we probably have no document that contains all of those terms. But if we drop at least one word from the query, matching documents do exist. As far as I understand, in the Vector Space Model this problem is solved automatically: documents are not filtered on whether a term is present, they are ranked by the presence of terms.
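To make this concrete, here is a toy sketch of our current conjunctive matching plus the term-dropping fallback I have in mind (an in-memory inverted index with made-up postings, purely for illustration, not our actual system):

```python
from itertools import combinations

# Toy inverted index: term -> set of document ids (made-up data).
INDEX = {
    "toyota":    {1, 2, 3},
    "corolla":   {1, 2},
    "4wd":       {2, 4},
    "automatic": {1, 3},
    "1995":      {5},          # rare term: appears only in one unrelated document
}

def conjunctive_match(terms):
    """Documents containing every term (strict boolean AND)."""
    postings = [INDEX.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def match_with_term_dropping(terms):
    """If the full conjunction matches nothing, retry with progressively
    smaller subsets of terms and return the first non-empty result."""
    for k in range(len(terms), 0, -1):            # longest subsets first
        for subset in combinations(terms, k):
            docs = conjunctive_match(subset)
            if docs:
                return subset, docs
    return (), set()

query = ["toyota", "corolla", "4wd", "automatic", "1995"]
print(conjunctive_match(query))         # set() -- no document has all five terms
print(match_with_term_dropping(query))  # (('toyota', 'corolla', '4wd'), {2})
```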
So I'm interested in more advanced ways of combining terms in the Boolean retrieval model, and in methods for eliminating rare terms from queries in the Boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector whose components w_i are, for example, 0 if the i-th search term doesn't appear in the file and 1 if it does, or the number of times search term i appears in the file, and so on. Then rank pages based on e.g. Manhattan distance or Euclidean distance, sort in descending order, and possibly cull results whose distance falls below a specified match tolerance.
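As a rough sketch of that idea (just an illustration: documents are plain token lists, the binary-vs-count weighting and the Manhattan/Euclidean choice are the knobs mentioned above, and distance is measured from the all-zeros vector, so higher means a better match):

```python
import math

def weight_vector(query_terms, doc_tokens, binary=True):
    """w_i = 0/1 for presence of the i-th query term, or its raw count."""
    counts = [doc_tokens.count(t) for t in query_terms]
    return [min(c, 1) for c in counts] if binary else counts

def score(weights, metric="manhattan"):
    """Distance of the weight vector from the all-zeros vector:
    the further from zero, the more of the query the document covers."""
    if metric == "manhattan":
        return sum(weights)
    return math.sqrt(sum(w * w for w in weights))   # euclidean

def rank(query, docs, tolerance=1, metric="manhattan"):
    """Sort documents by score, culling those below the match tolerance."""
    terms = query.split()
    scored = [(score(weight_vector(terms, text.split()), metric), name)
              for name, text in docs.items()]
    return sorted([(s, n) for s, n in scored if s >= tolerance], reverse=True)

docs = {                       # hypothetical document collection
    "d1": "toyota corolla automatic",
    "d2": "toyota corolla 4wd automatic",
    "d3": "honda civic 1995",
}
print(rank("toyota corolla 4wd automatic 1995", docs))
# [(4, 'd2'), (3, 'd1'), (1, 'd3')] -- partial matches are ranked, not discarded
```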
If you want to handle more complex queries, you can put the query into CNF, e.g. (term1 OR term2 OR ... termn) AND (item1 OR item2 OR ... itemk) AND ..., and then redefine the weights w_i accordingly. You could list with each result the terms that failed to match in the file, so that the user at least knows how good a match it is.
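One possible reading of that, again only a sketch: represent the CNF query as a list of OR-clauses, count how many clauses a document satisfies, and report the clauses that failed so the user can see how good the match is.

```python
def cnf_match(cnf_query, doc_tokens):
    """cnf_query is a list of OR-clauses, each clause a list of terms.
    Returns (number of satisfied clauses, list of clauses that failed)."""
    doc = set(doc_tokens)
    satisfied, failed = 0, []
    for clause in cnf_query:
        if any(term in doc for term in clause):
            satisfied += 1
        else:
            failed.append(clause)
    return satisfied, failed

# (toyota) AND (corolla OR camry) AND (4wd OR awd) AND (1995)
query = [["toyota"], ["corolla", "camry"], ["4wd", "awd"], ["1995"]]
doc = "toyota corolla automatic 1997".split()
score, failed = cnf_match(query, doc)
print(score, failed)   # 2 [['4wd', 'awd'], ['1995']] -- shown to the user as unmatched terms
```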
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...