What is the difference between phrase queries and using a shingle filter?

Posted on 2024-12-21 23:11:14

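I am currently indexing web pages with Lucene. The goal is to be able to quickly extract which pages contain a given expression (usually 1, 2, or 3 words), and which other words (or groups of 1 to 3 words) those pages also contain. This will be used to build, enrich, or modify a thesaurus (a fixed vocabulary).

From the articles I have found, the problem seems to be one of finding n-grams (or shingles).

Lucene has a ShingleFilter, a ShingleMatrixFilter, and a ShingleAnalyzerWrapper, which seem relevant to this task.

From this presentation, I learned that Lucene can also search for terms separated by a fixed number of words (called slop). An example is given here.

However, the difference between these approaches is not clear to me. Are they fundamentally different, or is it a performance/index-size choice you have to make?

What is the difference between ShingleMatrixFilter and ShingleFilter?

Hopefully a Lucene guru will find this question and answer it ;-)!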


可可 2024-12-28 23:11:14

The differences between using phrase queries and using shingles mainly involve performance and scoring.

When you run a phrase query (say "foo bar") in the typical case where only single words are in the index, the query has to walk the inverted index for "foo" and for "bar" to find the documents that contain both terms, then walk the positions lists within each of those documents to find the places where "foo" appears right before "bar".
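To make the two steps concrete, here is a minimal sketch in Python of a toy positional index. It is not Lucene's implementation or API; the function names (`build_positional_index`, `phrase_match`) and the whitespace tokenization are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized docs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms occur at consecutive positions.

    Assumes a non-empty phrase.
    """
    terms = phrase.lower().split()
    postings = [index.get(t, {}) for t in terms]
    # Step 1: intersect the postings lists (documents containing every term).
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Step 2: within each candidate document, walk the positions lists.
    hits = []
    for doc_id in sorted(candidates):
        later = [set(p[doc_id]) for p in postings[1:]]
        if any(all(start + i + 1 in s for i, s in enumerate(later))
               for start in postings[0][doc_id]):
            hits.append(doc_id)
    return hits

docs = {1: "foo bar baz", 2: "bar foo", 3: "foo qux bar"}
index = build_positional_index(docs)
print(phrase_match(index, "foo bar"))  # [1]
```

All three documents contain both terms, so step 1 keeps all of them; only the position check in step 2 narrows the result to document 1. That second pass is the extra work a phrase query pays for.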

This has some cost to both performance and scoring:

Positions (.prx) must be indexed and searched. This is like an additional "dimension" to the inverted index, and it increases indexing and search times.
Because only individual terms appear in the inverted index, no real "phrase IDF" is computed (this might not affect you). Instead, it is approximated based on the sum of the term IDFs.

On the other hand, if you use shingles, you are also indexing word n-grams. In other words, if you are shingling up to size 2, you will also have terms like "foo bar" in the index. This means that this phrase query will be parsed as a simple TermQuery, without using any positions lists. And since it's now a "real term", the phrase IDF will be exact, because we know exactly how many documents this "term" exists in.
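The scoring point can be illustrated with a small sketch, again a toy model in Python rather than Lucene code. The helper names (`shingles`, `doc_freq`, `idf`) are hypothetical, and the formula `log(N / df)` is only an illustrative IDF; Lucene's similarity classes use their own variants.

```python
import math

def shingles(tokens, max_size=2):
    """Emit unigrams plus word n-grams up to max_size, joined by a space."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out

def doc_freq(docs, max_size=2):
    """Map each indexed term (including shingles) to its document frequency."""
    df = {}
    for text in docs.values():
        for term in set(shingles(text.lower().split(), max_size)):
            df[term] = df.get(term, 0) + 1
    return df

def idf(df, n_docs):
    # Illustrative formula only (an assumption, not Lucene's exact scoring).
    return math.log(n_docs / df)

docs = {1: "foo bar baz", 2: "bar foo", 3: "foo qux bar", 4: "qux baz"}
df = doc_freq(docs)

# Without shingles: approximate the phrase IDF by summing the term IDFs.
approx = idf(df["foo"], len(docs)) + idf(df["bar"], len(docs))
# With shingles: "foo bar" is a real term with its own document frequency.
exact = idf(df["foo bar"], len(docs))

print(shingles("foo bar baz".split()))
print(df["foo"], df["bar"], df["foo bar"])
print(approx, exact)
```

In this toy corpus "foo" and "bar" each occur in three documents, but the shingle "foo bar" occurs in only one, so the summed term IDFs understate how rare the phrase actually is.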

But using shingles has some costs as well:

Increased term dictionary, term index, and postings list sizes, though this might be a fair tradeoff, especially if you disable positions entirely with Field.setIndexOptions.
Some additional cost during the analysis phase of indexing, although ShingleFilter is optimized nicely and is pretty fast.
No obvious way to compute "sloppy phrase queries" or inexact phrase matches, although this can be approximated. For example, for a phrase of "foo bar baz" with shingles of size 2, you will have two tokens, foo_bar and bar_baz, and you could implement the search via some of Lucene's other queries (like BooleanQuery) for an inexact approximation.
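One way to picture that last approximation is the sketch below. It is a toy model in Python, not Lucene's BooleanQuery; the names `bigrams`, `build_shingle_index`, `approx_phrase`, and the `min_should_match` parameter are assumptions for illustration. Each shingle of the phrase acts like an optional clause, and documents are ranked by how many clauses they satisfy.

```python
def bigrams(tokens):
    """Size-2 shingles, joined with '_' to mirror the foo_bar notation."""
    return ["_".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def build_shingle_index(docs):
    """Map each bigram shingle to the set of doc ids containing it (no positions)."""
    index = {}
    for doc_id, text in docs.items():
        for sh in bigrams(text.lower().split()):
            index.setdefault(sh, set()).add(doc_id)
    return index

def approx_phrase(index, phrase, min_should_match=1):
    """Rank docs by how many of the phrase's shingles they contain."""
    counts = {}
    for sh in bigrams(phrase.lower().split()):
        for doc_id in index.get(sh, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    # Most matching shingles first; ties broken by doc id.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, c) for d, c in ranked if c >= min_should_match]

docs = {1: "foo bar baz", 2: "foo bar qux baz", 3: "baz bar foo", 4: "x bar baz"}
index = build_shingle_index(docs)
print(approx_phrase(index, "foo bar baz"))                      # [(1, 2), (2, 1), (4, 1)]
print(approx_phrase(index, "foo bar baz", min_should_match=2))  # [(1, 2)]
```

This stays an approximation even when every shingle is required: a document such as "foo bar x bar baz" contains both foo_bar and bar_baz without containing the phrase itself.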

In general, indexing word n-grams with things like Shingles or CommonGrams is just a tradeoff (a fairly expert one) to reduce the cost of positional queries or to enhance phrase scoring.

But there are real-world use cases for this stuff. A good example is available here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
