How to handle phrase queries and term grouping

Posted on 2024-11-17 00:38:53

I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.

The basic idea is to help people know where to look for information in the (rather
large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.

So far I have been able to successfully scrape the raw text from the booklets,
insert it into an index, and query it just fine using StandardAnalyzer on both
ends.

So here's my general question:
Many queries on the index will involve searching for place names mentioned in the
booklets. Some place names use notational variants. For instance, in the body text
it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and add the notational variants as synonyms.

My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
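To make the false-positive problem concrete, here is a small plain-Java sketch that compares a boolean-style match with an adjacency match on a token list. It does not use the Lucene API; the class name `AdjacencyDemo` and both methods are hypothetical, and the tokens are assumed to be already lowercased with punctuation stripped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: contrasts "both terms somewhere" with "terms side by side". */
public class AdjacencyDemo {

    // The three notational variants mentioned in the question.
    private static final Set<String> CREEK_VARIANTS =
            new HashSet<String>(Arrays.asList("cr", "ck", "creek"));

    /** Mimics (Ship AND (Cr Ck Creek)): both parts appear anywhere on the page. */
    public static boolean booleanMatch(List<String> tokens) {
        boolean hasVariant = false;
        for (String token : tokens) {
            if (CREEK_VARIANTS.contains(token)) {
                hasVariant = true;
            }
        }
        return tokens.contains("ship") && hasVariant;
    }

    /** Matches only when a variant comes immediately after "ship". */
    public static boolean adjacentMatch(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if ("ship".equals(tokens.get(i)) && CREEK_VARIANTS.contains(tokens.get(i + 1))) {
                return true;
            }
        }
        return false;
    }
}
```

A page tokenized as [ship],[anchored],[near],[bird],[creek] satisfies `booleanMatch` but not `adjacentMatch`, which is exactly the kind of false positive described above.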

So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.

For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would
result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along
the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
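The grouping step I have in mind could be sketched roughly as follows, in plain Java rather than as a real Lucene TokenFilter. Note that instead of expanding [ship creek] into every variant, this sketch normalizes each variant to one canonical term, which should match the same pages as long as the same step runs on both the indexed text and the query. The class `PlaceNameMerger`, the variant table, and the place list are all hypothetical, and the input is assumed to be lowercased tokens with stop words and punctuation already removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: merges known two-word place names into single terms. */
public class PlaceNameMerger {

    // Hypothetical variant table: each abbreviation maps to a canonical suffix.
    private static final Map<String, String> SUFFIX_VARIANTS = new HashMap<String, String>();
    static {
        SUFFIX_VARIANTS.put("creek", "creek");
        SUFFIX_VARIANTS.put("cr", "creek");
        SUFFIX_VARIANTS.put("ck", "creek");
    }

    // Known place names in canonical form, e.g. "ship creek".
    private final Set<String> knownPlaces;

    public PlaceNameMerger(Set<String> knownPlaces) {
        this.knownPlaces = knownPlaces;
    }

    /** Returns the tokens with known place names collapsed to one canonical term. */
    public List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            if (i + 1 < tokens.size()) {
                // Look one token ahead for "creek" or one of its abbreviations.
                String canonicalSuffix = SUFFIX_VARIANTS.get(tokens.get(i + 1));
                if (canonicalSuffix != null && knownPlaces.contains(current + " " + canonicalSuffix)) {
                    out.add(current + " " + canonicalSuffix);
                    i += 2; // consume both tokens
                    continue;
                }
            }
            out.add(current);
            i++;
        }
        return out;
    }
}
```

With "ship creek" in the place list, the tokens [allowed],[mouth],[ship],[cr],[upstream] would come out as [allowed],[mouth],[ship creek],[upstream], while [ship],[sails],[creek] would be left untouched.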

As a bonus it would be nice to treat the trickier text "...except in Ship, Bird, and
Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
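For the list case, one possible approach is to look backwards from the plural word and attach the suffix to every preceding token that forms a known place name. The following is only a sketch under several assumptions: the class `PluralListExpander` is hypothetical, only the plural "creeks" is handled, stop words such as "in" and "and" are assumed to have been removed already, and the place list is the exhaustive list mentioned later in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: expands "ship, bird, campbell creeks" into three place terms. */
public class PluralListExpander {

    // Known place names in canonical form, e.g. "bird creek".
    private final Set<String> knownPlaces;

    public PluralListExpander(Set<String> knownPlaces) {
        this.knownPlaces = knownPlaces;
    }

    public List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (!"creeks".equals(token)) {
                out.add(token);
                continue;
            }
            // Walk back over the emitted tokens while each one forms a known place.
            int start = out.size();
            while (start > 0 && knownPlaces.contains(out.get(start - 1) + " creek")) {
                start--;
            }
            if (start == out.size()) {
                out.add(token); // nothing to attach to, keep the plural as-is
                continue;
            }
            // Rewrite each name in the run and drop the plural token itself.
            for (int j = start; j < out.size(); j++) {
                out.set(j, out.get(j) + " creek");
            }
        }
        return out;
    }
}
```

Given the three creek names in the place list, [except],[ship],[bird],[campbell],[creeks],[where],[limit] would become [except],[ship creek],[bird creek],[campbell creek],[where],[limit]. Relying on the place list rather than a pure heuristic keeps words like [except] from being swept into the run.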

This seems like a pretty basic use case, but it's not clear to me how I might use existing components from Lucene contrib or Solr to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?

Some of the term grouping can probably be done heuristically (any token [X] followed by [creek] becomes [X creek]),
but I also have an exhaustive list of places mentioned in the text if that helps.

Thanks for any help you can provide.
