Lucene: Multi-word phrases as search term

Posted on 2024-12-29 16:59:01

I'm trying to make a searchable phone/local business directory using Apache Lucene.

I have fields for street name, business name, phone number etc. The problem that I'm having is that when I try to search by street where the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I try to search with just one word, e.g. 'crescent', I get all the results that I want.

I'm indexing the data with the following:

String LocationOfDirectory = "C:\\dir\\index";

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
// SimpleFSDirectory takes a java.io.File, not a String
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);

Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));

w.addDocument(doc);
w.close();

My searches work like this:

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:

String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
    p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);

However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and @), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and @ are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.

I'm beginning to go slightly mad, does anyone know what I'm doing wrong?

Comments (5)

昔日梦未散 2025-01-05 16:59:01

The reason why you don't get your documents back is that while indexing you're using StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query. The same goes for phrase queries in your scenario.
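
To make the mismatch concrete, here is a minimal plain-Java sketch that only imitates the analysis steps described above (split into words, lowercase, drop stop words). It does not use Lucene: the class name `AnalysisMismatchSketch`, the `analyze` helper, and the tiny stop-word set are assumptions made for illustration, not Lucene's actual implementation or stop-word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisMismatchSketch {
    // Hypothetical stop-word set, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Imitates index-time analysis: split on non-letter/non-digit characters,
    // lowercase each token, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Terms that end up "in the index" for the street field.
        Set<String> indexedTerms = new HashSet<>(analyze("the crescent"));
        System.out.println(indexedTerms);

        // An un-analyzed query term is compared verbatim with the indexed terms.
        System.out.println(indexedTerms.contains("the crescent"));

        // Running the query text through the same analysis yields terms that are indexed.
        System.out.println(indexedTerms.containsAll(analyze("The Crescent")));
    }
}
```

In this sketch the verbatim lookup fails while the analyzed lookup succeeds, which is the same kind of mismatch as between the analyzed index and the un-analyzed WildcardQuery term.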

KeywordAnalyzer is probably not very suitable for your use case, because it takes the whole field content as a single token. You can use SimpleAnalyzer for the street field -- it will split the input on all non-letter characters and then convert the tokens to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try different options and work out what works best for your data and users.

Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing the analyzer for that field breaks other searches.
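
The per-field idea can be sketched in plain Java as a lookup from field name to an analysis function with a fallback. This is not PerFieldAnalyzerWrapper itself: the class `PerFieldAnalysisSketch`, the `simpleStyle` and `keywordStyle` helpers, and the "Phone" field name are hypothetical and only mimic the behaviours described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class PerFieldAnalysisSketch {
    // Imitates the described SimpleAnalyzer behaviour: split on non-letters, lowercase.
    static List<String> simpleStyle(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Imitates the described KeywordAnalyzer behaviour: the whole value is one token.
    static List<String> keywordStyle(String text) {
        return Collections.singletonList(text);
    }

    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldAnalysisSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Registers a field-specific analysis function.
    void register(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Uses the field-specific function if one is registered, otherwise the fallback.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analysis =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::keywordStyle);
        analysis.register("Street", PerFieldAnalysisSketch::simpleStyle);

        System.out.println(analysis.analyze("Street", "The Crescent"));
        System.out.println(analysis.analyze("Phone", "555-0100"));
    }
}
```

The point of the design is that changing how "Street" is analyzed leaves every other field on its previous analysis, so searches on those fields keep working.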

紫竹語嫣☆ 2025-01-05 16:59:01

I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations that I saw online showed that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.

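This works for the "the crescent" example because the QueryParser runs the query text through the same StandardAnalyzer. The analyzer removes the word "the" from the street "the crescent" during indexing, so we can't search for it because it isn't in the index; with the QueryParser, "the" is removed from the query as well, leaving only "crescent".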

However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box functionality, namely that the query will return all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that its default operator is AND instead of OR.
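
The difference between the two default operators can be illustrated with a small plain-Java sketch over sets of terms. This is not how QueryParser is implemented; the class `DefaultOperatorSketch`, its methods, and the street names other than "Grove Road" are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DefaultOperatorSketch {
    // OR semantics: a document matches if it contains at least one query term.
    static boolean matchesAny(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: a document matches only if it contains every query term.
    static boolean matchesAll(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("grove", "road");

        Set<String> groveRoad = new HashSet<>(Arrays.asList("grove", "road"));
        Set<String> stationRoad = new HashSet<>(Arrays.asList("station", "road"));

        // With OR, both streets match because both contain "road".
        System.out.println(matchesAny(groveRoad, query) + " " + matchesAny(stationRoad, query));

        // With AND, only the street containing both terms matches.
        System.out.println(matchesAll(groveRoad, query) + " " + matchesAll(stationRoad, query));
    }
}
```

Note that AND only requires every term to be present; it does not require the terms to be adjacent or in order, which is what a phrase query would add.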

In the end, the correct solution was the following:

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);

Query q = qp.parse("grove road");

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

星星的轨迹 2025-01-05 16:59:01

@RikSaunderson's solution for searching documents in which all subqueries of a query have to occur still works with Lucene 9.

QueryParser queryParser = new QueryParser(LuceneConstants.CONTENTS, new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);

慕烟庭风 2025-01-05 16:59:01

If you want an exact match on the whole street name, you could index the "Street" field as NOT_ANALYZED, so that the stop word "the" is not filtered out.

doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.NOT_ANALYZED));

街道布景 2025-01-05 16:59:01

There is no need to use any Analyzer here, because Hibernate implicitly uses StandardAnalyzer, which splits the words on white space. The solution here is to set analyze to Analyze.NO, which will automatically perform a multi-phrase search.

@Column(name="skill")
@Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
@Analyzer(definition="SkillsAnalyzer")
private String skill;