Lucene and special characters

I am using Lucene.Net 2.0 to index some fields from a database table. One of the fields is a 'Name' field which allows special characters. When I perform a search, it does not find my document that contains a term with special characters.

I index my field as such:

Directory DALDirectory = FSDirectory.GetDirectory(@"C:\Indexes\Name", false);
Analyzer analyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED));
indexWriter.AddDocument(doc);

indexWriter.Optimize();
indexWriter.Close();

And I search by doing the following:

value = value.Trim().ToLower();
value = QueryParser.Escape(value);

Query searchQuery = new TermQuery(new Term(field, value));
Searcher searcher = new IndexSearcher(DALDirectory);

TopDocCollector collector = new TopDocCollector(searcher.MaxDoc());
searcher.Search(searchQuery, collector);
ScoreDoc[] hits = collector.TopDocs().scoreDocs;

If I perform a search with the field as 'Name' and the value as 'Test', it finds the document. If I perform the same search with the value as 'Test (Test)', it does not find the document.

Even stranger, if I remove the QueryParser.Escape line and do a search for a GUID (which, of course, contains hyphens), it finds documents where the GUID value matches, but performing the same search with the value 'Test (Test)' still yields no results.

I am unsure what I am doing wrong. I am using the QueryParser.Escape method to escape the special characters, and I am storing the field and searching according to Lucene.Net's examples.

Any thoughts?

Comments (2)

债姬 2024-09-07 07:55:17

StandardAnalyzer strips out the special characters during indexing. You can pass in an explicit list of stop words (leaving out the ones you want kept).
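
A minimal sketch of that, assuming the StandardAnalyzer(string[]) constructor overload in Lucene.Net 2.0; the stop-word list below is only an example, not the default set:

// Replace the default English stop-word list with an explicit one.
string[] stopWords = new string[] { "a", "an", "the" };   // example list only
Analyzer analyzer = new StandardAnalyzer(stopWords);

// Use this analyzer both when building the IndexWriter and when searching.
IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Keep in mind that the stop-word list only controls which whole tokens are dropped; StandardAnalyzer's tokenizer still splits on punctuation such as parentheses.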

阿楠 2024-09-07 07:55:17

While indexing, you have tokenized the field, so your input string produces two tokens, "test" and "test". For searching, you are constructing the query by hand, i.e. using TermQuery instead of QueryParser, which would have run the analyzer over the query text as well.
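
As a rough illustration (a sketch against the Lucene.Net 2.0 token API, reusing the "Name" field from the question), you can print the tokens StandardAnalyzer produces for the indexed value:

// Requires the Lucene.Net.Analysis and Lucene.Net.Analysis.Standard namespaces.
Analyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.TokenStream("Name", new System.IO.StringReader("Test (Test)"));

// Prints "test" twice: the parentheses are dropped and the text is lower-cased,
// so no indexed term ever equals the literal string "test (test)".
Token token;
while ((token = stream.Next()) != null)
{
    System.Console.WriteLine(token.TermText());
}
stream.Close();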

For a match on the entire value, you need to index the field as UN_TOKENIZED. Then the input string is treated as a single token, so the single indexed token is "Test (Test)", and your current search code will work. Watch the case of the input string carefully: if you index lower-cased text, you have to lower-case the value while searching as well.
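
A sketch of that variant, reusing the IndexWriter setup from the question (UN_TOKENIZED is the Lucene.Net 2.x name for an un-analyzed field):

// indexWriter and DALDirectory as set up in the question.
Document doc = new Document();
// UN_TOKENIZED indexes the whole value as one term, exactly as given.
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.UN_TOKENIZED));
indexWriter.AddDocument(doc);

// The hand-built TermQuery must then use the identical text and casing
// (so drop the ToLower()/Escape calls, or apply ToLower() on both sides).
Query searchQuery = new TermQuery(new Term("Name", "Test (Test)"));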

It is generally good practice to use the same analyzer during indexing and searching. You can use KeywordAnalyzer to generate a single token from the input string.
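
For example, a sketch using KeywordAnalyzer at index time (it emits the entire field value as a single, unmodified token), with the DALDirectory from the question:

Analyzer analyzer = new KeywordAnalyzer();   // whole field value becomes one token
IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED));
indexWriter.AddDocument(doc);
indexWriter.Close();

// The indexed term is exactly "Test (Test)", so a TermQuery with the same
// text and casing matches (no QueryParser.Escape needed for a TermQuery).
Query searchQuery = new TermQuery(new Term("Name", "Test (Test)"));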
