How do I distinguish the Lucene.net boolean operators AND, OR, and NOT from the ordinary words and, or, and not?
In my project I am implementing full-text index search using Lucene. While doing so, I got stuck on how to distinguish Lucene's boolean operators from the ordinary words and, or, and not.
Suppose we search for "I want a pen and pencil". By default, Lucene.net combines the terms with OR, so it searches for "I OR want OR a OR pen OR pencil" rather than what I want, which is "I OR want OR a OR pen OR and OR pencil". So how can we tell an ordinary and, or, not apart from the Lucene operators?
For this I wrote a helper method, which looks like this:
    /// <summary>
    /// Method to get search predicates
    /// </summary>
    /// <param name="searchTerm">Search term</param>
    /// <returns>List of predicates</returns>
    public static IList<string> GetPredicates(string searchTerm)
    {
        //// Remove unwanted characters
        //searchTerm = Regex.Replace(searchTerm, "[<(.|\n)*?!'`>]", string.Empty);
        string exactSearchTerm = string.Empty,
               keywordOrSearchTerm = string.Empty,
               andSearchTerm = string.Empty,
               notSearchTerm = string.Empty,
               searchTermWithOutKeywords = string.Empty;

        //// Exact search term
        exactSearchTerm = "\"" + searchTerm.Trim() + "\"";

        //// Search term without keywords
        searchTermWithOutKeywords = Regex.Replace(
            searchTerm, " and not | and | or ", " ", RegexOptions.IgnoreCase);

        //// Split keywords
        string[] splittedKeywords = searchTermWithOutKeywords.Trim().Split(
            new char[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

        //// Or search term
        keywordOrSearchTerm = string.Join(" OR ", splittedKeywords);

        //// And search term
        andSearchTerm = string.Join(" AND ", splittedKeywords);

        //// Not search term: take everything after each " and not ",
        //// then keep only the part before the next " and " / " or "
        int index = 0;
        List<string> searchTerms = (from term in Regex.Split(
                                        searchTerm, " and not ", RegexOptions.IgnoreCase)
                                    where index++ != 0
                                    select term).ToList();
        searchTerms = (from term in searchTerms
                       select Regex.IsMatch(term, " and | or ", RegexOptions.IgnoreCase) ?
                           Regex.Split(term, " and | or ", RegexOptions.IgnoreCase).FirstOrDefault() :
                           term).ToList();
        notSearchTerm = searchTerms.Count > 0 ? string.Join(" , ", searchTerms) : "\"\"";

        return new List<string> { exactSearchTerm, andSearchTerm, keywordOrSearchTerm, notSearchTerm };
    }
But it returns four predicates, so I have to loop through my index four times, which seems very cumbersome. Can anybody help me do this in a single loop?
Comments (2)
As @Matt Warren suggested, Lucene has what are called "stop words", which usually add little to the quality of search but make the index huge and bloated. Stop words like "a, and, or, the, an" are usually filtered out of your text automatically as it is indexed, and then filtered out of your query when it is parsed. The StopFilter is responsible for this behavior in both cases, but you can pick an analyzer that does not use the StopFilter.
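To make the effect concrete, here is a minimal, self-contained sketch of what stop-word filtering does to the query from the question. It does not call Lucene at all; the class `StopWordDemo`, the method `RemoveStopWords`, and the five-word list (taken from the examples above) are assumptions for illustration, not the analyzer's real implementation or its full stop-word list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordDemo
{
    // Illustrative stop words only (the examples named above),
    // not the list an actual Lucene analyzer ships with.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "or", "the" };

    // Hypothetical helper: splits on spaces and drops stop words,
    // mimicking the effect a stop filter has on indexed text and on queries.
    public static string[] RemoveStopWords(string text)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !StopWords.Contains(token))
            .ToArray();
    }
}
```

With this sketch, `StopWordDemo.RemoveStopWords("I want a pen and pencil")` yields the tokens `I`, `want`, `pen`, `pencil`. That is why "and" never takes part in the search when the analyzer removes stop words, regardless of how the query string is built.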
The other issue is query parsing. If I remember correctly, the Lucene query parser only treats the capitalized words OR, AND, and NOT as keywords, so if the user types them in all capital letters, you'll need to replace them with lowercase so that they are not treated as operators. Here's some Regex.Replace code for that:
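A minimal sketch of such a replacement, assuming the goal is only to lower-case the standalone uppercase words AND, OR, and NOT before the text reaches the query parser. The class `QueryText` and the method `NeutralizeOperators` are hypothetical names, and the sketch does not handle symbolic operators or other special query syntax.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryText
{
    // Hypothetical helper: lower-cases whole-word AND / OR / NOT so they
    // read as ordinary words. The pattern is case-sensitive by default,
    // so words the user typed in lowercase are left untouched.
    public static string NeutralizeOperators(string userInput)
    {
        return Regex.Replace(
            userInput,
            @"\b(AND|OR|NOT)\b",
            match => match.Value.ToLowerInvariant());
    }
}
```

For example, `QueryText.NeutralizeOperators("pen AND pencil")` returns `pen and pencil`, while a word such as `ANDROID` is left alone because the `\b` word boundaries only match the operators when they stand as separate words.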
The built-in StandardAnalyzer will strip out common words for you; see this article for an explanation.