How do I distinguish the Lucene.net boolean AND, OR, and NOT operators from the ordinary words and, or, and not?

Posted 2024-11-24 03:05:22


In my project I am implementing a full-text index search using Lucene. While doing this, I got stuck on the logic for differentiating the Lucene boolean operators from the ordinary words and, or, and not.

Suppose, for example, that we search for "I want a pen and pencil". By default, Lucene.net combines the terms with its OR operation, so it will search like "I OR want OR a OR pen OR pencil", not like what I would like to have: "I OR want OR a OR pen OR and OR pencil". So how can we differentiate an ordinary and, or, not from the Lucene operators?

For this I wrote a helper method, which looks like this:

    // Requires: using System; using System.Collections.Generic;
    //           using System.Linq; using System.Text.RegularExpressions;

    /// <summary>
    /// Method to get search predicates
    /// </summary>
    /// <param name="searchTerm">Search term</param>
    /// <returns>List of predicates: exact, AND, OR, and NOT, in that order</returns>
    public static IList<string> GetPredicates(string searchTerm)
    {
        //// Remove unwanted characters
        //searchTerm = Regex.Replace(searchTerm, "[<(.|\n)*?!'`>]", string.Empty);

        //// Exact search term
        string exactSearchTerm = "\"" + searchTerm.Trim() + "\"";

        //// Search term without keywords
        string searchTermWithOutKeywords = Regex.Replace(
            searchTerm, " and not | and | or ", " ", RegexOptions.IgnoreCase);

        //// Split keywords
        string[] splittedKeywords = searchTermWithOutKeywords.Trim().Split(
            new char[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

        //// OR search term
        string keywordOrSearchTerm = string.Join(" OR ", splittedKeywords);

        //// AND search term
        string andSearchTerm = string.Join(" AND ", splittedKeywords);

        //// NOT search term: for every segment that follows " and not ",
        //// keep only the text before the next " and " or " or "
        List<string> searchTerms = Regex
            .Split(searchTerm, " and not ", RegexOptions.IgnoreCase)
            .Skip(1)
            .Select(term => Regex.Split(term, " and | or ", RegexOptions.IgnoreCase)[0])
            .ToList();
        string notSearchTerm = searchTerms.Count > 0 ? string.Join(" , ", searchTerms) : "\"\"";

        return new List<string> { exactSearchTerm, andSearchTerm, keywordOrSearchTerm, notSearchTerm };
    }

But it returns four results, so I have to loop through my index four times, which seems very cumbersome. Can anybody give me a hand resolving this in a single loop?


亂 2024-12-01 03:05:22

As @Matt Warren suggested, Lucene has what are called "stop words", which usually add little value to search quality but make the index huge and bloated. Stop words such as "a, and, or, the, an" are usually filtered out of your text automatically as it is indexed, and then filtered out of your query when it is parsed. The StopFilter is responsible for this behavior in both cases, but you can pick an analyzer that does not use the StopFilter.
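To make the effect concrete, here is a toy sketch in plain C# of what stop-word filtering does to the query from the question. It does not use Lucene.net at all: the StopWordSketch class, the Analyze method, and the short stop list are hypothetical names made up for illustration, and a real analyzer does considerably more than split on spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordSketch
{
    // Hypothetical stop list for illustration only; it is not Lucene's built-in list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "or", "the" };

    // Toy analyzer: split on spaces, lower-case each token, optionally drop stop words.
    public static List<string> Analyze(string text, bool removeStopWords)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.ToLowerInvariant())
            .Where(token => !removeStopWords || !StopWords.Contains(token))
            .ToList();
    }

    public static void Main()
    {
        const string query = "I want a pen and pencil";

        // With stop-word filtering: prints "i want pen pencil"
        Console.WriteLine(string.Join(" ", Analyze(query, removeStopWords: true)));

        // Without stop-word filtering: prints "i want a pen and pencil"
        Console.WriteLine(string.Join(" ", Analyze(query, removeStopWords: false)));
    }
}
```

If the same filtering runs at index time and at query time, the word "and" never reaches the index or the query, which is why it seems to vanish from the search.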

The other issue is query parsing. If I remember correctly, the Lucene query parser only treats upper-case OR, AND, and NOT as keywords, so if the user types them in all capital letters, you'll need to replace them with lower-case versions so they are not treated as operators. Here's some Regex.Replace code for that:

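// Requires: using System.Text.RegularExpressions;
string queryString = "the red pencil and blue pencil are both not green or brown";
// Lower-case any standalone upper-case OR, AND, or NOT so it is read as a plain word
queryString =
    Regex.Replace(
        queryString,
        @"\b(?:OR|AND|NOT)\b",
        m => m.Value.ToLowerInvariant());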
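The sample string above contains no upper-case operators, so the replacement leaves it unchanged. As a small self-contained sketch, the same pattern can be wrapped in a helper and run on input that does contain them. The OperatorEscapeSketch class and NeutralizeOperators method are hypothetical names, not part of Lucene.net.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OperatorEscapeSketch
{
    // Lower-cases standalone upper-case OR, AND, and NOT.
    // The \b word boundaries keep longer words such as "ANDROID" intact.
    public static string NeutralizeOperators(string queryString)
    {
        return Regex.Replace(
            queryString,
            @"\b(?:OR|AND|NOT)\b",
            m => m.Value.ToLowerInvariant());
    }

    public static void Main()
    {
        // prints "pen and pencil not green"
        Console.WriteLine(NeutralizeOperators("pen AND pencil NOT green"));

        // prints "ANDROID or iOS"
        Console.WriteLine(NeutralizeOperators("ANDROID OR iOS"));
    }
}
```

Once lower-cased, these words are ordinary terms, so whether they are actually searched then depends on whether the chosen analyzer removes them as stop words.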
