C# 具有多个输入的高效子字符串

发布于 2024-07-29 12:59:47 字数 833 浏览 3 评论 0原文

假设我不想使用外部库或十几行左右的额外代码（即清晰的代码，不是代码高尔夫代码），我可以做得更好吗string.Contains 来处理输入字符串的集合和要检查的关键字的集合？

显然，可以使用 objString.Contains(objString2) 进行简单的子字符串检查。然而，有许多众所周知的算法在特殊情况下能够比这做得更好，特别是在处理多个字符串时。但是将这样的算法粘贴到我的代码中可能会增加长度和复杂性，所以我宁愿使用某种基于内置函数的快捷方式。

例如，输入可以是字符串的集合、肯定关键字的集合和否定关键字的集合。输出将是第一个关键字集合的子集，所有关键字都至少有 1 个肯定关键字，但有 0 个否定关键字。

哦，请不要提及正则表达式作为建议的解决方案。

我的要求可能是相互排斥的（没有太多额外的代码，没有外部库或正则表达式，比 String.Contains 更好），但我想我会问。

编辑：

很多人只提供愚蠢的改进，这些改进不会比智能使用的 contains 调用强太多（如果有的话）。有些人试图更智能地调用 Contains，这完全没有抓住我的问题的重点。这是一个要尝试解决的问题的示例。 LBushkin 的解决方案是某人提供的解决方案可能渐近优于标准包含的解决方案的一个示例：

假设您有 10,000 个长度为 5-15 个字符的肯定关键字、0 个否定关键字（这似乎使人们感到困惑）和 1 个 1,000,000 个字符串。检查这 1,000,000 个字符串是否至少包含 1 个肯定关键字。

我认为一种解决方案是创建 FSA。另一个是分隔空格并使用散列。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

绻影浮沉 2024-08-05 12:59:47

您对“否定和肯定”关键字的讨论有些令人困惑 - 可以使用一些澄清来获得更完整的答案。

与所有与性能相关的问题一样 - 您应该首先编写简单的版本，然后对其进行分析以确定瓶颈在哪里 - 这些可能不直观且难以预测。话虽如此......

优化搜索的一种方法可能（如果您总是搜索“单词” - 而不是可能包含空格的短语）是从您的字符串构建搜索索引。

搜索索引可以是排序数组（用于二分搜索）或字典。字典可能会更快 - 因为字典内部是具有 O(1) 查找的哈希图，并且字典自然会消除搜索源中的重复值 - 从而减少需要执行的比较次数。

一般搜索算法是：

对于要搜索的每个字符串：

获取要搜索的字符串并将其标记为单个单词（由空格分隔）
将标记填充到搜索索引（排序数组或字典）
搜索索引对于您的“否定关键字”，如果找到，请跳到下一个搜索字符串在
索引中搜索您的“肯定关键字”，当找到一个时，将其添加到字典中（您还可以跟踪频率的计数）

这是在 C# 2.0 中使用排序数组和二分搜索的示例：

注意：您可以从 string[] 切换到 Liststring> 很容易，我把它留给你。

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords, 
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster 
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string...
       foreach( string negKeyword in negativeKeywords )
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
               continue; // skip to next search string...

       // for each positive keyword, add to dictionary to keep track of it
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeyWords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1; 
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
   return foundKeywordsArray;
}

这是一个在 C# 3.0 中使用基于字典的索引和 LINQ 的版本：

注意：这不是最 LINQ-y 的方法是，我可以使用 Union() 和 SelectMany() 将整个算法编写为单个大 LINQ 语句 - 但我发现这更容易理解。

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
    {
        var foundKeywordsDict = new Dictionary<string, int>();
        foreach( var searchIn in searchStrings )
        {
            // tokenize the search string...
            var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
            // skip if any negative keywords exist...
            if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
                continue;
            // merge found positive keywords into dictionary...
            // an example of where Enumerable.ForEach() would be nice...
            var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
            foreach (var keyword in found)
                foundKeywordsDict[keyword] = 1;
        }
        return foundKeywordsDict.Keys;
    }

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.

As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...

One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.

The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.

The general search algorithm is:

For each string you are searching against:

Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)

Here's an example using a sorted array and binary search in C# 2.0:

NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords, 
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster 
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string...
       foreach( string negKeyword in negativeKeywords )
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
               continue; // skip to next search string...

       // for each positive keyword, add to dictionary to keep track of it
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeyWords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1; 
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
   return foundKeywordsArray;
}

Here's a version that uses a dictionary-based index and LINQ in C# 3.0:

NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
    {
        var foundKeywordsDict = new Dictionary<string, int>();
        foreach( var searchIn in searchStrings )
        {
            // tokenize the search string...
            var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
            // skip if any negative keywords exist...
            if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
                continue;
            // merge found positive keywords into dictionary...
            // an example of where Enumerable.ForEach() would be nice...
            var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
            foreach (var keyword in found)
                foundKeywordsDict[keyword] = 1;
        }
        return foundKeywordsDict.Keys;
    }

回复收藏 0 原文

清醇 2024-08-05 12:59:47

如果添加此扩展方法：

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

那么这将成为一行语句：

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

这不一定比执行 contains 检查更快，只是它会高效地执行这些检查，因为 LINQ 的结果流可防止任何不必要的 contains 调用...另外，生成的代码是单行代码，这很好。

If you add this extension method:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

Then this becomes a one line statement:

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.

回复收藏 0 原文

冷…雨湿花 2024-08-05 12:59:47

如果您确实只是在寻找以空格分隔的单词，那么此代码将是一个非常简单的实现：

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

如果您只想得到是/否答案（正如我现在所看到的可能是这种情况），那么还有另一种哈希集“重叠”方法这可能对此进行了更好的优化：

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }

If you're truly just looking for space-delimited words, this code would be a very simple implementation:

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }

回复收藏 0 原文

月朦胧 2024-08-05 12:59:47

好吧，您可以对字符串调用 Split() 方法。您可以使用 Split() 将输入字符串拆分为单词数组，然后使用关键字对单词进行一对一检查。然而，我不知道这是否或在什么情况下会比使用 Contains() 更快。

回复收藏 0 原文

一百个冬季 2024-08-05 12:59:47

首先去掉所有包含否定词的字符串。我建议使用 Contains 方法来执行此操作。我认为 Contains() 比分割、排序和搜索更快。

回复收藏 0 原文

帅的被狗咬 2024-08-05 12:59:47

在我看来，最好的方法是获取匹配字符串（正数和负数）并计算它们的哈希值。然后遍历一百万个字符串计算 n 个哈希值（在您的情况下，长度为 5-15 的字符串为 10），并与匹配字符串的哈希值进行匹配。如果获得哈希匹配，则进行实际的字符串比较以排除误报。有许多好方法可以通过按长度对匹配字符串进行存储并根据特定存储桶的字符串大小创建哈希来对此进行优化。

所以你会得到类似的东西：

IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length >= inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}

Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.

So you get something like:

IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length >= inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}

回复收藏 0 原文