C# 具有多个输入的高效子字符串
假设我不想使用外部库或十几行左右的额外代码(即清晰的代码,不是代码高尔夫代码),我可以做得更好吗string.Contains
来处理输入字符串的集合和要检查的关键字的集合?
显然,可以使用 objString.Contains(objString2) 进行简单的子字符串检查。 然而,有许多众所周知的算法在特殊情况下能够比这做得更好,特别是在处理多个字符串时。 但是将这样的算法粘贴到我的代码中可能会增加长度和复杂性,所以我宁愿使用某种基于内置函数的快捷方式。
例如,输入可以是字符串的集合、肯定关键字的集合和否定关键字的集合。 输出将是第一个关键字集合的子集,所有关键字都至少有 1 个肯定关键字,但有 0 个否定关键字。
哦,请不要提及正则表达式作为建议的解决方案。
我的要求可能是相互排斥的(没有太多额外的代码,没有外部库或正则表达式,比 String.Contains 更好),但我想我会问。
编辑:
很多人只提供愚蠢的改进,这些改进不会比智能使用的 contains 调用强太多(如果有的话)。 有些人试图更智能地调用 Contains,这完全没有抓住我的问题的重点。 这是一个要尝试解决的问题的示例。 LBushkin 的解决方案是某人提供的解决方案可能渐近优于标准包含的解决方案的一个示例:
假设您有 10,000 个长度为 5-15 个字符的肯定关键字、0 个否定关键字(这似乎使人们感到困惑)和 1 个 1,000,000 个字符串。 检查这 1,000,000 个字符串是否至少包含 1 个肯定关键字。
我认为一种解决方案是创建 FSA。 另一个是分隔空格并使用散列。
Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains
to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2)
to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of keywords, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and 1 1,000,000 character string. Check if the 1,000,000 character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
您对“否定和肯定”关键字的讨论有些令人困惑 - 可以使用一些澄清来获得更完整的答案。
与所有与性能相关的问题一样 - 您应该首先编写简单的版本,然后对其进行分析以确定瓶颈在哪里 - 这些可能不直观且难以预测。 话虽如此......
优化搜索的一种方法可能(如果您总是搜索“单词” - 而不是可能包含空格的短语)是从您的字符串构建搜索索引。
搜索索引可以是排序数组(用于二分搜索)或字典。 字典可能会更快 - 因为字典内部是具有 O(1) 查找的哈希图,并且字典自然会消除搜索源中的重复值 - 从而减少需要执行的比较次数。
一般搜索算法是:
对于要搜索的每个字符串:
这是在 C# 2.0 中使用排序数组和二分搜索的示例:
注意:您可以从
string[]
切换到List
string>
很容易,我把它留给你。这是一个在 C# 3.0 中使用基于字典的索引和 LINQ 的版本:
注意:这不是最 LINQ-y 的方法是,我可以使用 Union() 和 SelectMany() 将整个算法编写为单个大 LINQ 语句 - 但我发现这更容易理解。
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.
The general search algorithm is:
For each string you are searching against:
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from
string[]
toList<string>
easily enough, I leave that to you.Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
如果添加此扩展方法:
那么这将成为一行语句:
这不一定比执行 contains 检查更快,只是它会高效地执行这些检查,因为 LINQ 的结果流可防止任何不必要的 contains 调用...另外,生成的代码是单行代码,这很好。
If you add this extension method:
Then this becomes a one line statement:
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
如果您确实只是在寻找以空格分隔的单词,那么此代码将是一个非常简单的实现:
如果您只想得到是/否答案(正如我现在所看到的可能是这种情况),那么还有另一种哈希集“重叠”方法这可能对此进行了更好的优化:
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
好吧,您可以对字符串调用 Split() 方法。 您可以使用 Split() 将输入字符串拆分为单词数组,然后使用关键字对单词进行一对一检查。 然而,我不知道这是否或在什么情况下会比使用 Contains() 更快。
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
首先去掉所有包含否定词的字符串。 我建议使用 Contains 方法来执行此操作。 我认为 Contains() 比分割、排序和搜索更快。
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster then splitting, sorting, and searching.
在我看来,最好的方法是获取匹配字符串(正数和负数)并计算它们的哈希值。 然后遍历一百万个字符串计算 n 个哈希值(在您的情况下,长度为 5-15 的字符串为 10),并与匹配字符串的哈希值进行匹配。 如果获得哈希匹配,则进行实际的字符串比较以排除误报。 有许多好方法可以通过按长度对匹配字符串进行存储并根据特定存储桶的字符串大小创建哈希来对此进行优化。
所以你会得到类似的东西:
Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
要优化 Contains(),您需要一个正面/负面单词的树(或 trie)结构。
这应该会加快一切速度(O(n) vs O(nm),n=字符串大小,m=平均字大小)并且代码相对较小& 简单的。
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.