Faster algorithm for string comparison in C#
I have two sentences that need to be compared to each other.
The final result is the percentage of one sentence's words that the other sentence contains; my problem is that I have 100,000 records that need to be compared with, let's say, another 10.
That is 1,000,000 loops, which with my algorithm is very slow.
This is the algorithm that I am using:
private double BreakStringsAndCheck(string s1, string s2)
{
    // Nothing to compare if either sentence is missing or empty.
    if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
        return (double)0;

    string[] firstArray = s1.Split(' ');
    string[] secondArray = s2.Split(' ');

    // Make sure firstArray is the shorter of the two word arrays.
    if (firstArray.Length > secondArray.Length)
    {
        string[] tempArray = firstArray;
        firstArray = secondArray;
        secondArray = tempArray;
    }

    // Add 100 for every matching pair of words.
    double value = 0;
    for (int i = 0; i < firstArray.Length; i++)
        for (int j = 0; j < secondArray.Length; j++)
            value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;

    // findLongest is presumably a field on the containing class (not shown here).
    return findLongest ? value : value / firstArray.Length;
}
It's a small method, but it is not very fast. From my testing I can do 40-60 comparisons per second, which is almost 5 hours for 1,000,000 loops.
Can someone think of another method or logic that is much faster than this?
Update:
I will try to explain the problem in more detail.
I have a database with more than 100,000 records, and every day I insert and compare 10-20 new records against this database.
These records are sentences of 2 to 10 words, and I need to write a fast method that compares the new records with those in the database; the result should be the percentage of words of one sentence that are contained in the other.
I need the records that have more than a 70% word match.
I hope that I'm clear now.
8 Answers
I'm not a C# programmer, but here are a few general tips:
Avoid the repeated calls to split if you can. Basically, remove any extra memory allocations.
The final thought is to grab an algorithms book or google for text-processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP v3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.
Have you looked at the Intersect method as an alternative? I have no idea about its performance, but it looks like it may work.
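As a rough sketch of that suggestion (the class and method names, and the choice to divide by the shorter sentence, are my assumptions rather than part of the comment):

using System;
using System.Linq;

static class IntersectExample
{
    // Sketch: split both sentences once, intersect the word arrays, and
    // report the overlap as a percentage of the shorter sentence.
    public static double WordOverlapPercent(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
            return 0.0;

        string[] words1 = s1.Split(' ');
        string[] words2 = s2.Split(' ');

        // Intersect yields the distinct words the two sentences share.
        int common = words1.Intersect(words2).Count();

        return 100.0 * common / Math.Min(words1.Length, words2.Length);
    }
}

Note that Intersect counts repeated words only once, so the result can differ from the question's nested-loop score when a sentence contains duplicates.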
Personally I'd avoid creating the two arrays; the memory allocations will kill performance.
Try looking at the string.IndexOf function to find where the next space is in both strings, subtract that from the previous space location to work out the word length. If the two lengths are equal then use string.Compare to see if the two sub-strings are equal. This will avoid memory allocations and only iterate through the strings once, so should be faster.
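A sketch of that allocation-free idea, assuming words are separated by single spaces as in the question's code (the class and method names, and the decision to count each word of s1 at most once, are my own):

using System;

static class AllocationFreeCompare
{
    // Sketch: walk both strings with IndexOf instead of Split, and compare
    // candidate words in place with string.Compare, so no arrays are allocated.
    public static double MatchPercent(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
            return 0.0;

        int wordCount = 0;
        int matches = 0;

        int start1 = 0;
        while (start1 < s1.Length)
        {
            int end1 = s1.IndexOf(' ', start1);
            if (end1 < 0) end1 = s1.Length;
            int len1 = end1 - start1;
            wordCount++;

            // Scan s2 for a word of the same length with the same characters.
            int start2 = 0;
            while (start2 < s2.Length)
            {
                int end2 = s2.IndexOf(' ', start2);
                if (end2 < 0) end2 = s2.Length;
                int len2 = end2 - start2;

                if (len1 == len2 &&
                    string.Compare(s1, start1, s2, start2, len1, StringComparison.Ordinal) == 0)
                {
                    matches++;
                    break; // count each word of s1 at most once
                }
                start2 = end2 + 1;
            }

            start1 = end1 + 1;
        }

        return wordCount == 0 ? 0.0 : 100.0 * matches / wordCount;
    }
}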
Also, as others have mentioned, definitely look at using the Parallel extensions.
Here's a different approach. I'm guessing that when you compare 10 sentences against 100'000 sentences, there are going to be a large number where no words match and % = 0. Instead of always performing 100'000 comparisons, find those sentences in the 100'000 where at least one word matches and only compare them.
Create (once) a dictionary of all the words in the 100'000 sentences.
Each entry is a list L of sentences containing this word.
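A rough sketch of building such an index (the Dictionary<string, List<int>> shape and all names are my assumptions):

using System;
using System.Collections.Generic;

static class WordIndex
{
    // Sketch: map each word to the list of sentence ids that contain it.
    // Built once over the 100,000 stored sentences.
    public static Dictionary<string, List<int>> Build(IList<string> sentences)
    {
        var index = new Dictionary<string, List<int>>();
        for (int id = 0; id < sentences.Count; id++)
        {
            foreach (string word in sentences[id].Split(' '))
            {
                if (!index.TryGetValue(word, out List<int> ids))
                {
                    ids = new List<int>();
                    index[word] = ids;
                }
                ids.Add(id);
            }
        }
        return index;
    }
}

With the index in place, a new sentence only needs to be scored against the rows that share at least one word with it, which for 2-10 word sentences should be a small fraction of the 100,000 records.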
Try this.
Before performing any comparisons, preprocess the 100,000 rows.
Every word in the 100,000 rows is going to be a key in a Dictionary<> object; the value is going to be a list of ids (the ids of each row that the word occurs on), e.g.
When "searching for a match", you keep a second dictionary; this one is keyed by row id, and its value is an integer you'll increment, e.g.
You split the search string into words, and for each row id for each word you increment the value for that row id.
The row id with the greatest value is the best match.
It'll take some time up front to build the dictionary (but I'd guess not much more than for a single comparison), but it will be blindingly fast after that.
NB: The key to the speed here is that Dictionary will use the HashCode of the key it's storing, and the .net hash function for strings is excellent.
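A sketch of the search step, assuming a word-to-row-id index like the one described above has already been built (all names here are illustrative):

using System;
using System.Collections.Generic;

static class MatchScorer
{
    // Sketch: given the word -> row-id index, count how many of the new
    // sentence's words occur in each stored row. Rows that never appear in
    // the counts dictionary share no words at all.
    public static Dictionary<int, int> Score(
        Dictionary<string, List<int>> index, string newSentence)
    {
        var counts = new Dictionary<int, int>();
        foreach (string word in newSentence.Split(' '))
        {
            if (!index.TryGetValue(word, out List<int> rowIds))
                continue;                       // word appears in no stored row

            foreach (int rowId in rowIds)
            {
                counts.TryGetValue(rowId, out int current);
                counts[rowId] = current + 1;    // one more shared word
            }
        }
        return counts;                          // greatest value = best match
    }
}

To apply the question's 70% threshold, divide each count by the word count of the shorter of the two sentences and keep the rows where the ratio is at least 0.7.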
Update
If pre-processing on this scale takes too long, then you can do a lighter pre-process.
As you read each of the 100,000 rows, split it into words, and sort the array of words. Then as you compare, split the string to compare and sort it also.
Your function then saves time because it isn't splitting each string multiple times, and your nested loops can be replaced with a single loop of min(words1.Length, words2.Length) iterations.
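A sketch of that lighter approach, assuming the word arrays were split and sorted ordinally when the rows were loaded (the helper name is illustrative):

using System;

static class SortedWordCompare
{
    // Sketch: words1 and words2 are assumed to be pre-split and pre-sorted
    // (e.g. Array.Sort with StringComparer.Ordinal). A single merge-style
    // walk then counts the words the two sentences share.
    public static double MatchPercent(string[] words1, string[] words2)
    {
        int i = 0, j = 0, matches = 0;
        while (i < words1.Length && j < words2.Length)
        {
            int cmp = string.CompareOrdinal(words1[i], words2[j]);
            if (cmp == 0) { matches++; i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        int shorter = Math.Min(words1.Length, words2.Length);
        return shorter == 0 ? 0.0 : 100.0 * matches / shorter;
    }
}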
As the data is in the database, can you not do the work in the database?
Shred the sentences into words against the sentence row.
Join your words against the shredded words. This should allow you to see which sentences have a matching word.
If you then group and sum them by the sentence id you should get the sum of words that match in the specified sentence against stored sentences.
I would look to shredding your data beforehand. Use them as indexes against your main sentence table.
Intersect example
I would prefer to return the ratio 0.4 instead of 40.0 for:
I just realized that your algorithm always compares the shorter string to the longer. So your algorithm would return 40.0 even if the input parameters are switched like this
but my intersect example will return 18.18. I feel that this is more correct but if you really want your way then just add
to the beginning of the method.
Presplitting
...
Then loop over all your 100,000 strings in a Parallel.For.
PS. I think that you will have to downcase and remove '.', ',' and so on from the strings to get a more correct ratio. DS.
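A sketch of the presplitting plus Parallel.For idea (the helper names, and the use of Intersect for the per-pair comparison, are my assumptions):

using System;
using System.Linq;
using System.Threading.Tasks;

static class ParallelCompare
{
    // Sketch: split the 100,000 stored sentences once up front, then use
    // Parallel.For to compare one new sentence against all of them.
    public static double[] CompareAgainstAll(string[] storedSentences, string newSentence)
    {
        // Presplit (do this once and cache it, since the stored set rarely changes).
        string[][] storedWords = storedSentences
            .Select(s => s.Split(' '))
            .ToArray();

        string[] newWords = newSentence.Split(' ');
        var results = new double[storedSentences.Length];

        Parallel.For(0, storedSentences.Length, i =>
        {
            int common = storedWords[i].Intersect(newWords).Count();
            int shorter = Math.Min(storedWords[i].Length, newWords.Length);
            results[i] = shorter == 0 ? 0.0 : 100.0 * common / shorter;
        });

        return results;
    }
}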
If you split the 10 records first, then you're finding a small number of strings in many larger strings. This seems to fit http://en.wikipedia.org/wiki/String_searching_algorithm#Algorithms_using_finite_set_of_patterns
and the Aho-Corasick algorithm might work well for you
How long are the records?
EDIT:
The swap of the two arrays is an unnecessary switcharound; your comparison is symmetric with respect to firstArray and secondArray.
instead, replace the return with
return findLongest ? value : (firstArray.Length > secondArray.Length) ? value/secondArray.length : value / firstArray.Length);
only with something more readable :)
UPDATE after question update
So you could pre-process the 100,000 (e.g. to hash the words)? And only 10-20 change per day so keeping the preprocessed data up to date would be easy.
You definitely need to do something that uses the relatively-static nature of the 100,000. Even if you did the pre-processing just once per day, you could do the comparison with all of last days' records, then use your current slowish approach for any others added since the last preprocessing run. From what you say, there will be at most 10-20 of those
I think either the hashing idea, or building an Aho-Corasick trie from the corpus, would give you much faster searching.
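A sketch of the hashing idea, using one HashSet of words per stored record (the class and method names are illustrative):

using System;
using System.Collections.Generic;

static class HashedRecords
{
    // Sketch: precompute a HashSet of words for each stored record. This is
    // cheap to keep up to date when only 10-20 records change per day.
    public static List<HashSet<string>> Preprocess(IEnumerable<string> sentences)
    {
        var sets = new List<HashSet<string>>();
        foreach (string sentence in sentences)
            sets.Add(new HashSet<string>(sentence.Split(' ')));
        return sets;
    }

    // Percentage of the new sentence's words found in one stored record;
    // HashSet.Contains makes each word lookup O(1) on average.
    public static double MatchPercent(HashSet<string> storedWords, string[] newWords)
    {
        if (newWords.Length == 0) return 0.0;
        int matches = 0;
        foreach (string word in newWords)
            if (storedWords.Contains(word)) matches++;
        return 100.0 * matches / newWords.Length;
    }
}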