Faster algorithm for string comparison in C#

I have two sentences that need to be compared to each other.
The final result is the percentage of one sentence that is contained in the other. My problem is that I have 100.000 records that need to be compared with, let's say, another 10.
That is 1.000.000 loops, which with my algorithm is very slow.

This is the algorithm that I am using:

private double BreakStringsAndCheck(string s1, string s2)
{
    if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
        return (double)0;
    string[] firstArray = s1.Split(' ');
    string[] secondArray = s2.Split(' ');
    if (firstArray.Length > secondArray.Length)
    {
        string[] tempArray = firstArray;
        firstArray = secondArray;
        secondArray = tempArray;
    }
    double value = 0;
    for (int i = 0; i < firstArray.Length; i++)
        for (int j = 0; j < secondArray.Length; j++)
            value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;
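    // findLongest is presumably a bool field on the containing class (not shown in the question).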
    return findLongest ? value : value / firstArray.Length;
}

It's a small method, but it is not very fast. From my testing I can do 40-60 comparisons per second, which is almost 5 hours for 1.000.000 loops.

Can someone think of another method or logic that is much faster than this?

Update:

I will try to explain the problem in more detail.
I have a database with more than 100.000 records, and every day I insert and compare 10-20 new records against this database.
These records are sentences of 2 to 10 words, and I need to write a fast method that compares the new records with those in the database; the result should be the percentage of words from one sentence that are contained in the other.

I need the records that have more than a 70% word match.

I hope that I'm clear now.

旧城烟雨 2024-10-11 03:15:36

I'm not a C# programmer, but here are a few general tips:

  1. Move the floating point arithmetic out of the loop. You should be able to count the characters that match and do the division later.
  2. You should be able to run each "long" loop in a separate thread of execution since the data is static. I would spawn a separate thread for each of your "10" sentences and run them in parallel.
  3. You might want to remove the call to split if you can. Basically, remove any extra memory allocations.

The final thought is to grab an algorithms book or google for text processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP v3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.
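
For tip 2, here is a minimal sketch of running the "outer" comparisons in parallel with Parallel.ForEach; storedSentences, newSentences and the stubbed BreakStringsAndCheck are placeholders for the poster's actual data and method:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelCompareSketch
{
    // Stand-in for the comparison method from the question.
    static double BreakStringsAndCheck(string s1, string s2) => 0.0;

    static void Main()
    {
        var storedSentences = new List<string>();   // the 100.000 records
        var newSentences = new List<string>();      // the 10-20 new records

        // One parallel iteration per new sentence; the stored list is only read,
        // so it needs no locking, and the concurrent dictionary collects the scores.
        var results = new ConcurrentDictionary<string, double[]>();
        Parallel.ForEach(newSentences, newSentence =>
        {
            var scores = new double[storedSentences.Count];
            for (int i = 0; i < storedSentences.Count; i++)
                scores[i] = BreakStringsAndCheck(newSentence, storedSentences[i]);
            results[newSentence] = scores;
        });
    }
}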

不美如何 2024-10-11 03:15:36

Have you looked at the Intersect method as an alternative? I have no idea about its performance, but it looks like it may work.

毁虫ゝ 2024-10-11 03:15:36

Personally I'd avoid creating the two arrays; the memory allocations will kill performance.

Try looking at the string.IndexOf function to find where the next space is in both strings, subtract that from the previous space location to work out the word length. If the two lengths are equal then use string.Compare to see if the two sub-strings are equal. This will avoid memory allocations and only iterate through the strings once, so should be faster.
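
A rough sketch of that idea, assuming words are separated by single spaces and that comparison is case-sensitive (both simplifications; the method name is invented here):

using System;

static class WordOverlap
{
    // Counts words of s1 that also occur in s2, without allocating word arrays.
    static int CountMatchingWords(string s1, string s2)
    {
        int matches = 0;
        int start1 = 0;
        while (start1 < s1.Length)
        {
            int end1 = s1.IndexOf(' ', start1);
            if (end1 < 0) end1 = s1.Length;
            int len1 = end1 - start1;

            int start2 = 0;
            while (start2 < s2.Length)
            {
                int end2 = s2.IndexOf(' ', start2);
                if (end2 < 0) end2 = s2.Length;
                int len2 = end2 - start2;

                // Only compare the sub-strings when the word lengths match.
                if (len1 == len2 &&
                    string.Compare(s1, start1, s2, start2, len1, StringComparison.Ordinal) == 0)
                {
                    matches++;
                    break;              // count each word of s1 at most once
                }
                start2 = end2 + 1;
            }
            start1 = end1 + 1;
        }
        return matches;
    }
}

Note that the early break counts each word of s1 at most once, which differs slightly from the question's double-counting of repeated words.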

Also, as others have mentioned, definitely look at using the Parallel extensions.

恏ㄋ傷疤忘ㄋ疼 2024-10-11 03:15:36

Here's a different approach. I'm guessing that when you compare 10 sentences against 100'000 sentences, there are going to be a large number where no words match and % = 0. Instead of always performing 100'000 comparisons, find those sentences in the 100'000 where at least one word matches and only compare them.

Create (once) a dictionary of all the words in the 100'000 sentences.

Each entry is a list L of sentences containing this word.

tobetested=empty
For each s in the 10 sentences
  for each word in s
    if dictionary.contains(word) then
      add members of L that aren't already there to tobetested
  next
  for each sentence in tobetested ' hopefully much less than 100'000
    compare using your algorithm
  next
next
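
In C#, that filtering step might look roughly like the sketch below; the index is built once from the stored sentences, and the names here are illustrative:

using System.Collections.Generic;

class CandidateFilterSketch
{
    // Built once from the 100.000 stored sentences: word -> indexes of the sentences containing it.
    static Dictionary<string, List<int>> BuildIndex(IReadOnlyList<string> sentences)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < sentences.Count; i++)
        {
            foreach (var word in sentences[i].Split(' '))
            {
                if (!index.TryGetValue(word, out var ids))
                    index[word] = ids = new List<int>();
                ids.Add(i);
            }
        }
        return index;
    }

    // Indexes of stored sentences sharing at least one word with s: the only
    // candidates worth running the full comparison against.
    static HashSet<int> Candidates(Dictionary<string, List<int>> index, string s)
    {
        var toBeTested = new HashSet<int>();
        foreach (var word in s.Split(' '))
            if (index.TryGetValue(word, out var ids))
                toBeTested.UnionWith(ids);
        return toBeTested;
    }
}
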
眼波传意 2024-10-11 03:15:36

Try this.

Before performing any comparisons, preprocess the 100,000 rows.
Every word in the 100,000 rows is going to be a key in a Dictionary<> object, the value is going to be a list of id's (the id's of each row that word occurs on), e.g.

Dictionary<string, List<int>> allWords

When "searching for a match", you keep a second dictionary, this one is keyed by row id, and it's value is an integer you'll increment. e.g.

Dictionary<int, int> matches

You split the search string into words, and for each row id for each word you increment the value for that row id.

var searchWords = search.Split(' ');
foreach (var word in searchWords)
{
    // Skip words that do not occur in any of the stored rows.
    if (!allWords.TryGetValue(word, out var ids))
        continue;
    foreach (var id in ids)
        matches[id] = matches.TryGetValue(id, out var count) ? count + 1 : 1;
}
var bestRowId = (from m in matches orderby m.Value select m.Key).Last();

The row id with the greatest value is the best match.

It'll take some time up front to build the dictionary (but I'd guess not much more than for a single comparison), but it will be blindingly fast after that.

NB: The key to the speed here is that Dictionary will use the HashCode of the key it's storing, and the .net hash function for strings is excellent.

Update

If pre-processing on this order takes too long, then you can do a lighter pre-process.
As you read each of the 100,000 rows, split it into words, and sort the array of words. Then as you compare, split the string to compare and sort it also.
Your function then saves time because it isn't splitting each string multiple times, and your nested loops can be replaced with a loop for min(words1.length, words2.length).
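
As a rough illustration of that lighter variant: once both word arrays are sorted (for example with Array.Sort(words, StringComparer.Ordinal) right after the split), the common words can be counted with a single merge-style pass instead of the nested loops. A minimal sketch, assuming ordinally sorted input arrays:

// Counts the words two ordinally sorted word arrays have in common in one pass.
static int CountCommonWords(string[] sortedWords1, string[] sortedWords2)
{
    int i = 0, j = 0, common = 0;
    while (i < sortedWords1.Length && j < sortedWords2.Length)
    {
        int cmp = string.CompareOrdinal(sortedWords1[i], sortedWords2[j]);
        if (cmp == 0) { common++; i++; j++; }
        else if (cmp < 0) i++;
        else j++;
    }
    return common;
}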

微凉 2024-10-11 03:15:36

As the data is in the database, can you not do the work in the database?

Shred the sentences into words, keyed against the sentence row.

Join your words against the shredded words. This should allow you to see which sentences have a matching word.

If you then group and sum them by the sentence id you should get the sum of words that match in the specified sentence against stored sentences.

I would look to shredding your data beforehand. Use them as indexes against your main sentence table.

§对你不离不弃 2024-10-11 03:15:36

Intersect example

private double BreakStringsAndCheck(string s1, string s2)
{
    var split1 = s1.Split(' ');
    return (double)split1.Intersect(s2.Split(' ')).Count() / split1.Count() * 100.0;
}

I would prefer to return the ratio 0.4 instead of 40.0 for:

var percent = BreakStringsAndCheck("Jan Banan går till GAIS.", "I Torsk på Tallin så var en annan Jan Banan med.");

I just realized that your algorithm always compares the shorter string to the longer. So your algorithm would return 40.0 even if the input parameters are switched like this

var percent = BreakStringsAndCheck("I Torsk på Tallin så var en annan Jan Banan med.", "Jan Banan går till GAIS.");

but my intersect example will return 18.18. I feel that this is more correct but if you really want your way then just add

if (s1.Length > s2.Length)
{
    var tmp = s2;
    s2 = s1;
    s1 = tmp;
}

to the beginning of the method.

Presplitting

var presplits = new List<string[]>() { s1.Split(' '), s2.Split(' '), s3.Split(' ') };

...

private static IEnumerable<double> StringsInString(IEnumerable<string[]> strings, string s2)
{
    return strings.Select(h => (double)h.Intersect(s2.Split(' ')).Count() / h.Count());
}

then loop over all your 100.000 strings in a Parallel.For.
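
One possible way to wire this together, as a sketch: storedSentences and newSentences stand in for the poster's data, StringsInString is the method above, and System.Linq plus System.Threading.Tasks are assumed to be in scope:

// Split every stored sentence only once, up front.
var presplits = storedSentences.Select(s => s.Split(' ')).ToList();

// Score each new sentence against all stored sentences in parallel.
Parallel.ForEach(newSentences, newSentence =>
{
    // ratios[i] is the fraction of stored sentence i's words that also occur in newSentence.
    var ratios = StringsInString(presplits, newSentence).ToArray();
    // ... keep the stored sentences where ratios[i] > 0.7
});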

PS. I think that you will have to downcase the strings and remove '.', ',' and so on to get a more correct ratio.
DS.

仅冇旳回忆 2024-10-11 03:15:36

If you split the 10 records first, then you're finding a small number of strings in many larger strings. This seems to fit http://en.wikipedia.org/wiki/String_searching_algorithm#Algorithms_using_finite_set_of_patterns

and the Aho-Corasick algorithm might work well for you

How long are the records?

EDIT:

This is an unnecessary switcharound - your comparison is symmetric wrt firstArray and secondArray

if (firstArray.Length > secondArray.Length)
{
    string[] tempArray = firstArray;
    firstArray = secondArray;
    secondArray = tempArray;
}

instead, replace the return with

return findLongest ? value : (firstArray.Length > secondArray.Length ? value / secondArray.Length : value / firstArray.Length);

only with something more readable :)

UPDATE after question update

So you could pre-process the 100,000 (e.g. to hash the words)? And only 10-20 change per day so keeping the preprocessed data up to date would be easy.

You definitely need to do something that exploits the relatively static nature of the 100,000. Even if you did the pre-processing just once per day, you could compare against everything processed up to the previous day, then use your current slower approach for any records added since the last preprocessing run. From what you say, there will be at most 10-20 of those.

I think either the hashing idea or building an Aho-Corasick trie from the corpus would give you much faster searching.
