I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.
I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
{
double theResult = 0;
String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
if (Splitted1.Length < Splitted2.Length)
{
String[] Temp = Splitted2;
Splitted2 = Splitted1;
Splitted1 = Temp;
}
int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.
for (int loop = 0; loop < Splitted1.Length; loop++)
{
for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
BestWord[loop] = -1;
}
int WordsMatched = 0;
for (int loop = 0; loop < Splitted1.Length; loop++)
{
String String1 = Splitted1[loop];
for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
{
String String2 = Splitted2[loop1];
int LevenshteinDistance = Compute(String1, String2);
theScores[loop, loop1] = LevenshteinDistance;
if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
}
}
for (int loop = 0; loop < Splitted1.Length; loop++)
{
if (theScores[loop, BestWord[loop]] == 1000) continue;
for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
{
if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
{
//The first in order has the advantage of keeping the word in equality
if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
{
theScores[loop1, BestWord[loop1]] = 1000;
int CurrentBest = -1;
int CurrentScore = 1000;
for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
{
//Find next bestword
if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
{
CurrentBest = loop2;
CurrentScore = theScores[loop1, loop2];
}
}
BestWord[loop1] = CurrentBest;
}
else//the latter has a better score
{
theScores[loop, BestWord[loop]] = 1000;
int CurrentBest = -1;
int CurrentScore = 1000;
for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
{
//Find next bestword
if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
{
CurrentBest = loop2;
CurrentScore = theScores[loop, loop2];
}
}
BestWord[loop] = CurrentBest;
}
loop = -1;
break;//recalculate all
}
}
}
for (int loop = 0; loop < Splitted1.Length; loop++)
{
if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
else
{
theResult += theScores[loop, BestWord[loop]];
if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
}
}
int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
if(theResult > theLength) theResult = theLength;
theResult = (1 - (theResult / theLength)) * 100;
WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
return theResult;
}
Here is some code I have written for a project I am working on. I need to know the Similarity Ratio of the strings and the Similarity Ratio based on words of the strings. This last one, I want to know both the Words Similarity Ratio of the smallest string(so if all words exist and match in the larger string the result will be 100%) and the Words Similarity Ratio of the larger string(which I call RealWordsRatio). I use the Levenshtein algorithm to find the distance. The code is unoptimised, so far, but it works as expected. I hope you find it useful.
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
{
double theResult = 0;
String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
if (Splitted1.Length < Splitted2.Length)
{
String[] Temp = Splitted2;
Splitted2 = Splitted1;
Splitted1 = Temp;
}
int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.
for (int loop = 0; loop < Splitted1.Length; loop++)
{
for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
BestWord[loop] = -1;
}
int WordsMatched = 0;
for (int loop = 0; loop < Splitted1.Length; loop++)
{
String String1 = Splitted1[loop];
for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
{
String String2 = Splitted2[loop1];
int LevenshteinDistance = Compute(String1, String2);
theScores[loop, loop1] = LevenshteinDistance;
if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
}
}
for (int loop = 0; loop < Splitted1.Length; loop++)
{
if (theScores[loop, BestWord[loop]] == 1000) continue;
for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
{
if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
{
//The first in order has the advantage of keeping the word in equality
if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
{
theScores[loop1, BestWord[loop1]] = 1000;
int CurrentBest = -1;
int CurrentScore = 1000;
for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
{
//Find next bestword
if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
{
CurrentBest = loop2;
CurrentScore = theScores[loop1, loop2];
}
}
BestWord[loop1] = CurrentBest;
}
else//the latter has a better score
{
theScores[loop, BestWord[loop]] = 1000;
int CurrentBest = -1;
int CurrentScore = 1000;
for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
{
//Find next bestword
if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
{
CurrentBest = loop2;
CurrentScore = theScores[loop, loop2];
}
}
BestWord[loop] = CurrentBest;
}
loop = -1;
break;//recalculate all
}
}
}
for (int loop = 0; loop < Splitted1.Length; loop++)
{
if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
else
{
theResult += theScores[loop, BestWord[loop]];
if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
}
}
int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
if(theResult > theLength) theResult = theLength;
theResult = (1 - (theResult / theLength)) * 100;
WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
return theResult;
}
Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.
To deal with 'sound alikes' you may want to look into encoding using a phonetic algorithm like Double Metaphone or soundex. I don't know if computing Levenshtein distances on phonetic encoded strings would be beneficial or not, but might be a possibility. Alternately, you could use a heuristic like: convert each word in the string to its encoded form and remove any words that occur in both strings and replace them with a single representation before computing the Levenshtein distance.
If you're comparing values in a SQL database you can use the SOUNDEX function. If you query Google for SOUNDEX and C#, some people have written a similar function for that and VB.
Metaphone 3 is the third generation of the Metaphone algorithm. It increases the accuracy of phonetic encoding from the 89% of Double Metaphone to 98%, as tested against a database of the most common English words, and names and non-English words familiar in North America. This produces an extremely reliable phonetic encoding for American pronunciations.
Metaphone 3 was designed and developed by Lawrence Philips, who designed and developed the original Metaphone and Double Metaphone algorithms.
发布评论
评论(12)
有多种不同的方法可以做到这一点。 查看维基百科“字符串相似性度量”页面,获取其他包含算法的页面的链接。
然而,我不认为这些算法中的任何一个都会考虑声音 - 因此“staq 溢出”将与“堆栈溢出”和“staw 溢出”类似,尽管第一个在术语上更相似的发音。
我刚刚找到另一个页面,它提供了更多选项。特别是 Soundex 算法 (Wikipedia)可能更接近您所追求的内容。
There are various different ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.
I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.
I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
Levenshtein 距离 可能就是您正在寻找的。
Levenshtein distance is probably what you're looking for.
这是我为我正在从事的项目编写的一些代码。 我需要知道字符串的相似率和基于字符串单词的相似率。
最后一个,我想知道最小字符串的单词相似度(因此,如果所有单词都存在并在较大字符串中匹配,则结果将为 100%)和较大字符串的单词相似度(我称之为 RealWordsRatio) )。
我使用 Levenshtein 算法来计算距离。 到目前为止,代码尚未优化,但它按预期工作。 希望对你有帮助。
Here is some code I have written for a project I am working on. I need to know the Similarity Ratio of the strings and the Similarity Ratio based on words of the strings.
This last one, I want to know both the Words Similarity Ratio of the smallest string(so if all words exist and match in the larger string the result will be 100%) and the Words Similarity Ratio of the larger string(which I call RealWordsRatio).
I use the Levenshtein algorithm to find the distance. The code is unoptimised, so far, but it works as expected. I hope you find it useful.
不久前,我用 C# 编写了一个 Double Metaphone 实现。 您会发现它远远优于 Soundex 等。
编辑距离也被提出,它是一个有很多用途的伟大算法,但语音匹配并不是它真正的作用; 只是有时看起来是这样,因为语音相似的单词通常拼写也相似。 我对各种模糊匹配算法进行了分析,您可能也会发现它很有用。
I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.
Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.
为了处理“声音相似”的问题,您可能需要考虑使用语音算法(例如 Double Metaphone 或 soundex)进行编码。 我不知道在语音编码字符串上计算编辑距离是否有益,但可能是一种可能性。 或者,您可以使用启发式方法,例如:将字符串中的每个单词转换为其编码形式,并删除两个字符串中出现的任何单词,并在计算编辑距离之前将它们替换为单个表示形式。
To deal with 'sound alikes' you may want to look into encoding using a phonetic algorithm like Double Metaphone or soundex. I don't know if computing Levenshtein distances on phonetic encoded strings would be beneficial or not, but might be a possibility. Alternately, you could use a heuristic like: convert each word in the string to its encoded form and remove any words that occur in both strings and replace them with a single representation before computing the Levenshtein distance.
您可能会查找字符串“距离”,例如编辑距离。
You might look for string "distances", for example the Levenshtein distance.
Perl 模块 Text::Phonetic 具有各种算法的实现。
Perl module Text::Phonetic has implementations of various algorithms.
Jeff Atwood 写过关于寻找类似解决方案来确定 wiki 帖子的作者身份的文章这可能会帮助您缩小搜索范围。
Jeff Atwood wrote about looking for a similar solution for determining the authorship of wiki posts which may help you narrow your search.
如果您要比较 SQL 数据库中的值,则可以使用 SOUNDEX 函数。 如果您在 Google 上查询 SOUNDEX 和 C#,有些人已经为此和 VB 编写了类似的函数。
If you're comparing values in a SQL database you can use the SOUNDEX function. If you query Google for SOUNDEX and C#, some people have written a similar function for that and VB.
我也必须推荐 Soundex,我过去曾用它来处理拼写错误的城市名称。 这是一个很好的使用链接:http://whitepapers.zdnet.com/abstract。 aspx?docid=352953
I have to recommend Soundex too, I have used it in the past to process misspelt city names. Here is a good link for usage: http://whitepapers.zdnet.com/abstract.aspx?docid=352953
如果您想进行语音比较,请查看 Soundex 和 Metaphone 算法: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex
If you want to compare phonetically, check out the Soundex and Metaphone algorithms: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex