Fuzzy text (sentence/title) matching in C#
Hey, I'm using the Levenshtein algorithm to get the distance between a source string and a target string.
I also have a method that maps that distance to a value between 0 and 1:
/// <summary>
/// Gets the similarity between two strings.
/// All similarity scores are in the [0, 1] range:
/// a score of 1 means the two strings are identical.
/// </summary>
/// <param name="s1">The first string.</param>
/// <param name="s2">The second string.</param>
/// <returns>The similarity score in [0, 1].</returns>
public static float CalculateSimilarity(String s1, String s2)
{
    if ((s1 == null) || (s2 == null)) return 0.0f;

    float dis = LevenshteinDistance.Compute(s1, s2);
    float maxLen = Math.Max(s1.Length, s2.Length);
    if (maxLen == 0.0f)
        return 1.0f;
    return 1.0f - dis / maxLen;
}
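The `LevenshteinDistance.Compute` helper referenced above isn't shown in the question; for reference, a minimal sketch of the standard unit-cost, dynamic-programming edit distance (an assumption about what that helper does) could look like this:

```csharp
using System;

public static class LevenshteinDistance
{
    /// <summary>
    /// Classic dynamic-programming edit distance: the minimum number of
    /// single-character insertions, deletions and substitutions needed
    /// to turn s1 into s2.
    /// </summary>
    public static int Compute(string s1, string s2)
    {
        int n = s1.Length, m = s2.Length;
        int[,] d = new int[n + 1, m + 1];

        //* distance from the empty prefix is just the prefix length.
        for (int i = 0; i <= n; ++i) d[i, 0] = i;
        for (int j = 0; j <= m; ++j) d[0, j] = j;

        for (int i = 1; i <= n; ++i)
        {
            for (int j = 1; j <= m; ++j)
            {
                int cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,     // deletion
                             d[i, j - 1] + 1),    // insertion
                    d[i - 1, j - 1] + cost);      // substitution
            }
        }
        return d[n, m];
    }
}
```

For example, `Compute("kitten", "sitting")` returns 3.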
But that's not enough for me, because I need a smarter way to match two sentences.
For example, I want to auto-tag some music. I have the original song titles, and I have song files whose names are padded with junk: words like "super", years like 2007 or 2008, and so on. Some files are just http://trash..thash..song_name_mp3.mp3, others are clean. I want to build a better algorithm than the one I have now. Maybe someone can help?
Here is my current algorithm:
/// <summary>
/// Determines whether this target keyword should be ignored.
/// </summary>
/// <param name="targetString">The target string.</param>
/// <returns>true if the keyword is a stop word or matches a special case (e.g. a year).</returns>
private bool doIgnore(String targetString)
{
    if (!String.IsNullOrEmpty(targetString))
    {
        //* special cases like years are checked once via Regex, not once per stop word.
        if (isMatchInSpecialCases(targetString)) return true;
        for (int i = 0; i < ignoreWordsList.Length; ++i)
        {
            if (targetString == ignoreWordsList[i]) return true;
        }
    }
    return false;
}
/// <summary>
/// Removes duplicates from a sorted list (only adjacent equal entries are compared,
/// so the list must be sorted first).
/// </summary>
/// <param name="list">The sorted list.</param>
private void removeDuplicates(List<String> list)
{
    if ((list != null) && (list.Count > 0))
    {
        for (int i = 0; i < list.Count - 1; ++i)
        {
            if (list[i] == list[i + 1])
            {
                list.RemoveAt(i);
                --i;
            }
        }
    }
}
/// <summary>
/// Does the fuzzy match.
/// </summary>
/// <param name="targetTitle">The target title.</param>
/// <returns>The best match, or null if nothing scores above minimalScore.</returns>
private TitleMatchResult doFuzzyMatch(String targetTitle)
{
    TitleMatchResult matchResult = null;
    if (!String.IsNullOrEmpty(targetTitle))
    {
        try
        {
            //* change target title (string) to lower case.
            targetTitle = targetTitle.ToLower();
            //* scores; we will select the highest score at the end.
            Dictionary<Title, float> scores = new Dictionary<Title, float>();
            //* split on special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '+', '_'
            List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
            //* remove all trash from keywords, like super, quality, etc.
            targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });
            //* sort keywords so removeDuplicates can drop adjacent duplicates.
            targetKeywords.Sort();
            removeDuplicates(targetKeywords);
            //* go through all original titles.
            foreach (Title sourceTitle in titles)
            {
                float tempScore = 0f;
                //* split the original title into a keyword list.
                List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
                sourceKeywords.Sort();
                removeDuplicates(sourceKeywords);
                //* for each source keyword, take its best similarity against any target keyword.
                foreach (String keyw1 in sourceKeywords)
                {
                    float max = float.MinValue;
                    foreach (String keyw2 in targetKeywords)
                    {
                        float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);
                        if (currentScore > max)
                        {
                            max = currentScore;
                        }
                    }
                    tempScore += max;
                }
                //* calculate the average score.
                float averageScore = tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count);
                //* keep it if the average beats the threshold and the target is not on this title's ignore list.
                if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))
                {
                    scores.Add(sourceTitle, averageScore);
                }
            }
            //* choose the biggest score.
            float maxi = float.MinValue;
            foreach (KeyValuePair<Title, float> kvp in scores)
            {
                if (kvp.Value > maxi)
                {
                    maxi = kvp.Value;
                    matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);
                }
            }
        }
        catch
        {
            //* swallowing every exception hides real bugs; this should at least log.
        }
    }
    return matchResult;
}
This works, but only in some cases; a lot of titles that should match don't. I think I need some kind of formula to play with weights and so on, but I can't come up with one.
Ideas? Suggestions? Algorithms?
By the way, I already know about this topic (my colleague already posted it, but we couldn't come up with a proper solution for this problem): Approximate string matching algorithms
There is a GitHub repo implementing several methods.
There's a lot of work done on the somewhat related problem of DNA sequence alignment (search for "local sequence alignment") - the classic algorithm is "Needleman-Wunsch", and more complex modern ones are also easy to find. The idea is - similar to Greg's answer - instead of identifying and comparing keywords, try to find the longest loosely matching substrings within the long strings.
That being said, if the only goal is sorting music, a number of regular expressions to cover possible naming schemes would probably work better than any generic algorithm.
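To sketch the regex idea on the question's own example (the patterns below are illustrative assumptions; the real value of this approach is extending the list to cover the naming schemes you actually see):

```csharp
using System.Text.RegularExpressions;

public static class JunkStripper
{
    //* Illustrative cleanup pipeline: each pattern targets one kind of junk
    //* seen in the question (extensions, URL schemes, years, filler words).
    public static string StripJunk(string fileName)
    {
        string s = fileName.ToLowerInvariant();
        s = Regex.Replace(s, @"\.mp3$", "");                  // file extension
        s = Regex.Replace(s, @"https?://", "");               // URL scheme
        s = Regex.Replace(s, @"[_\.\-]+", " ");               // separators -> spaces
        s = Regex.Replace(s, @"\b(19|20)\d{2}\b", "");        // years like 2007, 2008
        s = Regex.Replace(s, @"\b(super|quality|mp3)\b", ""); // known junk words
        return Regex.Replace(s, @"\s+", " ").Trim();          // collapse whitespace
    }
}
```

With these patterns, `StripJunk("http://trash..thash..song_name_mp3.mp3")` comes out as `"trash thash song name"`, which is already much easier to match against a clean title.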
Your problem here may be distinguishing between noise words and useful data:
You may need to produce a dictionary of noise words to ignore. That seems clunky, but I'm not sure there's an algorithm that can distinguish between band/album names and noise.
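Such a dictionary can be as simple as a case-insensitive hash set; the word list below is only an illustrative assumption, to be grown from real file names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NoiseFilter
{
    //* Hypothetical noise-word dictionary -- populate it from your real data.
    static readonly HashSet<string> NoiseWords = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    { "mp3", "www", "http", "super", "quality", "album", "track" };

    //* Keeps only the keywords that are not known noise.
    public static List<string> FilterNoise(IEnumerable<string> keywords)
    {
        return keywords.Where(w => !NoiseWords.Contains(w)).ToList();
    }
}
```

This slots in where the question's `doIgnore` currently runs, with the hash set giving O(1) lookups instead of a linear scan over `ignoreWordsList`.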
It sounds like what you want may be a longest substring match. That is, in your example, two files like
trash..thash..song_name_mp3.mp3
and
garbage..spotch..song_name_mp3.mp3
would end up looking the same.
You'd need some heuristics there, of course. One thing you might try is putting the string through a soundex converter. Soundex is the "codec" used to see if things "sound" the same (as you might tell a telephone operator). It's more or less a rough phonetic and mispronunciation semi-proof transliteration. It is definitely poorer than edit distance, but much, much cheaper. (The official use is for names, and only uses three characters. There's no reason to stop there, though, just use the mapping for every character in the string. See wikipedia for details)
So my suggestion would be to soundex your strings, chop each one into a few length tranches (say 5, 10, 20) and then just look at clusters. Within clusters you can use something more expensive like edit distance or max substring.
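A simplified Soundex encoder in C# might look like the sketch below. Note the hedges: classic American Soundex is fixed at 4 characters and has extra rules for H and W that this sketch deliberately skips, and the configurable length follows the answer's suggestion of not stopping at three digits.

```csharp
using System;
using System.Text;

public static class Soundex
{
    //* Simplified Soundex: keep the first letter, then map remaining
    //* letters to digit classes, dropping vowels and adjacent repeats.
    //* (The official algorithm adds special H/W handling, omitted here.)
    public static string Encode(string word, int maxLength = 4)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        //* digit class for each letter a..z (0 = vowel-like, dropped).
        const string codes = "01230120022455012623010202";
        var sb = new StringBuilder();
        char first = char.ToUpperInvariant(word[0]);
        sb.Append(first);

        char prev = (first >= 'A' && first <= 'Z') ? codes[first - 'A'] : '0';
        for (int i = 1; i < word.Length && sb.Length < maxLength; ++i)
        {
            char c = char.ToUpperInvariant(word[i]);
            if (c < 'A' || c > 'Z') { prev = '0'; continue; }
            char code = codes[c - 'A'];
            if (code != '0' && code != prev) sb.Append(code);
            prev = code;
        }
        return sb.ToString().PadRight(maxLength, '0');
    }
}
```

For instance, `Encode("Robert")` and `Encode("Rupert")` both give `"R163"`, which is exactly the collision behaviour you want before clustering.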
Kind of old, but it might be useful to future visitors. If you're already using the Levenshtein algorithm and you need to do a little better, I describe some very effective heuristics in this solution:
Getting the closest string match
The key is that you come up with 3 or 4 (or more) methods of gauging the similarity between your phrases (Levenshtein distance is just one method) - and then using real examples of strings you want to match as similar, you adjust the weightings and combinations of those heuristics until you get something that maximizes the number of positive matches. Then you use that formula for all future matches and you should see great results.
If a user is involved in the process, it's also best if you provide an interface which allows the user to see additional matches that rank highly in similarity in case they disagree with the first choice.
Here's an excerpt from the linked answer. If you end up wanting to use any of this code as is, I apologize in advance for having to convert VBA into C#.
Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.
valueWords, valuePhrase, and Split function:
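(The VBA from the linked answer isn't reproduced here. As a rough C# sketch of the valueWords idea, reusing the question's `LevenshteinDistance.Compute` helper, it might look like this; the delimiter set is an assumption.)

```csharp
using System;

public static class WordMetrics
{
    //* Sketch of "valueWords": split both phrases on delimiters, then for
    //* each word in one phrase take the smallest Levenshtein distance to
    //* any word in the other, and sum those minima. A word-order-insensitive
    //* measure of how much of one phrase is contained in the other.
    public static float ValueWords(string phrase1, string phrase2, char[] delimiters)
    {
        string[] words1 = phrase1.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        string[] words2 = phrase2.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        float total = 0f;
        foreach (string w1 in words1)
        {
            float best = float.MaxValue;
            foreach (string w2 in words2)
                best = Math.Min(best, LevenshteinDistance.Compute(w1, w2));
            total += best;
        }
        return total;
    }
}
```

Because each word only needs its best counterpart, `ValueWords("song name", "name song", new[] { ' ' })` is 0 even though the plain phrase-level Levenshtein distance is large.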
Measures of Similarity
Using these two metrics, and a third which simply computes the distance between two strings, I have a series of variables which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.
Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well-defined permutations, and engineering the final formula such that they had increasing search value results as desired.
As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.
Application
To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string match can weight the parameters differently. The formula that defines the final score is a simple combination of the metrics and their weights:
Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied matched strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.
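The weighted combination itself can be sketched generically: each metric is a function of the two strings, and the final score is the weighted sum described above. The metrics and weights below are placeholders to be tuned against a known set of good matches, not fixed choices.

```csharp
using System;
using System.Linq;

public static class FuzzyScoring
{
    //* Final score = sum of weight_i * metric_i(a, b). Plug in valuePhrase,
    //* valueWords, length difference, etc. as the metric functions, then
    //* tune the weights (by hand or an optimizer) on labelled examples.
    public static float CombinedScore(
        string a, string b,
        params (Func<string, string, float> metric, float weight)[] weightedMetrics)
    {
        return weightedMetrics.Sum(m => m.weight * m.metric(a, b));
    }
}
```

For example, with two dummy metrics returning 1 and 2, weights 0.5 and 0.25 give a combined score of 1.0; in practice you would pass your real metric functions and the weights found by the optimizer.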