String similarity in PHP: levenshtein-like function for long strings
The function levenshtein in PHP works on strings with a maximum length of 255. What are good alternatives for computing a similarity score of sentences in PHP?
Basically I have a database of sentences, and I want to find approximate duplicates. The similar_text function is not giving me the expected results. What is the easiest way for me to detect similar sentences like the ones below?
$ss = "Jack is a very nice boy, isn't he?";
$pp = "jack is a very nice boy is he";
$ss = strtolower($ss); // convert to lower case as we don't care about case
$pp = strtolower($pp);
$score = similar_text($ss, $pp);
echo "$score %\n"; // Outputs just 29 %
$score = levenshtein($ss, $pp);
echo "$score\n"; // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(
The levenshtein algorithm has a time complexity of O(n*m), where n and m are the lengths of the two input strings. This is pretty expensive, and computing such a distance for long strings will take a long time.
For whole sentences, you might want to use a diff algorithm instead; see for example: Highlight the difference between two strings in PHP.
Having said this, PHP also provides the similar_text function, which has an even worse complexity (O(max(n,m)**3)) but seems to work on longer strings.
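To make the O(n*m) cost concrete, here is a minimal sketch of a byte-wise Levenshtein distance written in plain PHP, so it is not bound by the 255-character limit of the built-in. The function name long_levenshtein is a hypothetical helper, not part of PHP. It keeps only two rows of the dynamic-programming table, so memory stays proportional to the shorter string. As far as I know, the built-in levenshtein no longer has this limit from PHP 8.0 onwards, so a helper like this mainly matters on older versions.

```php
<?php
// Hypothetical helper (not a PHP built-in): byte-wise Levenshtein distance
// with no length limit, keeping only two rows of the DP table.
function long_levenshtein(string $a, string $b): int
{
    // The distance is symmetric, so keep the shorter string in $b
    // to make the rows as small as possible.
    if (strlen($a) < strlen($b)) {
        [$a, $b] = [$b, $a];
    }
    $n = strlen($a);
    $m = strlen($b);
    if ($m === 0) {
        return $n;
    }

    // $prev[$j] is the distance between the first $i - 1 bytes of $a
    // and the first $j bytes of $b.
    $prev = range(0, $m);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}

$ss = strtolower("Jack is a very nice boy, isn't he?");
$pp = strtolower("jack is a very nice boy is he");
echo long_levenshtein($ss, $pp), "\n"; // 5, same as the built-in

// Works past 255 bytes as well.
echo long_levenshtein(str_repeat('a', 300), str_repeat('a', 299) . 'b'), "\n"; // 1
```

This works on bytes, like the built-in, so multibyte text would need to be split into characters first. For a whole database of sentences it will be slow in pure PHP, which is the cost the answer above warns about.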
I've found Smith-Waterman-Gotoh to be the best algorithm for comparing sentences. More info in this answer. Here is the PHP code example:
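The answer's original code is not reproduced here. As a rough stand-in, the following is a simplified sketch of Smith-Waterman local alignment with a linear gap penalty. It is not the Gotoh variant, which uses affine gap penalties (a separate cost for opening and for extending a gap). The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```php
<?php
// Simplified Smith-Waterman local alignment, linear gap penalty.
// The name and the default scores are illustrative assumptions.
function smith_waterman_similarity(
    string $a,
    string $b,
    int $match = 1,
    int $mismatch = -1,
    int $gap = -1
): float {
    $n = strlen($a);
    $m = strlen($b);
    if ($n === 0 || $m === 0) {
        return 0.0;
    }

    $best = 0;
    // Only the previous row of the score matrix is needed.
    $prev = array_fill(0, $m + 1, 0);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [0];
        for ($j = 1; $j <= $m; $j++) {
            $diag = $prev[$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
            // Scores never drop below 0, which is what makes the alignment local.
            $curr[$j] = max(0, $diag, $prev[$j] + $gap, $curr[$j - 1] + $gap);
            $best = max($best, $curr[$j]);
        }
        $prev = $curr;
    }

    // Normalise by the best score the shorter string could reach.
    return $best / ($match * min($n, $m));
}

$ss = strtolower("Jack is a very nice boy, isn't he?");
$pp = strtolower("jack is a very nice boy is he");
printf("%.3f\n", smith_waterman_similarity($ss, $pp));
```

The result lies between 0 and 1, and near-duplicate sentences score close to 1. Because the alignment is local, a short sentence contained in a much longer one also scores high, so a threshold would need tuning against real data.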
You could try using similar_text.
It can get quite slow with 20,000+ characters (3-5 seconds), but since your example only uses sentences, it will work just fine for that usage.
One thing to note is that when comparing strings of different sizes you will not get 100%. For example, if you compare "he" with "head" you would only get about a 67% match, since the percentage is twice the number of matching characters divided by the combined length of both strings (2 × 2 / 6).
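One detail worth checking when using similar_text: its return value is the number of matching characters, not a percentage. The percentage is written to the optional third argument, which is passed by reference. A short sketch with the sentences from the question (variable names are only illustrative):

```php
<?php
$ss = strtolower("Jack is a very nice boy, isn't he?");
$pp = strtolower("jack is a very nice boy is he");

// The return value is the count of matching characters...
$matching = similar_text($ss, $pp, $percent);
// ...and the percentage lands in the third argument.
printf("%d matching characters, %.2f%% similar\n", $matching, $percent);
// 29 matching characters, 92.06% similar

similar_text("he", "head", $percent);
printf("%.2f%%\n", $percent); // 66.67%
```

So the 29 in the question is a character count, and all 29 characters of the shorter sentence were matched. The percentage is 2 × 29 / (34 + 29), roughly 92%.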