Hello,
PHP has a lot of string functions, such as levenshtein, similar_text, and soundex, that can compare strings for similarity.
http://www.php.net/manual/en/function.levenshtein.php
Which is the best for accuracy and performance?
Comments (2)
similar_text has a complexity of O(max(n,m)**3) and levenshtein a complexity of O(m*n), where n and m are the lengths of the strings, so levenshtein should be much faster. Both are 100% accurate, in the sense that they are deterministic: the same input always gives the same output. However, the two functions measure different things, so their outputs will differ. If you need a different measure of accuracy, you'll have to write your own comparison function.
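For reference, a minimal sketch showing both built-ins side by side (the example strings are arbitrary, not from the original answer):

```php
<?php
$a = "Hello World";
$b = "Hello Word";

// levenshtein(): minimum number of single-character edits (insert,
// delete, substitute) needed to turn $a into $b. Smaller = more similar.
$distance = levenshtein($a, $b);

// similar_text(): returns the number of matching characters; the
// optional third argument receives the similarity as a percentage.
$matching = similar_text($a, $b, $percent);

echo "levenshtein distance: $distance\n";
echo "similar_text matches: $matching (" . round($percent, 1) . "%)\n";
```

Note that the two scores run in opposite directions: levenshtein is a distance (lower is better), while similar_text is a match count (higher is better).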
You did not describe your use case, but in many cases when we are dealing with natural language, words matter more than characters, so both
similar_text()
and levenshtein()
may give less meaningful results at a very high computational cost. For example, using the functions above to search a database of a few thousand articles for similar titles can easily clog up a server.
What I usually do is write a simple function that accepts two strings, splits them at whitespace into arrays, and counts the intersection to get a more natural matching score at low CPU cost.
With a few improvements it can really excel in several use cases, such as quickly generating recommended articles on a blog, filtered from the other content.
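The answer doesn't include code, but the split-and-intersect idea described above can be sketched roughly as follows (the function name and example strings are hypothetical, not from the original):

```php
<?php
// Hypothetical sketch of the approach described above: split both
// strings on whitespace and count the words they have in common.
function word_match_score(string $a, string $b): int
{
    // Split on runs of whitespace; lowercase for case-insensitive matching.
    $wordsA = preg_split('/\s+/', strtolower(trim($a)));
    $wordsB = preg_split('/\s+/', strtolower(trim($b)));

    // array_intersect() keeps the values of $wordsA that also appear
    // in $wordsB; the count is the matching score.
    return count(array_intersect($wordsA, $wordsB));
}

echo word_match_score(
    "Best way to find similar strings",
    "How to find the most similar string"
); // shared words: "to", "find", "similar"
```

Both array_intersect() and preg_split() are linear-ish in the input size, which is why this scales so much better than a character-level edit distance over whole titles.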
Improvements I usually implement:
It is not perfect, but for comparison: this algorithm processes roughly 5,000,000 blog posts and returns 3 very good similar articles with no noticeable performance impact, while doing the same with levenshtein on the same server takes a good 10-15 seconds, which is obviously unacceptable for a web page load.
And if you need difference instead of similarity, the score can be inverted, or you can simply use the non-matching terms after an array diff instead of counting the matching terms after an array intersect.
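The array-diff variant mentioned above might look like this (again a hypothetical sketch, with made-up names and inputs):

```php
<?php
// Sketch of the "difference" variant: instead of counting shared words,
// return the words of $a that do not appear in $b at all.
function word_diff(string $a, string $b): array
{
    $wordsA = preg_split('/\s+/', strtolower(trim($a)));
    $wordsB = preg_split('/\s+/', strtolower(trim($b)));

    // array_diff() keeps values of $wordsA absent from $wordsB;
    // array_values() reindexes the result from zero.
    return array_values(array_diff($wordsA, $wordsB));
}

print_r(word_diff("red green blue", "green yellow"));
// words in the first string missing from the second: red, blue
```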