What is the best way to find the most similar string in PHP?

Posted 2024-10-16 19:01:57 · 245 characters · 10 views


Hello,

PHP has a lot of string functions like levenshtein, similar_text and soundex that can compare strings for similarity.
http://www.php.net/manual/en/function.levenshtein.php

Which is the best for accuracy and performance?


Comments (2)

﹂绝世的画 2024-10-23 19:01:57


similar_text has a complexity of O(max(n,m)**3) and levenshtein a complexity of O(m*n), where n and m are the lengths of the strings, so levenshtein should be much faster. Both are 100% accurate in the sense that they are deterministic: the same input always gives the same output. However, the two functions measure similarity differently, so their outputs will differ from each other. If you need a different measure of accuracy, you'll have to write your own comparison function.
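To see how the two built-ins measure different things, here is a quick sketch (the example strings are my own, not from the thread):

```php
<?php
// Hypothetical example strings; levenshtein() and similar_text()
// are the real PHP built-ins discussed above.
$a = 'Hello World';
$b = 'Hello PHP World';

// levenshtein(): number of single-character edits needed.
// Lower means more similar.
$distance = levenshtein($a, $b);

// similar_text(): number of matching characters, found greedily
// via longest common substrings. Higher means more similar.
// The optional third argument receives the similarity percentage.
$matching = similar_text($a, $b, $percent);

echo "levenshtein distance: $distance\n";   // 4 (insert "PHP ")
echo "similar_text matches: $matching\n";   // 11 matching characters
```

Note that one returns a distance and the other a match count, so the numbers run in opposite directions and cannot be compared to each other directly.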

不忘初心 2024-10-23 19:01:57


You did not describe your use case, but in many cases, when we are talking about natural language, words matter more than characters, so both similar_text() and levenshtein() may give less meaningful results at a very high computational cost.

For example, searching a database of a few thousand articles for similar titles using the functions above can easily clog up a server.

What I usually do is write a simple function that accepts two strings, splits them at whitespace into arrays, and counts the intersection to get a more natural matching score at low CPU cost.
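A minimal sketch of that idea (the function name `wordOverlapScore` is my own, not from the answer):

```php
<?php
// Minimal word-overlap scorer: split on whitespace, count shared words.
// wordOverlapScore() is a hypothetical name, not a PHP built-in.
function wordOverlapScore(string $a, string $b): int
{
    $wordsA = preg_split('/\s+/', trim($a));
    $wordsB = preg_split('/\s+/', trim($b));

    // array_intersect() keeps the words of $a that also appear in $b.
    return count(array_intersect($wordsA, $wordsB));
}

echo wordOverlapScore('best php string functions', 'string functions in php');
// 3 shared words: "php", "string", "functions"
```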

With a few improvements it can really excel in several use cases, such as quickly producing recommended articles on a blog, filtered out from the other content.

Improvements I usually implement:

  • lowercase the strings
  • score each matched element by its length raised to the power of 2, since longer strings are harder to match and they tend to indicate a more meaningful similarity between topics
  • before comparison, throw out common words that only modulate meaning; this is language-specific, and in English it might be a list such as: was, were, no, not, than, then, here, there, etc.
  • throw out all punctuation marks from the strings before comparison
  • when dealing with synthetic languages, which may attach various endings to words, enrich the word arrays with variants truncated by the most common suffix lengths before taking their intersection
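Putting most of those refinements together, a sketch might look like this (the function name, the closure, and the tiny stop-word list are illustrative assumptions, not code from the answer; the suffix-truncation step is omitted for brevity):

```php
<?php
// Sketch of the improved scorer described in the list above.
// similarityScore() and the short English stop-word list are
// illustrative assumptions, not part of the original answer.
function similarityScore(string $a, string $b): int
{
    $stopWords = ['was', 'were', 'no', 'not', 'than', 'then', 'here', 'there'];

    $tokenize = function (string $s) use ($stopWords): array {
        $s = strtolower($s);                          // lowercase the string
        $s = preg_replace('/[[:punct:]]+/', ' ', $s); // drop punctuation
        $words = preg_split('/\s+/', trim($s), -1, PREG_SPLIT_NO_EMPTY);
        return array_diff($words, $stopWords);        // drop modulating words
    };

    // Score each shared word by its length squared: longer matches are
    // rarer and signal a more meaningful topical similarity.
    $score = 0;
    foreach (array_unique(array_intersect($tokenize($a), $tokenize($b))) as $word) {
        $score += strlen($word) ** 2;
    }
    return $score;
}

echo similarityScore(
    'Best way to find similar strings?',
    'Finding the most similar string, then...'
);
// only "similar" matches ("then" is a stop word), so the score is 7**2 = 49
```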

It is not perfect, but for comparison: this algorithm processes ca. 5,000,000 blog posts and returns 3 very good similar articles with no noticeable performance impact, while doing the same with levenshtein on the same server takes a good 10-15 seconds, which is obviously not acceptable for a page load.

And if you need difference instead of similarity, the score can be inverted, or you can simply count the non-matching terms after an array diff instead of the matching terms after an array intersect.
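A hedged sketch of that difference variant (`differenceScore` is a hypothetical name):

```php
<?php
// Difference variant: count the words of $a that $b lacks, using
// array_diff() instead of array_intersect(). differenceScore() is
// a hypothetical name, not a PHP built-in.
function differenceScore(string $a, string $b): int
{
    $wordsA = preg_split('/\s+/', strtolower(trim($a)));
    $wordsB = preg_split('/\s+/', strtolower(trim($b)));

    // array_diff() keeps the words of $a that do not appear in $b.
    return count(array_diff($wordsA, $wordsB));
}

echo differenceScore('red green blue', 'green blue yellow');
// 1 word ("red") appears in the first string but not the second
```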
