PHP 中的字符串相似度:类似 levenshtein 的长字符串函数

发布于 2024-10-19 00:30:50 字数 608 浏览 4 评论 0原文

PHP 中的函数 levenshtein 适用于最大长度为 255 的字符串。在 PHP 中计算句子的相似度分数有哪些好的替代方法。

基本上我有一个句子数据库,我想找到近似的重复项。 similar_text 函数没有给我预期的结果。对我来说检测类似句子的最简单方法是什么:

$ss="Jack is a very nice boy, isn't he?";
$pp="jack is a very nice boy is he";

$ss=strtolower($ss);  // convert to lower case as we dont care about case
$pp=strtolower($pp);

$score=similar_text($ss, $pp);
echo "$score %\n";  // Outputs just 29 %

$score=levenshtein ( $ss, $pp );
echo "$score\n";  // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(

The function levenshtein in PHP works on strings with maximum length 255. What are good alternatives to compute a similarity score of sentences in PHP.

Basically I have a database of sentences, and I want to find approximate duplicates.
similar_text function is not giving me expected results. What is the easiest way for me to detect similar sentences like below:

$ss="Jack is a very nice boy, isn't he?";
$pp="jack is a very nice boy is he";

$ss=strtolower($ss);  // convert to lower case as we dont care about case
$pp=strtolower($pp);

$score=similar_text($ss, $pp);
echo "$score %\n";  // Outputs just 29 %

$score=levenshtein ( $ss, $pp );
echo "$score\n";  // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

回眸一遍 2024-10-26 00:30:50

levenshtein 算法的时间复杂度为 O(n*m),其中 nm 是长度两个输入字符串的。这是相当昂贵的,并且计算长字符串的这样的距离将需要很长时间。

对于整个句子,您可能需要使用 diff 算法,例如:在 PHP 中突出显示两个字符串之间的区别

话虽如此,PHP 还提供了 similar_text 函数的复杂度更差 (O(max(n,m)**3)) 但似乎适用于较长的字符串。

The levenshtein algorithm has a time complexity of O(n*m), where n and m are the lengths of the two input strings. This is pretty expensive and computing such a distance for long strings will take a long time.

For whole sentences, you might want to use a diff algorithm instead, see for example: Highlight the difference between two strings in PHP

Having said this, PHP also provides the similar_text function which has an even worse complexity (O(max(n,m)**3)) but seems to work on longer strings.

星星的軌跡 2024-10-26 00:30:50

我发现 Smith Waterman Gotoh 是最佳算法比较句子。更多信息在此答案中。这是 PHP 代码示例:

class SmithWatermanGotoh 
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     * 
     * @param gapValue
     *            a non-positive gap penalty
     * @param substitution
     *            a substitution function
     */
    public function __construct($gapValue=-0.5, 
                $substitution=null) 
    {
        if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        //if(empty($substitution)) throw new Exception("substitution is required");
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b) 
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }

        if (empty($a) || empty($b)) {
            return 0.0;
        }

        $maxDistance = min(mb_strlen($a), mb_strlen($b))
                * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t) 
    {   
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                    $this->substitution->compare($s, 0, $t, $j));

            $max = max($max, $v0[$j]);
        }

        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

            $max = max($max, $v1[0]);

            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                        $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                $max = max($max, $v1[$j]);
            }

            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }

        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater then
     * <code>mismatchValue</code>
     * 
     * @param matchValue
     *            value when characters are equal
     * @param mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > matchValue");

        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    public function compare($a, $aIndex, $b, $bIndex) {
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
                : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}

$str1 = "Jack is a very nice boy, isn't he?";
$str2 = "jack is a very nice boy is he";
$o = new SmithWatermanGotoh();
echo $o->compare($str1, $str2);

I've found the Smith Waterman Gotoh to be the best algorithm for comparing sentences. More info in this answer. Here is the PHP code example:

class SmithWatermanGotoh 
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     * 
     * @param gapValue
     *            a non-positive gap penalty
     * @param substitution
     *            a substitution function
     */
    public function __construct($gapValue=-0.5, 
                $substitution=null) 
    {
        if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        //if(empty($substitution)) throw new Exception("substitution is required");
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b) 
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }

        if (empty($a) || empty($b)) {
            return 0.0;
        }

        $maxDistance = min(mb_strlen($a), mb_strlen($b))
                * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t) 
    {   
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                    $this->substitution->compare($s, 0, $t, $j));

            $max = max($max, $v0[$j]);
        }

        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

            $max = max($max, $v1[0]);

            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                        $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                $max = max($max, $v1[$j]);
            }

            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }

        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater then
     * <code>mismatchValue</code>
     * 
     * @param matchValue
     *            value when characters are equal
     * @param mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > matchValue");

        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    public function compare($a, $aIndex, $b, $bIndex) {
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
                : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}

$str1 = "Jack is a very nice boy, isn't he?";
$str2 = "jack is a very nice boy is he";
$o = new SmithWatermanGotoh();
echo $o->compare($str1, $str2);
与之呼应 2024-10-26 00:30:50

您可以尝试使用similar_text
如果有 20,000 多个字符(3-5 秒),它可能会变得相当慢,但你提到的示例仅使用句子,这对于这种用法来说效果很好。

需要注意的一件事是,在比较不同大小的字符串时,您不会得到 100% 的结果。例如,如果您将“he”与“head”进行比较,您只会得到 50% 的匹配度。

You could try using similar_text.
It can get quite slow with 20,000+ characters (3-5 seconds) but your example you mention using only sentences, this will work just fine for that usage.

One thing to note is when comparing string of different sizes you will not get 100%. For example if you compare "he" with "head" you would only get a 50% match.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文