Schinke Latin stemming algorithm in PHP

Posted on 2024-09-27 12:46:57

This website offers the "Schinke Latin stemming algorithm" for download, for use in the Snowball stemming system.

I want to use this algorithm, but I don't want to use Snowball.

The good thing: There's some pseudocode on that page which you could translate to a PHP function. This is what I've tried:

<?php
function stemLatin($word) {
    // output = array(NOUN-BASED STEM, VERB-BASED STEM)
    // DEFINE CLASSES BEGIN
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
    $suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');
    // DEFINE CLASSES END
    $word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind
    $word = str_replace('j', 'i', $word); // replace all <j> by <i>
    $word = str_replace('v', 'u', $word); // replace all <v> by <u>
    if (substr($word, -3) == 'que') { // if word ends with -que
        if (in_array($word, $queWords)) { // if word is a queWord
            return array($word, $word); // output queWord as both noun-based and verb-based stem
        }
        else {
            $word = substr($word, 0, -3); // remove the -que
        }
    }
    foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            $word = substr($word, 0, -strlen($suffixA)); // remove the suffix
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters
    foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            switch ($suffixB) {
                case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i>
                case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi>
                case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri>
                default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix
            }
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters
    return array($nounBased, $verbBased);
}
?>

My questions:

1) Will this code work correctly? Does it follow the algorithm's rules?

2) How could you improve the code (performance)?

Thank you very much in advance!

翻身的咸鱼 2024-10-04 12:46:57

No, your function will not work: it contains syntax errors. For example, you have unclosed quotes and you use invalid switch syntax.

Here is my rewrite of the function. As the pseudocode on that page isn't really precise, I had to do some interpreting. I interpreted it so that the examples mentioned in the article work.

I also made some optimizations. The first is that I declare the word and suffix arrays as static. Thus all calls to this function share the same arrays, which should be good for performance ;)
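To illustrate what a static local variable does, here is a minimal sketch. The function `countCalls` is hypothetical and not part of the stemmer; it only shows that a static variable is initialised once and keeps its value between calls, which is the property the arrays rely on.

```php
<?php
// Hypothetical function, only to illustrate how a static local variable behaves.
function countCalls() {
    static $calls = 0; // initialised on the first call only, then kept between calls
    $calls++;
    return $calls;
}

countCalls();
countCalls();
echo countCalls(), "\n"; // prints 3
```

Whether this gives a measurable speed-up for constant arrays depends on the PHP version, so treat it as something to measure rather than a guarantee.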

Furthermore, I adjusted the arrays so they can be used more effectively. I changed the $queWords array so it can be used for a fast hash-table lookup instead of a slow in_array. I have also saved the suffix lengths in the arrays, so you don't need to compute them at runtime (which is really, really slow). I may have made more minor optimizations.
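The two array layouts described here can also be derived from plain lists rather than typed by hand. The following is a sketch under that assumption; the short lists and the helper `isQueWord` are hypothetical and only stand in for the full lists.

```php
<?php
// Hypothetical short lists, standing in for the full -que word and suffix lists.
$queList  = array('atque', 'quoque', 'neque');
$suffixes = array('ibus', 'ius', 'ae');

// array_flip() turns the values into keys, so membership becomes a key lookup.
$queSet = array_flip($queList);

// Map each suffix to its length once, instead of calling strlen() in the loop.
$suffixLengths = array_combine($suffixes, array_map('strlen', $suffixes));

// Hypothetical helper: isset() on a key replaces the linear scan of in_array().
function isQueWord(array $set, $word) {
    return isset($set[$word]);
}

var_dump(isQueWord($queSet, 'neque'));      // bool(true)
var_dump(isQueWord($queSet, 'populusque')); // bool(false)
print_r($suffixLengths);                    // ibus => 4, ius => 3, ae => 2
```

Note that `isset()` returns false only for missing keys or null values, so the value 0 that `array_flip()` assigns to the first entry is still found.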

I don't know how much faster this code is, but it should be much faster. Furthermore, it now works on the examples provided.

Here is the code:

<?php
    function stemLatin($word) {
        static $queWords = array(
            'atque'         => 1,
            'quoque'        => 1,
            'neque'         => 1,
            'itaque'        => 1,
            'absque'        => 1,
            'apsque'        => 1,
            'abusque'       => 1,
            'adaeque'       => 1,
            'adusque'       => 1,
            'denique'       => 1,
            'deque'         => 1,
            'susque'        => 1,
            'oblique'       => 1,
            'peraeque'      => 1,
            'plenisque'     => 1,
            'quandoque'     => 1,
            'quisque'       => 1,
            'quaeque'       => 1,
            'cuiusque'      => 1,
            'cuique'        => 1,
            'quemque'       => 1,
            'quamque'       => 1,
            'quaque'        => 1,
            'quique'        => 1,
            'quorumque'     => 1,
            'quarumque'     => 1,
            'quibusque'     => 1,
            'quosque'       => 1,
            'quasque'       => 1,
            'quotusquisque' => 1,
            'quousque'      => 1,
            'ubique'        => 1,
            'undique'       => 1,
            'usque'         => 1,
            'uterque'       => 1,
            'utique'        => 1,
            'utroque'       => 1,
            'utribique'     => 1,
            'torque'        => 1,
            'coque'         => 1,
            'concoque'      => 1,
            'contorque'     => 1,
            'detorque'      => 1,
            'decoque'       => 1,
            'excoque'       => 1,
            'extorque'      => 1,
            'obtorque'      => 1,
            'optorque'      => 1,
            'retorque'      => 1,
            'recoque'       => 1,
            'attorque'      => 1,
            'incoque'       => 1,
            'intorque'      => 1,
            'praetorque'    => 1,
        );
        static $suffixesNoun = array(
            'ibus' => 4,
            'ius'  => 3,
            'ae'   => 2,
            'am'   => 2,
            'as'   => 2,
            'em'   => 2,
            'es'   => 2,
            'ia'   => 2,
            'is'   => 2,
            'nt'   => 2,
            'os'   => 2,
            'ud'   => 2,
            'um'   => 2,
            'us'   => 2,
            'a'    => 1,
            'e'    => 1,
            'i'    => 1,
            'o'    => 1,
            'u'    => 1,
        );
        static $suffixesVerb = array(
            'iuntur' => 6,
            'beris'  => 5,
            'erunt'  => 5,
            'untur'  => 5,
            'iunt'   => 4,
            'mini'   => 4,
            'ntur'   => 4,
            'stis'   => 4,
            'bor'    => 3,
            'ero'    => 3,
            'mur'    => 3,
            'mus'    => 3,
            'ris'    => 3,
            'sti'    => 3,
            'tis'    => 3,
            'tur'    => 3,
            'unt'    => 3,
            'bo'     => 2,
            'ns'     => 2,
            'nt'     => 2,
            'ri'     => 2,
            'm'      => 1,
            'r'      => 1,
            's'      => 1,
            't'      => 1,
        );

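        $word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u

        if (substr($word, -3) == 'que') {
            if (isset($queWords[$word])) {
                return array($word, $word);
            }
            $word = substr($word, 0, -3);
        }

        // default both stems to the normalized word (after any -que removal);
        // they are overwritten below only when a suffix is actually stripped
        $stems = array($word, $word);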

        foreach ($suffixesNoun as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                $tmp = substr($word, 0, -$length);

                if (isset($tmp[1]))
                    $stems[0] = $tmp;
                break;
            }
        }

        foreach ($suffixesVerb as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                switch ($suffix) {
                    case 'iuntur':
                    case 'erunt':
                    case 'untur':
                    case 'iunt':
                    case 'unt':
                        $tmp = substr_replace($word, 'i', -$length, $length);
                    break;
                    case 'beris':
                    case 'bor':
                    case 'bo':
                        $tmp = substr_replace($word, 'bi', -$length, $length);
                    break;
                    case 'ero':
                        $tmp = substr_replace($word, 'eri', -$length, $length);
                    break;
                    default:
                        $tmp = substr($word, 0, -$length);
                }

                if (isset($tmp[1]))
                    $stems[1] = $tmp;
                break;
            }
        }

        return $stems;
    }

    var_dump(stemLatin('aquila'));
    var_dump(stemLatin('portat'));
    var_dump(stemLatin('portis'));
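
Both suffix loops stop at the first match, so the order of the arrays matters: longer suffixes have to come before shorter ones that they end with. A minimal sketch of that effect, using a hypothetical helper `stripFirstSuffix` and two-entry lists that are not part of the function above:

```php
<?php
// Hypothetical helper, not part of the answer's function: strips the first
// suffix in $suffixLengths that the word ends with. Array order decides which wins.
function stripFirstSuffix($word, array $suffixLengths) {
    foreach ($suffixLengths as $suffix => $length) {
        if (substr($word, -$length) === $suffix) {
            return substr($word, 0, -$length);
        }
    }
    return $word; // no suffix matched
}

$longestFirst  = array('ibus' => 4, 'us' => 2);
$shortestFirst = array('us' => 2, 'ibus' => 4);

echo stripFirstSuffix('ciuibus', $longestFirst), "\n";  // ciu
echo stripFirstSuffix('ciuibus', $shortestFirst), "\n"; // ciuib
```

With the shorter suffix listed first, "ibus" can never match, which is why the lists in the function are sorted by descending length.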

花想c 2024-10-04 12:46:57

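As far as I can tell, this follows the algorithm described in your link and should work correctly. (Apart from the syntax errors in the definition of $suffixesA - you're missing a few apostrophes.)

Performance-wise, it doesn't look like there's much to be gained here, but a few things come to mind.

If this function will be called many times during a single execution of the script, there might be something to gain by defining those arrays outside the function - I don't think PHP is clever enough to cache them between calls to the function.

You could also combine the two str_replace calls into one: $word = str_replace(array('j','v'), array('i','u'), $word);, or, since you're replacing single characters with single characters, you could use $word = strtr($word,'jv','iu'); - but I don't think it would make much difference in practice. You'd have to try it out to be sure.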
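
A minimal sketch of the two single-call variants, to show that they normalize a word identically. The sample word is only an illustration.

```php
<?php
$word = 'juvenis'; // sample input, chosen only for illustration

// Variant 1: one str_replace() call with parallel search/replace arrays.
$viaStrReplace = str_replace(array('j', 'v'), array('i', 'u'), $word);

// Variant 2: strtr() maps each character of 'jv' to the character
// at the same position in 'iu'.
$viaStrtr = strtr($word, 'jv', 'iu');

echo $viaStrReplace, "\n"; // iuuenis
echo $viaStrtr, "\n";      // iuuenis
```

One difference worth knowing: str_replace() with arrays applies the replacements one after another, while strtr() translates every character in a single pass. For j => i and v => u the result is the same, because neither replacement produces a character that the other one searches for.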