How can I reduce the number of similar phrases in an array using PHP?

Posted on 2024-11-29 08:01:48

I have an array containing phrases (a few to hundreds).

Example:

adhesive materials
adhesive material
material adhesive
adhesive applicator
adhesive applicators
adhesive applications
adhesive application
adhesives applications
adhesive application systems
adhesive application system

Programmatically, using PHP, I'd like to reduce the above list to the following list using something like word stemming. Some variation is acceptable; e.g., adhesive applicator and adhesive application may be difficult to distinguish from one another, since their stem is the same.

adhesive material
material adhesive
adhesive applicator
adhesive application
adhesive application system

What is the best way to do this?

Comments (1)

屋顶上的小猫咪 2024-12-06 08:01:48

You'd decide on a distance threshold and then use PHP's levenshtein function to measure how close two words are; words whose edit distance falls below the threshold are treated as the same word.

It looks like you'd more or less be doing this:

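$origs = array();
// Assumes $setList is already an array of phrases, as in your example.
foreach( $setList as $set )
{
    $pieces = explode( ' ', $set );
    $add = true;
    foreach( $origs as $keySet )
    {
        // Only phrases with the same number of words can be near-duplicates.
        if( count( $pieces ) !== count( $keySet ) ) continue;

        // The phrase is a duplicate when every word is within the
        // threshold of the word in the same position of a kept phrase.
        $close = true;
        foreach( $pieces as $i => $word )
        {
            if( levenshtein( $word, $keySet[ $i ] ) >= 3 )
            {
                $close = false;
                break;
            }
        }
        if( $close )
        {
            $add = false;
            break;
        }
    }

    if( $add ) $origs[] = $pieces;
}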

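You'll be left with a list similar to your desired output. Each entry in $origs is an array of words, so call implode( ' ', ... ) on it to get the phrase back. Some modifications will be needed if you'd rather keep the shortest variant of each phrase in the list, but you get the idea.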
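
For what it's worth, the "keep the shortest variant" modification might look roughly like the sketch below. The helper names phrasesAreClose and reducePhrases and the $maxDistance parameter are made up for illustration; only explode, count, strlen and levenshtein are built-in PHP functions. It assumes phrases are split on single spaces and that two phrases only count as similar when they have the same number of words.

```php
<?php
// Hypothetical helper: true when both phrases have the same word count and
// each pair of words in the same position is within $maxDistance edits.
function phrasesAreClose(array $a, array $b, int $maxDistance): bool
{
    if (count($a) !== count($b)) {
        return false;
    }
    foreach ($a as $i => $word) {
        if (levenshtein($word, $b[$i]) > $maxDistance) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: collapse near-duplicate phrases, keeping the
// shortest phrase seen so far for each group.
function reducePhrases(array $phrases, int $maxDistance = 2): array
{
    $kept = [];
    foreach ($phrases as $phrase) {
        $pieces = explode(' ', $phrase);
        $matched = false;
        foreach ($kept as $k => $existing) {
            if (phrasesAreClose($pieces, explode(' ', $existing), $maxDistance)) {
                // Same group: replace the kept phrase if this one is shorter.
                if (strlen($phrase) < strlen($existing)) {
                    $kept[$k] = $phrase;
                }
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $kept[] = $phrase;
        }
    }
    return $kept;
}

$setList = [
    'adhesive materials', 'adhesive material', 'material adhesive',
    'adhesive applicator', 'adhesive applicators', 'adhesive applications',
    'adhesive application', 'adhesives applications',
    'adhesive application systems', 'adhesive application system',
];

print_r(reducePhrases($setList));
```

With the list from the question this keeps adhesive material, material adhesive, adhesive applicator, adhesive applications and adhesive application system. Note that "adhesive application" gets folded into "adhesive applicator" (the words are only 2 edits apart), which is the kind of variation the question says is acceptable. The grouping is greedy, so the result can depend on the order of the input; a stemmer would be needed to group words by root rather than by spelling distance.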
