Automatically generating a summary from strings

Posted 2024-10-20 04:42:50

Given an array of input strings, we need to generate a very simple form of summary by trimming the end of each string so that the joined result fits within a given length.

Here is a first version of the function:

// Take an array of strings and generate a summary within a given length
function stringSummaryFromMetadata($inArray,$len=80,$sep='§'){

    // Filter out 'false' values (empty strings, null, '0', ...)
    $inputs=array_filter($inArray);

    // First try just imploding array
    $res=implode($sep,$inputs);

    // Check for length
    if(mb_strlen($res,'UTF-8')>$len){

        // Calculate 'z', the fixed width allowed per string, after
        // reserving room for the (count-1) separators. floor() rather
        // than round() so the result can never exceed $len.
        $x=count($inputs);
        $z=max(0,(int)floor(($len-($x-1)*mb_strlen($sep,'UTF-8'))/$x));

        // Snip all strings to 'z' (same encoding as the length check)
        $t1=array();
        foreach($inputs as $i) $t1[]=mb_substr($i,0,$z,'UTF-8');

        // Final answer
        $res=implode($sep,$t1);
    }

    return $res;
}

A test:

$test=array(
    'Ligula diam risus tempus lorem sit',
    'Cursus metus commodo enim odio orci',
    'Metus sapien porta sapien fusce sodales',
    'king queen'
);
$out=stringSummaryFromMetadata($test);
print $out;

Which gives:

Ligula diam risus t§Cursus metus commod§Metus sapien porta §king queen

That's good enough, but I'm sure it could be improved. For example, the test output is shorter than 80 characters, trimmed strings can end in whitespace, and words are chopped in the middle.

Before I go off on a tangent and roll my own, I would like to ask the community whether this has been asked before and/or whether an algorithm for this already exists.
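
For illustration, here is a minimal sketch of one way the three shortcomings listed above might be reduced. It is not an established algorithm: the names summarizeStrings and cutAtWord are hypothetical, and the approach (give short strings their full length, split the remaining budget evenly among the longer ones, then cut each at the last space that fits) is just one assumption about what "better" means here.

```php
<?php
// Hypothetical helper: cut $text to at most $max characters,
// backing up to the previous space when the cut would land mid-word.
function cutAtWord(string $text, int $max): string {
    if (mb_strlen($text, 'UTF-8') <= $max) {
        return $text;
    }
    $cut = mb_substr($text, 0, $max, 'UTF-8');
    $next = mb_substr($text, $max, 1, 'UTF-8');
    if ($next !== ' ') {
        $pos = mb_strrpos($cut, ' ', 0, 'UTF-8');
        if ($pos !== false && $pos > 0) {
            $cut = mb_substr($cut, 0, $pos, 'UTF-8');
        }
        // With no space at all, the word is still chopped at $max.
    }
    return rtrim($cut); // drop trailing whitespace left by the cut
}

// Hypothetical replacement for the first version of the function.
function summarizeStrings(array $strings, int $len = 80, string $sep = '§'): string {
    // Trim every input and drop empty ones.
    $inputs = array_values(array_filter(
        array_map('trim', $strings),
        function ($s) { return $s !== ''; }
    ));
    $n = count($inputs);
    if ($n === 0) {
        return '';
    }
    $joined = implode($sep, $inputs);
    if (mb_strlen($joined, 'UTF-8') <= $len) {
        return $joined;
    }

    // Characters left for the strings once the separators are paid for.
    $remaining = max(0, $len - ($n - 1) * mb_strlen($sep, 'UTF-8'));

    // Visit strings shortest-first: each takes min(own length, even share
    // of what is left), so space unused by short strings goes to long ones.
    $lengths = array();
    foreach ($inputs as $idx => $s) {
        $lengths[$idx] = mb_strlen($s, 'UTF-8');
    }
    asort($lengths);
    $quota = array();
    $k = 0;
    foreach ($lengths as $idx => $length) {
        $quota[$idx] = min($length, intdiv($remaining, $n - $k));
        $remaining -= $quota[$idx];
        $k++;
    }

    $parts = array();
    foreach ($inputs as $idx => $s) {
        $parts[] = cutAtWord($s, $quota[$idx]);
    }
    return implode($sep, $parts);
}
```

With the test array from the question, summarizeStrings($test) should return "Ligula diam risus§Cursus metus commodo§Metus sapien porta§king queen". Cutting at word boundaries still leaves part of the 80-character budget unused, so this trades length for readability rather than filling the line exactly.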


Comments (2)

天涯沦落人 2024-10-27 04:42:50

You may use wordwrap and then count how many lines are in the resulting string. If more than one, your text was longer than needed, so you append your separator to the end of the first line, and discard the other lines. If there's only one line, your text was shorter, so no trimming was done.

It seems that wordwrap is not UTF-8 aware, but there's a comment that shows a working utf8_wordwrap function.
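
The wordwrap idea described above could be sketched roughly as follows. The function name firstWrappedLine and its parameters are hypothetical. The sketch uses PHP's built-in wordwrap(), which counts bytes, so it is only reliable for single-byte text, and it assumes the input contains no newlines of its own.

```php
<?php
// Hypothetical helper: keep only the first wrapped line of $text.
function firstWrappedLine(string $text, int $len = 80, string $sep = '§'): string {
    // Break at word boundaries every $len bytes; the fourth argument
    // makes wordwrap() split words that are longer than $len.
    $wrapped = wordwrap($text, $len, "\n", true);
    $lines = explode("\n", $wrapped);

    if (count($lines) > 1) {
        // The text was too long: keep the first line, append the separator.
        return $lines[0] . $sep;
    }
    // Only one line: the text already fits, nothing was trimmed.
    return $text;
}
```

For example, firstWrappedLine('Ligula diam risus tempus lorem sit', 20) should give "Ligula diam risus§". Note that the appended separator is not counted against $len, so the result can be slightly longer than the limit.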

川水往事 2024-10-27 04:42:50

You can also build an automatic text summarization algorithm, as described in the paper "Extraction based summarization using a shortest path algorithm". This approach is not very hard to implement.

Good luck!
