Automatically generating a summary from strings
Given an array of strings, we need to generate a very simple form of summary by trimming the end of each string so that the result fits within a given length.
Here is a first version of the function:
// Take an array of strings and generate a summary within a given length
function stringSummaryFromMetadata($inArray, $len = 80, $sep = '§') {
    // Filter out 'false' values (empty strings, null, '0', ...)
    $inputs = array_filter($inArray);
    // First try just imploding the array
    $res = implode($sep, $inputs);
    // Check for length
    if (mb_strlen($res, 'UTF-8') > $len) {
        // Calculate 'z', the fixed width allowed for each string
        $x = count($inputs);
        // floor() instead of round(): rounding up could make the result longer than $len
        $z = (int) floor(($len - $x) / $x);
        // Snip all strings to 'z' characters
        $t1 = array();
        foreach ($inputs as $i) {
            $t1[] = mb_substr($i, 0, $z, 'UTF-8');
        }
        // Final answer
        $res = implode($sep, $t1);
    }
    return $res;
}
A test:
$test = array(
    'Ligula diam risus tempus lorem sit',
    'Cursus metus commodo enim odio orci',
    'Metus sapien porta sapien fusce sodales',
    'king queen'
);
$out = stringSummaryFromMetadata($test);
print $out;
Which gives:
Ligula diam risus t§Cursus metus commod§Metus sapien porta §king queen
That's good enough, but I'm sure it can be much closer to optimal. For example, the test output uses only 70 of the 80 characters allowed, trimmed strings can end in whitespace, words are chopped in the middle, and so on.
Before I go off on a tangent and roll my own, I would like to ask the community whether this has been asked before and/or whether an algorithm for this already exists.
Comments (2)
You can use wordwrap and then count how many lines the resulting string has. If there is more than one, your text was longer than allowed, so you append your separator to the end of the first line and discard the other lines. If there is only one line, your text already fits and no trimming is needed.
It seems that wordwrap is not UTF-8 aware, but there is a comment that shows a working utf8_wordwrap function.
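A minimal sketch of the wordwrap idea, in case it helps. The helper name firstWrappedLine is hypothetical (it is not from the answer), and it assumes single-byte text without embedded newlines, since the built-in wordwrap counts bytes rather than characters:

```php
<?php
// Hypothetical helper, not code from the answer: keep only the first line
// that wordwrap() produces, so the cut falls on a word boundary.
// Assumption: $text is single-byte (ASCII) and contains no "\n".
function firstWrappedLine(string $text, int $width, string $sep = '§'): string
{
    // cut_long_words = true, so a single word longer than $width is still cut
    $lines = explode("\n", wordwrap($text, $width, "\n", true));
    if (count($lines) === 1) {
        return $text; // the text already fits, nothing was trimmed
    }
    // Longer than $width: keep the first line and mark the cut with $sep
    return $lines[0] . $sep;
}

echo firstWrappedLine('Ligula diam risus tempus lorem sit', 19), "\n";
// prints "Ligula diam risus§"
```

Because the break replaces the space before the first word that does not fit, the kept line has no trailing whitespace and no chopped word, which covers two of the issues raised in the question. For multi-byte input, a UTF-8 aware wrapper would have to take the place of wordwrap here.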
You can also build an automatic text summarization algorithm, as described in the paper "Extraction based summarization using a shortest path algorithm". This approach is not very hard to implement.
Good luck!