How to split a paragraph into sentences
I've been trying to use:
$string="The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
preg_match_all('~.*?[?.!]~s',$string,$sentences);
print_r($sentences);
But it breaks on abbreviations such as "Dr." and "U.S.A.".
Does anyone have a better suggestion?
Comments (3)
There is no simple solution for this. You need to do some natural language processing (NLP) in your application to recognize each sentence. There is OpenNLP, a Java-based NLP toolkit, or the Stanford NLP parser, which can be used from Ruby. You can find something similar for PHP.
Here is a set of classes I found for natural language processing in PHP.
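Hmm, maybe try something like:
// Split on the whitespace that follows one or more ., ! or ? characters.
$sentences = preg_split('/(?<=[?.!])\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
Note that this still splits after abbreviations such as "Dr.".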
This is almost impossible: as your example shows, the same punctuation characters also appear inside abbreviations such as "Dr." and "U.S.A.", so punctuation alone cannot tell you where a sentence starts or ends.
You have to look at the characters that follow the punctuation to decide whether a new sentence starts there.
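For what it's worth, here is a rough sketch of that "look at what follows" idea. It is only a heuristic, not a real sentence detector: it treats punctuation as a boundary only when it is followed by whitespace and an ASCII uppercase letter, and then re-joins pieces that end in a known abbreviation. The function name `splitSentences` and the abbreviation list are my own assumptions, not something from this thread or from any library.

```php
<?php
// Hypothetical helper (not from the thread): heuristic sentence splitter.
// Assumptions: sentences start with an ASCII uppercase letter, and the
// abbreviation list below is illustrative, not exhaustive.
function splitSentences(string $text, array $abbreviations = ['Dr.', 'Mr.', 'Mrs.', 'Ms.', 'Prof.']): array
{
    // Candidate boundaries: whitespace preceded by ., ! or ?
    // and followed by an uppercase letter.
    $parts = preg_split('/(?<=[.!?])\s+(?=[A-Z])/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($parts as $part) {
        // Re-join with a single space (original whitespace is normalized).
        $buffer = $buffer === '' ? $part : $buffer . ' ' . $part;

        // If the piece ends in a known abbreviation, the "boundary" was
        // probably a title before a name, so keep accumulating.
        $tokens = preg_split('/\s+/', $buffer);
        $lastWord = end($tokens);
        if (in_array($lastWord, $abbreviations, true)) {
            continue;
        }

        $sentences[] = $buffer;
        $buffer = '';
    }

    // Flush whatever is left, e.g. text that ends in an abbreviation.
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }

    return $sentences;
}

$string = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
print_r(splitSentences($string));
```

On the string from the question this should give two sentences, "The Dr. is here!!!" and the rest, because "Dr." and "U.S.A." are followed by lowercase words there. It will still get things wrong when a sentence really does end in an abbreviation, or when the next sentence starts with a digit, a quote, or a non-ASCII letter, which is why the NLP tools mentioned in the other comment exist.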