PHP sentence boundary detection
I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing that fits PHP. Do you know of such a tool?
An enhanced regex solution
Assuming you do care about handling Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:
Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
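The script itself did not survive on this page, so here is a minimal sketch of the kind of expression described above, not the answer's original code. It assumes a small, hand-picked abbreviation list and a hypothetical function name, splitSentencesEnhanced. It splits on whitespace that follows end punctuation (optionally followed by a closing quote) unless that punctuation belongs to a listed abbreviation.

```php
<?php
// Sketch only: the function name and the abbreviation list are assumptions.
function splitSentencesEnhanced(string $text): array
{
    $re = '/
        (?<= [.!?] | [.!?][\'"] )   # preceded by end punctuation, optionally plus a closing quote
        (?<! Mr\. | Mrs\. | Ms\. | Dr\. | Prof\. | T\.V\.A\. )   # but not by one of these abbreviations
        \s+                         # split on the whitespace between sentences
        (?= \S )                    # and never on trailing whitespace
        /x';

    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$text = 'This is sentence one. Sentence "seven." '
      . 'Dr. Jones said: "Mrs. Smith you have a lovely daughter!" '
      . 'The T.V.A. is a big project!';

foreach (splitSentencesEnhanced($text) as $i => $sentence) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentence);
}
```

The pattern is deliberately case-sensitive: with the i flag, an entry like Ms. would also suppress the split after any word ending in "ms.", such as "problems.".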
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)
Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.
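For reference, the simplified expression /(?<=[.!?])\s+(?=\S)/ can be passed straight to preg_split. The wrapper name splitSentencesSimple and the sample strings below are made up for illustration. The second call shows the trade-off: with no abbreviation handling, the text is also split after Dr.

```php
<?php
// Sketch only: splitSentencesSimple is a hypothetical wrapper name.
function splitSentencesSimple(string $text): array
{
    // Split on whitespace that follows ., ! or ? and is followed by a non-space character.
    return preg_split('/(?<=[.!?])\s+(?=\S)/', $text);
}

print_r(splitSentencesSimple('First sentence. Second one! Is this the third? Yes.'));

// No abbreviation handling here, so "Dr." ends up as its own piece.
print_r(splitSentencesSimple('Dr. Jones left.'));
```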
Slight improvement on someone else's work:
As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.
Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.
I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?
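A rough sketch of how that explode loop could look, assuming a hypothetical function name (splitByExplode) and a tiny default title list. Each pass explodes every piece on one needle and puts the needle back, and a final pass glues a piece onto the next one when it ends in a known title.

```php
<?php
// Sketch only: the function name and the default abbreviation list are assumptions.
function splitByExplode(string $text, array $abbreviations = ['Mr.', 'Mrs.', 'Dr.']): array
{
    $pieces = [$text];
    foreach (['.', '!', '?'] as $needle) {
        $next = [];                          // temporary array for this pass
        foreach ($pieces as $piece) {
            $parts = explode($needle, $piece);
            $last  = count($parts) - 1;
            foreach ($parts as $i => $part) {
                // explode() drops the needle, so restore it on every part except the last.
                $next[] = $i < $last ? $part . $needle : $part;
            }
        }
        $pieces = $next;
    }

    // Master array: merge pieces that end in a known title with the piece that follows.
    $sentences = [];
    $buffer = '';
    foreach ($pieces as $piece) {
        $buffer .= $piece;
        $trimmed = trim($buffer);
        if ($trimmed === '') {
            continue;
        }
        $words = explode(' ', $trimmed);
        if (in_array(end($words), $abbreviations, true)) {
            continue;                        // "Dr." etc.: keep collecting
        }
        $sentences[] = $trimmed;
        $buffer = '';
    }
    if (trim($buffer) !== '') {
        $sentences[] = trim($buffer);        // leftover text ending in an abbreviation
    }
    return $sentences;
}

print_r(splitByExplode('Dr. Smith arrived. Was he late? No!'));
```

This version also breaks on every other period, so decimals such as 3.14 and dotted acronyms such as T.V.A. would need extra exception checks of the same kind.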
I was using this regex:
Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses
because I decided speed was more important than accuracy.
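Neither snippet from this answer survived on the page, so the following is only a guess at the kind of trade-off being described, not the poster's code. It assumes a regex that requires a capital letter after the whitespace (which is why a sentence starting with a number is missed) and a plain explode on periods as the faster, less accurate alternative. Both function names are hypothetical.

```php
<?php
// Sketch only: both functions are illustrative assumptions, not the original snippets.

// Stricter: split only when the next sentence starts with an uppercase ASCII letter.
function splitBeforeCapital(string $text): array
{
    return preg_split('/(?<=[.!?])\s+(?=[A-Z])/', $text);
}

// Faster and cruder: break on every period and discard empty pieces.
function splitOnPeriods(string $text): array
{
    $parts = array_map('trim', explode('.', $text));
    return array_values(array_filter($parts, fn ($part) => $part !== ''));
}

$text = 'It cost 5 dollars. 3 people paid. Then we left.';
print_r(splitBeforeCapital($text)); // the sentence starting with "3" stays attached to the first one
print_r(splitOnPeriods($text));
```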
Build a list of abbreviations like this:
Compile them into an expression:
Last, run this preg_split to break into sentences:
And if you're processing HTML, watch for tags (such as <p></p>) getting deleted, which eliminates the space between sentences.
If you have situations.Like this where.They stick together, it becomes immensely more difficult to parse.
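The three steps in that answer (abbreviation list, compiled expression, preg_split) and the HTML caveat could be sketched as follows. The code from the answer is not available here, so the helper names buildSentencePattern and htmlToText, the abbreviation list, and the sample markup are all assumptions.

```php
<?php
// Sketch only: helper names, abbreviation list and sample markup are assumptions.

// Compile a list of abbreviations into one split pattern.
function buildSentencePattern(array $abbreviations): string
{
    $guards = '';
    foreach ($abbreviations as $abbr) {
        // One negative lookbehind per abbreviation; preg_quote escapes the dots.
        $guards .= '(?<!\b' . preg_quote($abbr, '/') . ')';
    }
    return '/(?<=[.!?])' . $guards . '\s+(?=\S)/';
}

// Strip tags without gluing neighbouring sentences together.
function htmlToText(string $html): string
{
    $spaced = str_replace('<', ' <', $html);   // keep a space where each tag used to be
    $text   = strip_tags($spaced);
    return trim(preg_replace('/\s+/', ' ', $text));
}

$pattern   = buildSentencePattern(['Mr.', 'Mrs.', 'Dr.', 'esp.']);
$text      = htmlToText('<p>Mr. Smith likes situations.</p><p>Like this one.</p>');
$sentences = preg_split($pattern, $text);
print_r($sentences);
```

Using a separate lookbehind for each abbreviation keeps the generated pattern simple, since every lookbehind then has a single fixed length.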
@ridgerunner I wrote your PHP code in C#.
I get 2 sentences as a result:
The correct result should be the single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
And with our test paragraph:
The result is:
C# code: