php句子边界检测

发布于 2024-10-17 21:53:11 字数 126 浏览 13 评论 0原文

我想用 PHP 将文本分成句子。我目前正在使用正则表达式,它的准确率约为 95%,并且希望通过使用更好的方法来改进。我见过用 Perl、Java 和 C 实现此目的的 NLP 工具,但没有看到任何适合 PHP 的工具。你知道这样的工具吗?

I would like to divide a text into sentences in PHP. I'm currently using a regex, which brings ~95% accuracy and would like to improve by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that fits PHP. Do you know of such a tool?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

你げ笑在眉眼 2024-10-24 21:53:11

增强的正则表达式解决方案

假设您确实关心处理:Mr.Mrs. 等缩写,那么以下单个正则表达式解决方案效果很好:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

请注意,您可以轻松添加或者从表达式中删除缩写。给出以下测试段落:

这是第一句话。第二句话!第三句?句“四”。句“五”!句子“六”?句子“七”。句子“八!”琼斯医生说:“史密斯夫人,您有一个可爱的女儿!” TVA 是一个大项目!

以下是脚本的输出:

Sentence[1] = [这是第一句话。]
句子[2] = [句子二!]
句子[3] = [句子三?]
句子[4] = [句子“四”。]
Sentence[5] = [句子“五”!]
句子[6] = [句子“六”?]
句子[7] = [句子“七。”]
句子[8] = [句子“八!”]
句子[9] = [博士。琼斯说:“史密斯夫人,您有一个可爱的女儿!”]
Sentence[10] = [The TVA is a big project!]

基本的正则表达式解决方案

问题的作者评论说,上述解决方案“忽略了许多选项”,并且不是足够通用。我不确定这意味着什么,但上述表达式的本质是尽可能干净和简单的。如下:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

请注意,两种解决方案都能正确识别结尾标点符号后以引号结尾的句子。如果您不关心匹配以引号结尾的句子,则正则表达式可以简化为: /(?<=[.!?])\s+(?=\S)/

编辑:20130820_1000向正则表达式和测试字符串添加了TVA(另一个要忽略的标点词)。 (回答PapyRef的评论问题)

编辑:20130820_1800整理并重命名正则表达式并添加shebang。还修复了正则表达式,以防止在尾随空格上分割文本。

An enhanced regex solution

Assuming you do care about handling: Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]

The essential regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.

Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)

Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.

水溶 2024-10-24 21:53:11

稍微改进别人的工作:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.              # or initials ex: "George W. Bush",
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);

Slight improvement on someone else's work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.              # or initials ex: "George W. Bush",
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
日记撕了你也走了 2024-10-24 21:53:11

作为一种低技术含量的方法,您可能需要考虑在循环中使用一系列 explode 调用,使用 .、! 和 ?作为你的针。这将占用大量内存和处理器(就像大多数文本处理一样)。您将拥有一堆临时数组和一个主数组,其中所有找到的句子都按正确的顺序进行数字索引。

此外,您还必须检查常见异常(例如 Mr.Dr. 等标题中的 .),但由于所有内容都在数组中,因此这些类型支票应该不会那么糟糕。

我不确定这在速度和扩展方面是否比正则表达式更好,但值得一试。您想要分解成句子的这些文本块有多大?

As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.

Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.

I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?

乖乖公主 2024-10-24 21:53:11

我正在使用这个正则表达式:

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

不适用于以数字开头的句子,但误报也很少。当然,你所做的事情也很重要。我的程序现在使用它

explode('.',$text);

是因为我认为速度比准确性更重要。

I was using this regex:

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses

explode('.',$text);

because I decided speed was more important than accuracy.

反目相谮 2024-10-24 21:53:11

像这样构建一个缩写列表

$skip_array = array ( 

'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr' , etc.

将它们编译成一个表达式

$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

最后运行这个 preg_split 来分解成句子。

$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);

如果您正在处理 HTML,请注意标记被删除,这些标记消除了句子之间的空格。

如果您有situations.Like where. They粘在一起就变得非常难以解析。

Build a list of abbreviations like this

$skip_array = array ( 

'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr' , etc.

Compile them into a an expression

$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

Last run this preg_split to break into sentences.

$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);

And if you're processing HTML, watch for tags getting deleted which eliminate the space between sentences.<p></p> If you have situations.Like this where.They stick together it becomes immensely more difficult to parse.

终遇你 2024-10-24 21:53:11

@ridgerunner 我用 C# 编写了你的​​ PHP 代码,

结果是 2 句话:

  • 先生。 J. Dujardin régle sa TV
  • A.特别是uniquement

正确的结果应该是这样的句子:Mr. J. Dujardin régle sa TVA en esp.独特性

以及我们的测试段落

string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

结果是

index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# 代码:

                string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
                Regex rx = new Regex(@"(\S.+?
                                       [.!?]               # Either an end of sentence punct,
                                       | [.!?]['""]         # or end of sentence punct and quote.
                                       )
                                       (?<!                 # Begin negative lookbehind.
                                          Mr.                   # Skip either Mr.
                                        | Mrs.                  # or Mrs.,
                                        | Ms.                   # or Ms.,
                                        | Jr.                   # or Jr.,
                                        | Dr.                   # or Dr.,
                                        | Prof.                 # or Prof.,
                                        | Sr.                   # or Sr.,
                                        | \s[A-Z].              # or initials ex: George W. Bush,
                                        | T\.V\.A\.             # or "T.V.A."
                                       )                    # End negative lookbehind.
                                       (?=|\s+|$)", 
                                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
                foreach (Match match in rx.Matches(sText))
                {
                    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
                }

@ridgerunner I wrote your PHP code in C #

I get like 2 sentences as result :

  • Mr. J. Dujardin régle sa T.V.
  • A. en esp. uniquement

The correct result should be the sentence : Mr. J. Dujardin régle sa T.V.A. en esp. uniquement

and with our test paragraph

string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

The result is

index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# code :

                string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
                Regex rx = new Regex(@"(\S.+?
                                       [.!?]               # Either an end of sentence punct,
                                       | [.!?]['""]         # or end of sentence punct and quote.
                                       )
                                       (?<!                 # Begin negative lookbehind.
                                          Mr.                   # Skip either Mr.
                                        | Mrs.                  # or Mrs.,
                                        | Ms.                   # or Ms.,
                                        | Jr.                   # or Jr.,
                                        | Dr.                   # or Dr.,
                                        | Prof.                 # or Prof.,
                                        | Sr.                   # or Sr.,
                                        | \s[A-Z].              # or initials ex: George W. Bush,
                                        | T\.V\.A\.             # or "T.V.A."
                                       )                    # End negative lookbehind.
                                       (?=|\s+|$)", 
                                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
                foreach (Match match in rx.Matches(sText))
                {
                    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
                }
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文