Splitting a paragraph into sentences in Python with a regular expression when the text contains abbreviations

Posted on 2024-11-29 02:34:01

I tried using this function on a paragraph that consists of three sentences and contains abbreviations.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split on multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

The first character of the following sentence is removed.

Output received:
 While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
 more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
 is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.

Thus the string was split into only 2 strings, and the first character of the next sentence was removed. Some strange characters also appear; I guess Python wasn't able to convert the dash.

If I change the regex to [.!?][\s]{1,2}, I get:

While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s

Thus even the abbreviations get split.
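On the strange characters mentioned above: the sequence ΓÇô is what the UTF-8 bytes of an en dash look like when they are displayed in code page 437. The sketch below only reproduces that mapping. That the console in question used cp437 is an assumption, though it is consistent with the output shown. The regex itself plays no part in this.

```python
# -*- coding: utf-8 -*-
# Assumption: the console displays raw bytes using code page 437.
dash = u'\u2013'  # EN DASH, the character used in the paragraph

# Encode as UTF-8 (3 bytes: E2 80 93), then misread those bytes as cp437.
garbled = dash.encode('utf-8').decode('cp437')

# The three resulting characters are U+0393, U+00C7, U+00F4 (shown as "ΓÇô").
matches = (garbled == u'\u0393\u00c7\u00f4')
print(matches)
```

So the dash is most likely stored correctly in the string and only rendered wrongly by the terminal.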

Comments (1)

凡尘雨 2024-12-06 02:34:01

The regex you want is:

[.!?][\s]{1,2}(?=[A-Z])

You want a positive lookahead assertion: the pattern matches only if it is followed by a capital letter, but the capital letter itself is not part of the match.
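A minimal sketch of the lookahead in use. The sample text and the names `SENTENCE_END`, `text` and `parts` are made up for illustration. Note that `re.split` still drops whatever the pattern consumes, so the punctuation and whitespace at each boundary disappear, while the capital letter is kept.

```python
import re

# Lookahead variant: a capital letter must follow, but it is not consumed,
# so it remains at the start of the next sentence.
SENTENCE_END = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

# Hypothetical sample text with an abbreviation followed by a lowercase word.
text = "It is the only mango tree. Commonly cultivated (e.g. in India). Its fruit is eaten."

parts = SENTENCE_END.split(text)
for part in parts:
    print(part)
```

Here "e.g. in" is left alone because the next letter is lowercase. An abbreviation followed by a capital letter (for example a title before a name) would still be split, so this is a heuristic rather than a full sentence tokenizer.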

Only the first sentence boundary was matched because there is no space after the second period ("worldwide.In").
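If the input cannot be corrected, one possible workaround is to make the whitespace optional. This is a sketch under that assumption, not part of the pattern given above. `LOOSE_SENTENCE_END` and `split_sentences` are hypothetical names, and the paragraph is a shortened version of the one in the question.

```python
import re

# Assumption: zero to two whitespace characters may follow the punctuation,
# so "worldwide.In" also counts as a sentence boundary.
LOOSE_SENTENCE_END = re.compile(r'[.!?]\s{0,2}(?=[A-Z])')

def split_sentences(paragraph):
    """Split on the loose pattern and discard empty pieces."""
    pieces = LOOSE_SENTENCE_END.split(paragraph)
    return [piece.strip() for piece in pieces if piece.strip()]

p = ("While other species (e.g. horse mango, M. foetida) are also grown, "
     "Mangifera indica is the only mango tree. Commonly cultivated in many "
     "regions, and its fruit is distributed essentially worldwide.In several "
     "cultures, its fruit and leaves are used as decorations.")

sentences = split_sentences(p)
for s in sentences:
    print(s)
```

The trade-off is that dotted initialisms written without spaces (a period directly followed by a capital letter) would now be split as well, so the stricter pattern is safer when the text is well punctuated.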
