Splitting a paragraph in Python using a regular expression when it contains abbreviations
I tried using this function on a paragraph consisting of 3 sentences and some abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())
The first character of the next sentence is dropped.
Output received: While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô is the only mango tree ommonly cultivated in many tropical and subtropical regions, and its fruit is di stributed essentially worldwide.In several cultures, its fruit and leaves are ri tually used as floral decorations at weddings, public celebrations and religious.
So the paragraph was split into only 2 strings, and the first character of the second sentence was dropped. Some strange characters also show up; I guess Python wasn't able to convert the hyphen.
If I alter the regex to [.!?][\s]{1,2}, I get:
While other species (e.g horse mango, M foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ Çô is the only mango tree Commonly cultivated in many tropical and subtropical regions, and its fruit is d istributed essentially worldwide.In several cultures, its fruit and leaves are r itually used as floral decorations at weddings, public celebrations and religiou s
So even the abbreviations get split.
The regex you want is:
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
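The pattern itself is missing from the answer as it appears here. As a minimal sketch, assuming the intent was to keep the question's regex and only move the capital letter into a lookahead, something like the following would fit the description. The exact pattern, the function name and the sample text are assumptions for illustration, not the answerer's original code.

```python
import re

# Assumed pattern: the question's regex, with the capital letter wrapped in
# a lookahead (?=...) so it is checked but not consumed by the split.
SENTENCE_ENDERS = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

def split_paragraph_into_sentences(paragraph):
    """Split where ., ! or ? is followed by 1-2 whitespace chars and a capital."""
    return SENTENCE_ENDERS.split(paragraph)

# Hypothetical sample text with an abbreviation in the middle.
text = "It is the only mango tree. Commonly cultivated (e.g. in Asia). Fruit is sold."
sentences = split_paragraph_into_sentences(text)
for s in sentences:
    print(s.strip())
```

With this pattern the leading capital of each following sentence is preserved, and "e.g. in" is not split because the next letter is lowercase. The punctuation and whitespace that the pattern does consume are still removed from the pieces.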
Only the first sentence boundary was matched because there is no space after the second period ("worldwide.In").
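To illustrate the missing-space point, here is a small sketch comparing a pattern that requires whitespace with a hypothetical variant that makes it optional. Both patterns and the sample text are assumptions for illustration. The optional-whitespace variant is not part of the answer and would also split inside strings such as "U.S.A", so adding the missing space to the input may be the safer fix.

```python
import re

# Requires 1-2 whitespace characters after the punctuation.
with_space = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')
# Hypothetical variant: whitespace is optional (0-2 characters).
optional_space = re.compile(r'[.!?]\s{0,2}(?=[A-Z])')

# Sample text with no space after the first period, as in the question.
text = "Its fruit is distributed worldwide.In several cultures it is used (e.g. at weddings)."

strict_result = with_space.split(text)
relaxed_result = optional_space.split(text)

print(len(strict_result))   # the text is left in one piece
print(relaxed_result)       # the text is split at "worldwide.In"
```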