Tokenize individual sentences from a string of sentences
I have an assignment question where I have to find a way to tokenize individual sentences from a string of sentences. A sentence is any sequence of words terminated by a full stop, and it includes the full stop itself.
If no sentences can be segmented, the function returns an empty list. I am also guaranteed that a document will not begin with the full stop character. This is still basic-level Python.
This is the code I have started with, using the split function:
def sentence_segmentation(document):
    """splits each word"""
    sentence_new = document.split(".")
    final = list(sentence_new)
    return final
However, I'm unsure how to keep the delimiter, especially when a sentence ends with more than one full stop.
I have included the test cases:
Test 1
document = "sent1. sent2. sent3. sent4. sent5."
sentences = sentence_segmentation(document)
print(sentences)
Result
['sent1.', 'sent2.', 'sent3.', 'sent4.', 'sent5.']
Test 2
document = "sent 1. sent 2... sent 3.... sent 4. sent 5.."
sentences = sentence_segmentation(document)
print(sentences)
Result
['sent 1.', 'sent 2...', 'sent 3....', 'sent 4.', 'sent 5..']
Test 3
document = "sent1.sent2.sent3.sent4.sent5."
sentences = sentence_segmentation(document)
print(sentences)
Result
['sent1.sent2.sent3.sent4.sent5.']
Thanks!
It looks like you want to split on one or more whitespace characters that are preceded by a . character. You can then split the document on this regular expression:
(?<=\.)\s+
Pattern details:
(?<=\.) - a positive lookbehind that matches a location immediately preceded by a . character
\s+ - one or more whitespace characters.
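A minimal sketch of how this pattern could be applied, assuming re.split from Python's standard library. The final filter, which drops any piece that does not end with a full stop, is an assumption added here to cover the "return an empty list" requirement from the question; it is not part of the pattern itself.

```python
import re


def sentence_segmentation(document):
    """Split a document into sentences, keeping the trailing full stops."""
    # Split on runs of whitespace that directly follow a full stop.
    # The lookbehind consumes nothing, so the full stops stay attached.
    parts = re.split(r"(?<=\.)\s+", document.strip())
    # Assumption: a piece that does not end with "." is not a sentence,
    # so a document without any full stop yields an empty list.
    return [part for part in parts if part.endswith(".")]


if __name__ == "__main__":
    print(sentence_segmentation("sent 1. sent 2... sent 3.... sent 4. sent 5.."))
```

Because the split only happens where whitespace follows a full stop, the document from test 3, which contains no whitespace at all, stays a single item.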