Tokenize individual sentences from a string of sentences

Published 2025-01-23 08:22:56


I have an assignment question where I have to find a way to tokenize individual sentences from a string of sentences. A sentence is any sequence of words terminated by a full stop (including the full stop itself).
If no sentences can be segmented, the function returns an empty list. I am also guaranteed that a document will not begin with the full stop character. This is still basic-level Python.

This is the code I have started with, using the split function:

def sentence_segmentation(document):
    """Split a document on full stops."""
    sentence_new = document.split(".")  # note: split() discards the "." delimiter
    final = list(sentence_new)
    return final

However, I'm unsure how to keep the delimiter, especially when there is more than one full stop in a sentence.
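To see the issue concretely (a quick check, not part of the assignment): `str.split` removes the separator, so the full stops are lost and a trailing empty string appears.

```python
# str.split discards the "." delimiter and leaves a trailing empty string
print("sent1. sent2.".split("."))  # → ['sent1', ' sent2', '']
```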

I have included the test cases:

Test 1

document = "sent1. sent2. sent3. sent4. sent5."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent1.', 'sent2.', 'sent3.', 'sent4.', 'sent5.']

Test 2

document = "sent 1. sent 2... sent 3.... sent 4. sent 5.."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent 1.', 'sent 2...', 'sent 3....', 'sent 4.', 'sent 5..']

Test 3

document = "sent1.sent2.sent3.sent4.sent5."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent1.sent2.sent3.sent4.sent5.']

Thanks!


Comments (1)

薯片软お妹 2025-01-30 08:22:56


It looks like you want to split on one or more whitespace characters that are preceded by a `.` char.

Then you can use:

import re

documents = ["sent1. sent2. sent3. sent4. sent5.", "sent 1. sent 2... sent 3.... sent 4. sent 5..", "sent1.sent2.sent3.sent4.sent5."]
# split on runs of whitespace that immediately follow a ".", so the "." stays with its sentence
dot_space_regex = re.compile(r'(?<=\.)\s+')
for doc in documents:
    print(dot_space_regex.split(doc))

Output:

['sent1.', 'sent2.', 'sent3.', 'sent4.', 'sent5.']
['sent 1.', 'sent 2...', 'sent 3....', 'sent 4.', 'sent 5..']
['sent1.sent2.sent3.sent4.sent5.']

Pattern details:

  • (?<=\.) - a positive lookbehind that matches a position immediately preceded by a `.` char
  • \s+ - one or more whitespace characters.
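Since the question mentions basic-level Python, here is a sketch of an alternative without regex: split the document on whitespace and accumulate words until one ends with a full stop. Note the assumption that joining with a single space is acceptable (it normalizes any run of whitespace between words).

```python
def sentence_segmentation(document):
    """Split a document into sentences, keeping the trailing full stop(s).

    Assumes whitespace between words may be normalized to single spaces.
    """
    sentences = []
    current = []
    for word in document.split():       # split on runs of whitespace
        current.append(word)
        if word.endswith("."):          # a full stop ends the current sentence
            sentences.append(" ".join(current))
            current = []
    return sentences                    # words after the last "." are dropped
```

With the three test documents above this produces the same three expected lists, and a document containing no full stop returns an empty list, matching the assignment's requirement.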