Tokenize individual sentences from a string of sentences

Published 2025-01-23 08:22:56


I have an assignment question where I have to find a way to tokenize individual sentences from a string of sentences. A sentence is any sequence of words terminated by a full stop (including the full stop itself).
If no sentences can be segmented, the function returns an empty list. I am also guaranteed that a document will not begin with the full stop character. This is still basic-level Python.

This is the code I have started with, using the split function:

def sentence_segmentation(document):
    """Split a document on full stops."""
    sentence_new = document.split(".")  # note: split() discards the "." delimiter
    final = list(sentence_new)
    return final

However, I'm unsure how to keep the delimiter, especially when there is more than one full stop in a sentence.
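To see the issue concretely (a quick check, not part of the assignment): `str.split` removes the separator, so the full stops are lost and a trailing empty string appears.

```python
# str.split discards the "." delimiter and leaves a trailing empty string
print("sent1. sent2.".split("."))  # → ['sent1', ' sent2', '']
```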

I have included the test cases:

Test 1

document = "sent1. sent2. sent3. sent4. sent5."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent1.', 'sent2.', 'sent3.', 'sent4.', 'sent5.']

Test 2

document = "sent 1. sent 2... sent 3.... sent 4. sent 5.."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent 1.', 'sent 2...', 'sent 3....', 'sent 4.', 'sent 5..']

Test 3

document = "sent1.sent2.sent3.sent4.sent5."
sentences = sentence_segmentation(document)
print(sentences)

Expected result

['sent1.sent2.sent3.sent4.sent5.']

Thanks!


Comments (1)

薯片软お妹 2025-01-30 08:22:56


It looks like you want to split on one or more whitespace characters that are preceded by a `.` char.

Then you can use:

import re

documents = ["sent1. sent2. sent3. sent4. sent5.", "sent 1. sent 2... sent 3.... sent 4. sent 5..", "sent1.sent2.sent3.sent4.sent5."]
# split on runs of whitespace that immediately follow a ".", so the "." stays with its sentence
dot_space_regex = re.compile(r'(?<=\.)\s+')
for doc in documents:
    print(dot_space_regex.split(doc))

Output:

['sent1.', 'sent2.', 'sent3.', 'sent4.', 'sent5.']
['sent 1.', 'sent 2...', 'sent 3....', 'sent 4.', 'sent 5..']
['sent1.sent2.sent3.sent4.sent5.']

Pattern details:

  • (?<=\.) - a positive lookbehind that matches a position immediately preceded by a `.` char
  • \s+ - one or more whitespace characters.
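Since the question mentions basic-level Python, here is a sketch of an alternative without regex: split the document on whitespace and accumulate words until one ends with a full stop. Note the assumption that joining with a single space is acceptable (it normalizes any run of whitespace between words).

```python
def sentence_segmentation(document):
    """Split a document into sentences, keeping the trailing full stop(s).

    Assumes whitespace between words may be normalized to single spaces.
    """
    sentences = []
    current = []
    for word in document.split():       # split on runs of whitespace
        current.append(word)
        if word.endswith("."):          # a full stop ends the current sentence
            sentences.append(" ".join(current))
            current = []
    return sentences                    # words after the last "." are dropped
```

With the three test documents above this produces the same three expected lists, and a document containing no full stop returns an empty list, matching the assignment's requirement.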