Spacy V3 custom sentence segmentation
I want to split a large corpus (.txt) into sentences using only a custom delimiter, i.e. {S}. I am working with spaCy 3.1.
Take the following sentence as an example; it should be treated as a single sentence:
{S} — Quel âge as -tu? demanda Angel. — Je ne sais pas, — Sais -tu faire la soupe ?{S}
spaCy returns:
{S}
—
Quel âge as
-tu?
demanda Angel.
— Je ne sais pas, —
Sais
-tu faire la soupe ?
I have already tried the following, with no luck:
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        if token.text == '{S}':
            doc[token.i+1].is_sent_start = False
    return doc

nlp.add_pipe('segm', first=True)
as well as a rule to treat {S} as a single token:
from spacy.attrs import ORTH

special_case = [{ORTH: "{S}"}]
nlp.tokenizer.add_special_case("{S}", special_case)
Comments (1)
You want to use token.is_sent_start = True to add sentence boundaries, so something more like:
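The code that followed the answer was not captured here, so the following is only a minimal sketch of what such a component could look like. It reuses the "segm" component name and the {S} special case from the question; the fr_core_news_sm model name is an assumption (the question does not name a pipeline). The key change is marking the token after each {S} with is_sent_start = True; the other tokens are explicitly set to False so that only {S} produces a boundary.

import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # The token *after* an {S} delimiter starts a new sentence;
        # every other token is explicitly marked as not starting one,
        # so only {S} creates a boundary.
        doc[token.i + 1].is_sent_start = (token.text == "{S}")
    return doc

nlp = spacy.load("fr_core_news_sm")  # assumption: any pipeline with a parser
# Keep {S} as a single token, as in the question.
nlp.tokenizer.add_special_case("{S}", [{ORTH: "{S}"}])
# The component must run before the parser so the parser respects the
# pre-set boundaries instead of adding its own.
nlp.add_pipe("segm", first=True)

doc = nlp("{S} — Quel âge as -tu? demanda Angel. — Je ne sais pas, — Sais -tu faire la soupe ?{S}")
for sent in doc.sents:
    print(sent.text)

With the boundaries set this way, iterating over doc.sents should yield the passage between the {S} delimiters as a single sentence rather than the fragments shown above.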