Custom segmentation and overriding the segmentation rules
I want to split a large corpus (.txt) into sentences with a custom rule, i.e. a {SENT} delimiter, using spaCy 3.1.
My main issue is that I want to "disable" the sentence segmentation of the pretrained spaCy models, i.e. en_core_web_lg, but keep all the other components (tokenisation, syntactic parser, NER, etc.). I always use the large models (I read that segmentation may behave differently depending on the model used).
Is there a way to override the existing rules and use only {SENT} as a delimiter while keeping the rest of the pipeline?
If I add the custom segmentation to the pipeline before the parser with nlp.add_pipe(set_custom_segmentation, before='parser'), will the parser re-split the sentences based on the boundaries provided by the model?
I already tried the following, with no luck:
@Language.component("segm")
def set_custom_segmentation(doc):
for token in doc[:-1]:
if token.text == '{SENT}':
doc[token.i+1].is_sent_start = False
return doc
nlp.add_pipe('segm', before='parser')
Solutions I've tried so far that didn't work:
- providing spaCy with a list of "sentences" produced by split("{SENT}") and re.split("{SENT}")
- the answer proposed here
Answer:
You have to also set token.is_sent_start = False for the remaining tokens if you don't want the parser to potentially add additional sentence boundaries.
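Putting the question's component and this advice together, here is a minimal sketch (not the answerer's exact code): it marks the token after each {SENT} as a sentence start and every other token as not a start, so the parser has no unset boundaries left to fill in. The add_special_case line is an assumption on my part, one way to keep {SENT} as a single token, since the default English tokenizer may otherwise split it into {, SENT and }:

import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # True right after a {SENT} delimiter, False everywhere else,
        # so the parser cannot add sentence boundaries of its own.
        doc[token.i + 1].is_sent_start = token.text == "{SENT}"
    return doc

nlp = spacy.load("en_core_web_lg")
# Keep {SENT} as a single token so the component can match it (assumed setup).
nlp.tokenizer.add_special_case("{SENT}", [{ORTH: "{SENT}"}])
nlp.add_pipe("segm", before="parser")

doc = nlp("First sentence. {SENT} Second sentence! {SENT} Third.")
print([sent.text for sent in doc.sents])

Note that the {SENT} tokens themselves stay in the Doc (at the end of each sentence); stripping them from the text afterwards, if needed, is left out of this sketch.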