Custom segmentation and overriding segmentation rules

Posted on 2025-01-20 21:47:59

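I want to split a large corpus (.txt) into sentences with spaCy 3.1, using a custom rule, namely the delimiter {SENT}.

My main issue is that I want to "disable" the sentence segmentation of the pretrained spaCy models, e.g. en_core_web_lg, but keep all the other components (tokenization, syntactic parser, NER, etc.). I always use the large models (I read that segmentation may behave differently depending on the model used).

Is there a way to override the existing rules and use only {SENT} as a delimiter, while keeping the rest of the pipeline? If I add the custom segmentation to the pipeline before the parser, i.e. nlp.add_pipe(set_custom_segmentation, before='parser'), will the parser re-split the sentences based on the delimiters provided by the model?

I have already tried the following, without success: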

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        if token.text == '{SENT}':
            doc[token.i+1].is_sent_start = False
    return doc

nlp.add_pipe('segm', before='parser')
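
For comparison, here is a minimal sketch of a component that marks every token explicitly: True for the first token and for the token right after each {SENT}, False everywhere else. It rests on two assumptions that are not confirmed in the post: that the default English tokenizer may split {SENT} into several tokens (hence the special case), and that a parser added later respects boundaries that are already set. The component name sent_marker_segmenter and the sample text are made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.symbols import ORTH

MARKER = "{SENT}"


@Language.component("sent_marker_segmenter")  # hypothetical component name
def sent_marker_segmenter(doc):
    """Start a sentence at the first token and right after each marker, nowhere else."""
    previous_was_marker = False
    for token in doc:
        # Every token gets an explicit True/False, so no boundary is left undecided.
        token.is_sent_start = token.i == 0 or previous_was_marker
        previous_was_marker = token.text == MARKER
    return doc


# A blank English pipeline keeps the sketch runnable without downloading a model.
# With a pretrained pipeline one would instead load it with
# spacy.load("en_core_web_lg") and call
# nlp.add_pipe("sent_marker_segmenter", before="parser").
nlp = spacy.blank("en")

# Assumption: without this special case the braces may be split off the marker,
# in which case token.text == "{SENT}" would never match.
nlp.tokenizer.add_special_case(MARKER, [{ORTH: MARKER}])
nlp.add_pipe("sent_marker_segmenter")

doc = nlp("The cat sat. It purred {SENT} Dogs bark. They run {SENT} Birds fly.")
sentences = [sent.text for sent in doc.sents]
```

In this sketch the marker token stays at the end of the sentence it closes; the full stops inside a segment do not start new sentences because those tokens were set to False.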

Solutions I have tried so far that did not work:

Feeding spaCy a list of "sentences" produced with split("{SENT}") and re.split("{SENT}")
The answer proposed here
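
The splitting step of the first attempt can be sketched in plain Python as below. The helper name split_on_marker and the sample corpus are hypothetical. re.escape is used so that the braces are always treated as literal characters rather than regex syntax. Splitting alone probably does not solve the problem, since each chunk passed to nlp would presumably still be segmented again by the parser.

```python
import re

MARKER = "{SENT}"


def split_on_marker(text, marker=MARKER):
    """Split raw text on the literal marker and drop empty chunks."""
    parts = re.split(re.escape(marker), text)
    return [part.strip() for part in parts if part.strip()]


corpus = "First sentence. Still first {SENT} Second one {SENT}  Third."
chunks = split_on_marker(corpus)
```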

Original text

I want to split a large corpus (.txt) into sentences using a custom rule, i.e. the delimiter {SENT}, with spaCy 3.1.

My main issue is that I want to "disable" the sentence segmentation of the pretrained spaCy models, e.g. en_core_web_lg, but keep all the other components (tokenization, syntactic parser, NER, etc.). I always use the large models (I read that segmentation may behave differently depending on the model used).

Is there a way to override the existing rules and use only {SENT} as a delimiter, while keeping the rest of the pipeline?
If I add the custom segmentation to the pipeline before the parser, i.e. nlp.add_pipe(set_custom_segmentation, before='parser'), will the parser re-split the sentences based on the delimiters provided by the model?

I have already tried the following, with no luck:

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        if token.text == '{SENT}':
            doc[token.i+1].is_sent_start = False
    return doc

nlp.add_pipe('segm', before='parser')

Solutions I've tried so far that didn't work: