Why doesn't the spaCy morphologizer work when we use a custom tokenizer?
I don't understand why, when I do this:
import spacy
from copy import deepcopy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
the tokens don't have any morph assigned:
print([tok.morph for tok in doc])
> ['','','','','']
Is this behavior expected? If yes, why? (spaCy v3.0.7)
1 Answer
The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy. I admit that I'm not entirely sure off the top of my head why you end up with empty analyses instead of more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out-of-sync between the two vocabs.
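For what it's worth, here is a minimal sketch of a fix (my own illustration, not a confirmed spaCy recipe): keep a reference to the pipeline's tokenizer instead of deep-copying it, so the wrapper and the rest of the pipeline keep sharing one Vocab. The vocab property is an assumption added for completeness, mirroring the attribute that spaCy's own Tokenizer exposes.

import spacy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        # Keep a reference to the pipeline's tokenizer (no deepcopy),
        # so its Vocab stays the same object as nlp.vocab.
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)

    @property
    def vocab(self):
        # Assumed convenience accessor: expose the wrapped tokenizer's
        # vocab so nlp.tokenizer.vocab still points at the shared Vocab.
        return self.tokenizer.vocab

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")

# Both references now point at the exact same Vocab object ...
assert nlp.vocab is nlp.tokenizer.vocab
# ... so the morphologizer should assign non-empty analyses again.
print([tok.morph for tok in doc])

Dropping the deepcopy means the wrapper shares mutable state with the pipeline, which is exactly what spaCy expects here: the tokenizer writes its tokens into the same Vocab that the later pipeline components read from.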