spaCy Matcher is not always matching
I can't figure out why the matcher isn't working. This works:
import spacy
from spacy.matcher import Matcher

test = ["14k"]
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("test", [[{"NORM": "14k"}]])
docs = []
for doc in nlp.pipe(test):
    matches = matcher(doc)
    print(matches)
but if I change 14k to 14K in both my matcher and my text, the matcher finds nothing. Why? I just want to understand the difference, why this doesn't work, and how I could go about troubleshooting this myself in the future. I've looked at the docs:

https://spacy.io/api/matcher

and can't figure out where I'm going wrong. I changed "NORM" to ORTH and to TEXT, and it still found nothing. Thank you for any help.
EDIT
OK, so I did:
for ent in doc:
    print(ent)

and for the lowercase version, spaCy was categorizing it all as one ent, but when I uppercased the K, spaCy said it was two different ents. With this knowledge I did

matcher.add("test", [[{"ORTH": "14"}, {"ORTH": "K"}]])
and it worked.
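For reference, that fix as a self-contained snippet (a sketch reusing the setup from the first example):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# two tokens in the pattern, since "14K" is tokenized as "14" + "K"
matcher.add("test", [[{"ORTH": "14"}, {"ORTH": "K"}]])
doc = nlp("14K")
print(matcher(doc))  # one match spanning both tokens: [(match_id, 0, 2)]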
I still want to know why. Why does spaCy think 14k is one "word" but 14K is two "words"?
ANSWER
It looks like you may be running into issues with differences in tokenization for this kind of sequence. In particular, note that things that look like temperatures (so a number + [FCK], i.e. F, C, or K) may get special treatment. This may seem odd, but it usually results in better compatibility with existing corpora.
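To see the difference directly, you can print the tokens the blank English pipeline produces for each string (a quick sketch; the outputs in the comments follow from what the question observed):

import spacy

nlp = spacy.blank("en")
print([t.text for t in nlp("14k")])  # ['14k']      - lowercase k: one token
print([t.text for t in nlp("14K")])  # ['14', 'K']  - uppercase K: split into two tokens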
You can find out why an input is tokenized a particular way using tokenizer.explain().
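A minimal sketch of that call, assuming the blank English pipeline from the question:

import spacy

nlp = spacy.blank("en")
# explain() returns (rule, substring) pairs showing which
# tokenizer rule produced each piece of the input
for rule, piece in nlp.tokenizer.explain("14K"):
    print(rule, piece)

That gives output along these lines:

TOKEN 14
SUFFIX K

i.e. "14" is kept as an ordinary token, while "K" is split off by a suffix rule, which is why a single-token pattern for "14K" can never match.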
You can read more about this at the tokenizer.explain docs.