spaCy Matcher is not always matching
I can't figure out why the matcher isn't working. This works:
import spacy
from spacy.matcher import Matcher

test = ["14k"]
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("test", [[{"NORM": "14k"}]])
docs = []
for doc in nlp.pipe(test):
    matches = matcher(doc)
    print(matches)
but if I change 14k to 14K in both my matcher and my text, the matcher finds nothing. Why? I just want to understand the difference, why this doesn't work, and how I could go about troubleshooting this myself in the future. I've looked at the docs:

https://spacy.io/api/matcher

and can't figure out where I'm going wrong. I changed "NORM" to ORTH and to TEXT, and it still found nothing. Thank you for any help.
EDIT
OK, so I did:
for ent in doc:
    print(ent)

and for the lowercase version, spaCy was categorizing it all as one ent, but when I uppercased the K, spaCy said it was two different ents. With this knowledge I did

matcher.add("test", [[{"ORTH": "14"}, {"ORTH": "K"}]])
and it worked.
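For reference, that fix as a self-contained snippet (a sketch reusing the setup from the first example):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# two tokens in the pattern, since "14K" is tokenized as "14" + "K"
matcher.add("test", [[{"ORTH": "14"}, {"ORTH": "K"}]])
doc = nlp("14K")
print(matcher(doc))  # one match spanning both tokens: [(match_id, 0, 2)]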
I still want to know why. Why does spaCy think 14k is one "word" but 14K is two "words"?
ANSWER
It looks like you may be running into issues with differences in tokenization for this kind of sequence. In particular, note that things that look like temperatures (so a number + [FCK], i.e. F, C, or K) may get special treatment. This may seem odd, but it usually results in better compatibility with existing corpora.
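To see the difference directly, you can print the tokens the blank English pipeline produces for each string (a quick sketch; the outputs in the comments follow from what the question observed):

import spacy

nlp = spacy.blank("en")
print([t.text for t in nlp("14k")])  # ['14k']      - lowercase k: one token
print([t.text for t in nlp("14K")])  # ['14', 'K']  - uppercase K: split into two tokens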
You can find out why an input is tokenized a particular way using tokenizer.explain().
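A minimal sketch of that call, assuming the blank English pipeline from the question:

import spacy

nlp = spacy.blank("en")
# explain() returns (rule, substring) pairs showing which
# tokenizer rule produced each piece of the input
for rule, piece in nlp.tokenizer.explain("14K"):
    print(rule, piece)

That gives output along these lines:

TOKEN 14
SUFFIX K

i.e. "14" is kept as an ordinary token, while "K" is split off by a suffix rule, which is why a single-token pattern for "14K" can never match.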
You can read more about this at the tokenizer.explain docs.