spaCy negation operator in pattern matching

Posted on 2025-01-16 18:50:59

I am trying to write a pattern in spaCy that matches "black" but not "black beans".

I tried the code below, but it also matches the token after "black" as long as that token is not "bean". How do I modify it to match only "black"?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

#pattern = [{"LOWER": "black"}, {"LEMMA": {"NOT_IN": ["bean", "beans"]}}]
pattern = [{"LOWER": "black"}, {"LEMMA": "bean", "OP": "!"}]
matcher.add("blackbeans", [pattern])

doc = nlp("I liked the black beans, but the avocado was black making the whole meal blackish-looking and not good.")

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Comments (2)

冷弦 2025-01-23 18:50:59

There's no way to do this directly: the Matcher returns every token described by the input pattern, so a negated entry still consumes a token. The negation also needs an actual token to match against, so your pattern will fail if "black" is the last token in the text.

There are a couple of ways to work around this:

  1. You can always match "black" and post-process the matches. This is very simple, though some people don't like it because it doesn't use the Matcher.
  2. You can use the alignments feature of the Matcher, which tells you which part of the pattern each token matches. This is mainly useful with more complex patterns than the one you're using now.
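
As a rough illustration of both workarounds, here is a minimal sketch. It assumes spaCy 3.0.6 or later (for `with_alignments`) and uses a blank English pipeline, so it compares on `LOWER` rather than `LEMMA`. The helper names `black_by_postprocessing` and `black_by_alignments` and the `EXCLUDED_NEXT` set are made up for this example and are not part of spaCy.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for LOWER-based patterns, no model download.
nlp = spacy.blank("en")

EXCLUDED_NEXT = {"bean", "beans"}  # hypothetical list of words to reject

# Workaround 1: match "black" on its own, then filter the matches by hand.
simple_matcher = Matcher(nlp.vocab)
simple_matcher.add("black", [[{"LOWER": "black"}]])

def black_by_postprocessing(doc):
    """Return "black" tokens that are not directly followed by an excluded word."""
    kept = []
    for match_id, start, end in simple_matcher(doc):
        # `end` indexes the token right after the match, if there is one.
        followed_by_excluded = end < len(doc) and doc[end].lower_ in EXCLUDED_NEXT
        if not followed_by_excluded:
            kept.append(doc[start])
    return kept

# Workaround 2: keep a two-token pattern and use alignments to pick out "black".
align_matcher = Matcher(nlp.vocab)
align_matcher.add(
    "black_then_other",
    [[{"LOWER": "black"}, {"LOWER": {"NOT_IN": ["bean", "beans"]}}]],
)

def black_by_alignments(doc):
    """Return tokens aligned to pattern entry 0 ("black") in each match."""
    kept = []
    for match_id, start, end, alignments in align_matcher(doc, with_alignments=True):
        # alignments[i] is the pattern index that matched token start + i.
        for offset, pattern_index in enumerate(alignments):
            if pattern_index == 0:
                kept.append(doc[start + offset])
    return kept

doc = nlp("I liked the black beans, but the avocado was black making the whole meal blackish-looking and not good.")
print([(t.i, t.text) for t in black_by_postprocessing(doc)])
print([(t.i, t.text) for t in black_by_alignments(doc)])
```

Note that the alignments version still needs a following token, so it misses "black" at the very end of a text, while the post-processing version handles that case.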

幸福还没到 2025-01-23 18:50:59

pattern = [{"LOWER": "black"}, {"LOWER": {"NOT_IN": ["bean", "beans"]}, "OP": "?"}]