How to match a repeating pattern in spaCy?

Posted on 2025-01-12 12:01:40

I have a question similar to the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case is that my pattern is defined by POS and dependency tags. As a consequence, I don't think I can easily use regex to solve my problem (as suggested in the accepted answer of the linked post).

For example, let's assume we analyze the following sentence:

"She told me that her dog was big, black and strong."

The following code would allow me to match the list of adjectives at the end of the sentence:

import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")

# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])


matches = matcher(doc)

Running this code matches "big, black and strong". However, the pattern does not find the list of adjectives in sentences such as "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".

How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.

Thanks for any hints.

同尘 2025-01-19 12:01:40

The issue isn't fundamentally different from the one in the linked question: the Matcher has no facility for repeating a multi-token pattern like that. You can use a for loop to build multiple patterns, one per list length, to capture what you want.

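patterns = []
# ii is the number of leading "ADJ, punctuation" pairs.
# Starting at 0 also covers the two-adjective case ("big and black").
for ii in range(0, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)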
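
As a minimal sketch, the loop can be wrapped in a small helper so the maximum list length is a parameter. The function name `build_adj_list_patterns` and the `max_pairs` argument are made up for illustration; the token dictionaries are the ones from the question.

```python
from typing import Dict, List

TokenSpec = Dict[str, object]


def build_adj_list_patterns(max_pairs: int = 4) -> List[List[TokenSpec]]:
    """Build one Matcher pattern per list length (hypothetical helper).

    Each pattern is n copies of "ADJ, punctuation" followed by the fixed
    tail "ADJ CCONJ ADJ", for n = 0 .. max_pairs.
    """
    patterns = []
    for n_pairs in range(max_pairs + 1):
        pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * n_pairs
        pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
        patterns.append(pattern)
    return patterns


patterns = build_adj_list_patterns(max_pairs=4)
# Pattern lengths in tokens, shortest first.
print([len(p) for p in patterns])
```

The resulting list can be registered in one call, in the same form the question uses: `matcher.add("AdjList", patterns)`. Note that the shorter patterns also match inside longer lists (for example "strong and playful" inside "big, black, strong and playful"), so you will probably want to keep only the longest match, for instance by converting the matches to spans and passing them to `spacy.util.filter_spans`.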

Alternatively, you could use the dependency matcher. In your example sentence the parse is not that clean, but in a sentence like "It was a big, brown, playful dog", each adjective has a dependency arc connecting it directly to the noun.
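
A rough sketch of the dependency-based idea follows. It assumes spaCy 3 (the question uses spaCy 2, where this API differs) and, so that it runs without a trained model, it builds the `Doc` by hand: the POS tags, heads, and `amod` labels below are written out as an assumption about what a parser would produce, not actual parser output. The pattern name "NounWithAdj" is arbitrary.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocabulary only, no trained components

words = ["It", "was", "a", "big", ",", "brown", ",", "playful", "dog", "."]
pos = ["PRON", "AUX", "DET", "ADJ", "PUNCT", "ADJ", "PUNCT", "ADJ", "NOUN", "PUNCT"]
# Hand-written parse (assumed): every adjective attaches directly to "dog".
heads = [1, 1, 8, 8, 8, 8, 8, 8, 1, 1]
deps = ["nsubj", "ROOT", "det", "amod", "punct", "amod", "punct", "amod", "attr", "punct"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Anchor on a noun, then require an adjective that is its direct child.
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {
        "LEFT_ID": "noun",
        "REL_OP": ">",
        "RIGHT_ID": "adj",
        "RIGHT_ATTRS": {"POS": "ADJ", "DEP": "amod"},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("NounWithAdj", [pattern])

# Each match pairs the noun with one adjective; collect the adjectives.
matches = matcher(doc)
adjectives = sorted({doc[token_ids[1]].text for _, token_ids in matches})
print(adjectives)
```

Because each adjective is matched independently through its arc to the noun, the number of adjectives does not need to be fixed in advance. With a real pipeline you would replace the hand-built `Doc` with `nlp(text)` and check which dependency labels the model actually assigns.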

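As a separate note, your pattern does not handle sentences with the serial comma, such as "big, black, and strong", where a punctuation token comes directly before the conjunction.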
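
One possible way to cover the serial comma, sketched under the assumption that an optional punctuation token before the conjunction is acceptable: insert `{"IS_PUNCT": True, "OP": "?"}` ahead of each `CCONJ` token, using the Matcher's `"?"` operator (match zero or one time). The helper name `allow_serial_comma` is hypothetical.

```python
from typing import Dict, List

TokenSpec = Dict[str, object]


def allow_serial_comma(pattern: List[TokenSpec]) -> List[TokenSpec]:
    """Return a copy of the pattern with an optional punctuation token
    inserted before every CCONJ token (hypothetical helper)."""
    result = []
    for token in pattern:
        if token.get("POS") == "CCONJ":
            # "OP": "?" makes this token optional in the Matcher.
            result.append({"IS_PUNCT": True, "OP": "?"})
        result.append(dict(token))
    return result


base = [
    {"POS": "ADJ"},
    {"IS_PUNCT": True},
    {"POS": "ADJ"},
    {"POS": "CCONJ"},
    {"POS": "ADJ"},
]
with_serial_comma = allow_serial_comma(base)
print(with_serial_comma)
```

The modified pattern is meant to match both "big, black and strong" and "big, black, and strong"; the same transformation can be applied to every pattern generated in a loop.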
