How can I process each sentence to find and replace matched words with their synonyms?



I am currently working with spaCy and have a corpus (containing 960,256 words) that looks like this:

['The EMD F7 was a 1,500 horsepower (1,100 kW) Diesel-electric locomotive produced between February 1949 and December 1953 by the Electro-Motive Division of General Motors (EMD) and General Motors Diesel (GMD). ',
 'Third stream ',
 "Gil Evans' influence ",
 'The horn in the spotlight ',
 'Contemporary horn in jazz ']

I have a function that looks for synonyms of a word (using spaCy):

def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    queries = [
        w for w in word.vocab 
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
  
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_,w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]

It returns a list of (word, similarity) tuples like this:

[('dogs', 0.8835931), ('puppy', 0.85852146), ('pet', 0.8057451)]
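
For reference, an output like the one above would come from a call along these lines. This is only an illustrative sketch; it assumes nlp was loaded from a pipeline that ships word vectors (e.g. en_core_web_lg) and that numpy is imported as np:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # any pipeline that ships word vectors

print(most_similar("dog", topn=3))
# e.g. [('dogs', 0.88), ('puppy', 0.86), ('pet', 0.81)] -- exact scores depend on the model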

Then, I have a method that replaces one word with another, like this:

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

It works by taking the original text and the replacement word, like so:

replace_word("Hi this dog is my dog.", "Simba")

And the output is simply the sentence with the word replaced:

Hi this Simba is my Simba.

Before this works, a Matcher has to be defined, for example:

matcher = Matcher(nlp.vocab)
matcher.add("dog", None, [{"LOWER": "dog"}])

or by adding patterns such as:

patterns = [
[{"LOWER": "amazing"}, {"LOWER": "anger"}, {"LOWER": "angry"}, {"LOWER": "answer"}, {"LOWER": "ask"}, {"LOWER": "awful"}, {"LOWER": "bad"}]
]
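
(A side note on this pattern: as written, it is a single pattern, so it only matches those seven tokens appearing consecutively in that order. If the goal is to match any one of the words on its own, each word needs its own pattern; a hypothetical sketch, which could then be added with matcher.add as above:)

patterns = [[{"LOWER": w}] for w in ["amazing", "anger", "angry", "answer", "ask", "awful", "bad"]]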

What I want is to grab the corpus, feed it sentence by sentence and word by word to most_similar so I can save the list of words to replace, and then do the replacement using replace_word. The thing is that I'm not sure how to do this. I've tried for a while but it always fails somehow (either it won't take batches, so I can't do it all at once, or the words end up as empty vectors if I simply split each sentence with .split(" ")), so... could you help me out please?


乙白 2025-02-09 02:48:48


I hope I understood what you need correctly. I'm guessing you want to:

  1. Iterate over a corpus
  2. Find specific tokens using the matcher
  3. Find synonyms of the matched tokens
  4. Return a new list of sentences but with the replaced tokens.

If that's the case, then the first thing you need is a working similarity function. The one in the question didn't work properly for me (possibly because, without the optional lexeme lookup tables, w.prob falls back to a default value in spaCy 3, so the prob filter discards everything), but you can try this one:

def most_similar(word, topn):
    words = []
    # Look up the hash of the query word in the vocab's string store
    target = nlp.vocab.strings[word]
    if target in nlp.vocab.vectors:
        # Vectors.most_similar expects a batch of vectors and returns (keys, best_rows, scores)
        synonyms = nlp.vocab.vectors.most_similar(np.asarray([nlp.vocab.vectors[target]]), n=topn)
        # Map the returned keys back to strings, dropping the query word itself
        words = [nlp.vocab.strings[w].lower() for w in synonyms[0][0] if nlp.vocab.strings[w].lower() != word.lower()]
    return words
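
Note that, unlike the function in the question, this version returns plain lowercase strings rather than (word, score) tuples, and it filters out the query word itself. A quick sanity check (a sketch; it assumes the nlp and numpy setup from the final snippet below):

print(most_similar("dog", topn=4))
# e.g. ['dogs', 'puppy', 'pet'] -- exact results depend on the model's vectors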

You also mentioned that you want this to run on a corpus. I recommend that you use the nlp.pipe() method for performance, combined with a custom extension attribute (Doc.set_extension). You can do it like this:

# First of all we create a component to add to the pipe
@Language.component("synonym_replacer")
def synonym_replacer(doc):
    if not Doc.has_extension("synonyms"):
        Doc.set_extension("synonyms", default=[])
    doc._.synonyms.extend(list(replace_synonyms(doc, 4)))
    return doc

# Yields a new Doc for each (matched token, synonym) pair, with the match replaced
def replace_synonyms(doc, topn):
    for sent in doc.sents:
        matches = matcher(sent)
        for _, start, end in matches:
            span = sent[start:end]
            syns = most_similar(span.text, topn)
            for syn in syns:
                yield nlp.make_doc(sent[:start].text_with_ws + f"{syn} " + sent[end:].text_with_ws)

Now that you have all of your functions, you can prepare the pipeline and run it over your whole corpus:

# Imports needed by all of the snippets above (place them at the top of your script)
import numpy as np
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("synonym_replacer")
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "dog"}]]
matcher.add("dog", patterns)

corpus = ["I have a great dog", "Hi this dog is my dog."]
docs = nlp.pipe(corpus)

for doc in docs:
    print(doc.text)
    print(doc._.synonyms)
    print("****")

# Output:
# I have a great dog
# [I have a great dogs , I have a great puppy , I have a great pet ]
# ****
# Hi this dog is my dog.
# [Hi this dogs is my dog., Hi this puppy is my dog., Hi this pet is my dog.,
#  Hi this dog is my dogs ., Hi this dog is my puppy ., Hi this dog is my pet .]
# ****
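
A follow-up note on scale: for a corpus of roughly 960k words you probably don't want to process one text at a time. nlp.pipe already streams and batches the texts; batch_size and n_process are spaCy options, but the values below are only illustrative, not tuned:

# Stream the whole corpus through the pipeline in batches.
# n_process can be raised for multiprocessing, but keep it at 1 if the
# custom component causes pickling issues on your platform.
for doc in nlp.pipe(corpus, batch_size=256, n_process=1):
    print(doc._.synonyms)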