How can I process each sentence of a corpus to find words and replace them with matching synonyms?
I am currently working with spaCy and have a corpus (containing 960,256 words) that looks like this:
['The EMD F7 was a 1,500 horsepower (1,100 kW) Diesel-electric locomotive produced between February 1949 and December 1953 by the Electro-Motive Division of General Motors (EMD) and General Motors Diesel (GMD). ',
'Third stream ',
"Gil Evans' influence ",
'The horn in the spotlight ',
'Contemporary horn in jazz ']
I have a function that looks for the synonyms of a word (using spaCy):
def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]
which returns a list of (word, similarity) tuples, like so:
[('dogs', 0.8835931), ('puppy', 0.85852146), ('pet', 0.8057451)]
Then, I have a method that replaces one word with another, like this:
def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, add those in (with trailing whitespace if available)
            text += tok[buffer_start:match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text
It works by taking the sentence to modify and the replacement word, like so:
replace_word("Hi this dog is my dog.", "Simba")
And the output is simply the sentence with the word replaced:
Hi this Simba is my Simba.
Before it can work, a Matcher has to be defined, such as this:
matcher = Matcher(nlp.vocab)
matcher.add("dog", None, [{"LOWER": "dog"}])
or by adding a list of patterns, such as:
patterns = [
    [{"LOWER": "amazing"}, {"LOWER": "anger"}, {"LOWER": "angry"}, {"LOWER": "answer"}, {"LOWER": "ask"}, {"LOWER": "awful"}, {"LOWER": "bad"}]
]
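Note that a single pattern is a sequence of token dicts, so the seven dicts above would only match those seven words appearing consecutively. To match each word on its own, one single-token pattern per word is needed; a sketch (using the spaCy v3 Matcher.add() signature, where the question's v2 form is add(key, None, *patterns)):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # a blank English pipeline is enough to demonstrate matching
matcher = Matcher(nlp.vocab)

words = ["amazing", "anger", "angry", "answer", "ask", "awful", "bad"]
# One single-token pattern per word, so each word matches independently
patterns = [[{"LOWER": w}] for w in words]
matcher.add("targets", patterns)  # spaCy v3 signature

doc = nlp("This is an amazing but awful answer")
print([doc[s:e].text for _, s, e in matcher(doc)])  # ['amazing', 'awful', 'answer']
```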
What I want is to grab the corpus, feed it sentence by sentence and word by word to most_similar, save the resulting list of replacement words, and then apply them with replace_word. The thing is, I'm not sure how to do this. I've tried for a while, but it always fails somehow: either it won't take batches (so I can't do it all at once), or the words end up with empty vectors if I simply split each sentence with .split(" ").
So... could you help me out, please?
I hope I understood what you need correctly. I'm guessing you want to find a synonym for each word in your corpus and then replace each word with that synonym.
If that's the case, then what you need first is a working similarity function (I tried the one above, but it didn't work properly for me), so you can try this:
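The code block that originally followed here appears to have been lost. As a stand-in, here is a self-contained sketch of the underlying mechanism (cosine similarity over word vectors) using a small hypothetical vector table, so it runs without downloading a model; with a real spaCy model you would score against the vocab's vectors instead:

```python
import numpy as np

# Hypothetical toy vector table standing in for spaCy's vocab vectors
# (values are made up for illustration only)
VECTORS = {
    "dog":   np.array([1.0, 0.0]),
    "puppy": np.array([0.9, 0.1]),
    "cat":   np.array([0.5, 0.5]),
    "car":   np.array([0.0, 1.0]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the vectors over the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, topn=5):
    query = VECTORS[word]
    # Score every other word against the query, then sort by similarity
    scored = [
        (other, cosine(query, vec))
        for other, vec in VECTORS.items()
        if other != word
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]

print(most_similar("dog", topn=2))  # 'puppy' ranks first
```

With a vector-equipped model, spaCy also exposes nlp.vocab.vectors.most_similar(), which can serve the same purpose far more efficiently than looping over the whole vocab.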
You also mentioned that you want this to run on a corpus. I recommend that you use the nlp.pipe() method for performance gains combined with the set_extension method. You can do it like this:
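The accompanying code block is missing here as well. A minimal sketch of that wiring (nlp.pipe() for batched processing plus a custom extension attribute), using a blank pipeline and a hypothetical SYNONYMS table in place of a real most_similar() lookup:

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")  # a blank English pipeline is enough to show the wiring

# Hypothetical synonym table standing in for real most_similar() results
SYNONYMS = {"dog": "puppy", "happy": "glad"}

# Register a custom attribute; the getter runs lazily whenever token._.synonym is read
Token.set_extension("synonym", getter=lambda t: SYNONYMS.get(t.lower_), force=True)

texts = ["The dog is happy.", "Third stream"]
for doc in nlp.pipe(texts):  # nlp.pipe processes the texts in batches for speed
    for token in doc:
        if token._.synonym:
            print(token.text, "->", token._.synonym)
```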
Now that you have all of your functions, you can prepare your pipeline and execute it on your whole corpus:
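The final code block is likewise missing. A sketch of an end-to-end run, combining the question's replace_word() with a Matcher built from the synonym table (blank pipeline, hypothetical one-entry SYNONYMS table, spaCy v3 Matcher.add() signature):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical synonym table standing in for most_similar() results
SYNONYMS = {"dog": "puppy"}

# One single-token pattern per target word
matcher.add("targets", [[{"LOWER": w}] for w in SYNONYMS])

def replace_word(orig_text, replacement):
    # Same buffer-based replacement as in the question
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # add skipped tokens back in, with trailing whitespace
            text += tok[buffer_start:match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # swap in the replacement
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

corpus = ["Hi this dog is my dog.", "Third stream"]
# nlp.pipe batches the corpus; each doc is then rewritten
replaced = [replace_word(doc.text, SYNONYMS["dog"]) for doc in nlp.pipe(corpus)]
print(replaced[0])  # Hi this puppy is my puppy.
```

In a real run you would look up each matched token's synonym (e.g. the top most_similar() result) instead of using a fixed one-entry table.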