Get the index of words between two columns in pandas
I am using the .has_vector method to check which words the spaCy Spanish lemmatizer works on. The DataFrame has two columns: one holds the output of the function, which indicates which words can be lemmatized, and the other holds the corresponding phrase.
I would like to know how to extract all the words with a False output so that I can correct them and then lemmatize.
So I created this function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the reviews column of the DataFrame:
df["Vectors"] = df.reviews.apply(lemmatizer)
Then I put both columns in another DataFrame:
df2 = pd.DataFrame(df[['Vectors', 'reviews']])
The output is:
index  Vectors               reviews
1      True True True False  'La pelicula es aburridora'
Comments (1)
Two ways to do this:
If you want to use has_vector:
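As a minimal sketch of the has_vector route (not necessarily the code this answer originally showed), the helper below collects the words whose `has_vector` flag is False. The name `words_without_vectors` and the choice to pass the pipeline in as a parameter are assumptions made for illustration; `nlp` stands for whatever pipeline is already loaded in the question.

```python
def words_without_vectors(nlp, text):
    """Return the words in `text` for which the pipeline has no word vector.

    `nlp` is assumed to be a loaded spaCy pipeline; any callable that returns
    tokens exposing `.text` and `.has_vector` behaves the same way here.
    """
    doc = nlp(text)
    # Keep only the tokens whose has_vector flag is False.
    return [token.text for token in doc if not token.has_vector]
```

Returning the words themselves, rather than a string of flags, gives the list of words to correct directly.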
Alternatively you can use the is_oov attribute:
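A comparable sketch for the is_oov route, again only an illustration under the same assumptions: `oov_words` is a hypothetical name, and `nlp` is the pipeline from the question. This variant also keeps each word's position in the sentence, since the question asks for an index.

```python
def oov_words(nlp, text):
    """Return (position, word) pairs for tokens marked as out-of-vocabulary.

    `nlp` is assumed to be a loaded spaCy pipeline; any callable that returns
    tokens exposing `.text` and `.is_oov` behaves the same way here.
    """
    doc = nlp(text)
    # enumerate gives the 0-based position of each token in the text.
    return [(i, token.text) for i, token in enumerate(doc) if token.is_oov]
```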
Then as you already did:
Which will return:
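For the sample row in the question, the fourth flag is False, so the word to extract is aburridora. As a minimal sketch that needs no model, the helper below (hypothetical name `flagged_words`) pairs the flags already stored in the Vectors column with the words of the review. It assumes whitespace-separated words line up one-to-one with spaCy's tokens, which holds for this row but can break when punctuation is tokenized separately.

```python
def flagged_words(flags, review):
    """Pair each stored flag with the word at the same position in the review.

    Assumes one whitespace-separated flag per whitespace-separated word.
    Returns (position, word) for every flag equal to "False".
    """
    flag_list = flags.split()
    words = review.split()
    if len(flag_list) != len(words):
        # The tokenizer split the text differently from str.split
        # (e.g. punctuation), so the positions cannot be trusted.
        raise ValueError("flags and words do not line up")
    return [(i, w) for i, (f, w) in enumerate(zip(flag_list, words)) if f == "False"]


print(flagged_words("True True True False", "La pelicula es aburridora"))
# prints [(3, 'aburridora')]
```

On the question's frame this could be applied row by row, for example with `df2.apply(lambda row: flagged_words(row["Vectors"], row["reviews"]), axis=1)`. Positions are 0-based.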
Note:
With either approach, it is important to know that the result is model dependent. Smaller models usually provide no data behind these attributes, so they always return a default value!
That means that when you run the exact same code with, e.g., en_core_web_sm, you get this:
This is because has_vector has a default value of False and is not set by the model. is_oov has a default value of True and is not set by the model either. So with has_vector the model wrongly shows all words as unknown, and with is_oov it wrongly shows all words as known.
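One way to guard against this, sketched here as an assumption rather than as part of the original answer, is to check whether the loaded pipeline carries a vector table at all before trusting either attribute. The helper name `has_vector_table` is hypothetical; it relies on `nlp.vocab.vectors.shape` being a (rows, dims) tuple, as on a spaCy pipeline.

```python
def has_vector_table(nlp):
    """Return True if the pipeline's vocabulary holds at least one word vector.

    Assumes `nlp.vocab.vectors.shape` is a (rows, dims) tuple, as on a
    spaCy pipeline.
    """
    rows, _dims = nlp.vocab.vectors.shape
    # An empty table means per-word flags cannot reflect real vocabulary coverage.
    return rows > 0
```

Calling this once before relying on has_vector or is_oov, and switching to a larger pipeline when it returns False, avoids acting on default values.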