Get the index of words between two columns in pandas

Posted 2025-01-19 08:14:59

I am checking which words the spaCy Spanish lemmatizer works on, using the .has_vector attribute. In one of the two columns of the DataFrame I have the output of the function that indicates which words can be lemmatized, and in the other one the corresponding phrase.

I would like to know how I can extract all the words that have a False output, so that I can correct them and then lemmatize them.

So I created the function:

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])

And applied it to the reviews column of the DataFrame:

df["Vectors"] = df.reviews.apply(lemmatizer)

And put the result in another DataFrame:

df2 = pd.DataFrame(df[['Vectors', 'reviews']])

The output is:

index             Vectors              reviews
  1     True True True False        'La pelicula es aburridora'


蓝眸 2025-01-26 08:14:59

Two ways to do this:

import pandas
import spacy

nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})

If you want to use has_vector:

def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]

Alternatively you can use the is_oov attribute:

def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]

Then as you already did:

df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)

Which will return:

                                  reviews              oov_words1              oov_words2
0  aaabbbcccc some example words xxxxyyyz  [aaabbbcccc, xxxxyyyz]  [aaabbbcccc, xxxxyyyz]
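If you would rather work from the Vectors column you already built, and you also want the position of each word, a minimal sketch could look like the following. The helper name `false_words` is made up for illustration. It assumes that splitting the phrase on whitespace yields the same tokens spaCy produced, which will not hold once the text contains punctuation or contractions, so it raises an error when the counts differ instead of guessing.

```python
from __future__ import annotations


def false_words(flags: str, phrase: str) -> list[tuple[int, str]]:
    """Return (position, word) pairs for every word whose flag is 'False'.

    Hypothetical helper: `flags` is the space-joined string of has_vector
    values and `phrase` is the matching review text.
    """
    flag_list = flags.split()
    words = phrase.split()
    # Whitespace splitting is only an approximation of spaCy's tokenizer,
    # so refuse to pair the two lists when they do not line up.
    if len(flag_list) != len(words):
        raise ValueError("flag count and word count differ; tokenization mismatch")
    return [
        (position, word)
        for position, (flag, word) in enumerate(zip(flag_list, words))
        if flag == "False"
    ]


print(false_words("True True True False", "La pelicula es aburridora"))
```

On the DataFrame from the question this could be applied row by row, for example with `df2.apply(lambda row: false_words(row.Vectors, row.reviews), axis=1)`. Collecting the words directly from the tokens, as in get_oov1 above, avoids the tokenization assumption altogether.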

Note:

With either approach, keep in mind that the result depends on the model: smaller models usually ship without word vectors, so these attributes have nothing behind them and return the same value for every word.

That means when you run the exact same code but e.g. with en_core_web_sm you get this:

                                  reviews  oov_words1                                    oov_words2
0  aaabbbcccc some example words xxxxyyyz          []  [aaabbbcccc, some, example, words, xxxxyyyz]

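That happens because the small model has no word vectors behind these attributes, so neither value is really set by the model. Judging by the output above, has_vector comes back True for every token (so oov_words1 is empty) and is_oov also comes back True for every token (so oov_words2 contains everything). In other words, with has_vector it wrongly shows all words as known, and with is_oov it wrongly shows all words as unknown.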
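One way to avoid being misled is to check whether the loaded pipeline has vectors at all before trusting either attribute. The sketch below is only an illustration: the helper names are made up, and it assumes that `nlp.vocab.vectors_length` (the vector dimensionality reported by spaCy's Vocab) is 0 for a model that ships without word vectors.

```python
def has_static_vectors(nlp) -> bool:
    """Return True if the pipeline appears to ship a word-vector table.

    Assumption: vocab.vectors_length is 0 when the model has no vectors.
    """
    return nlp.vocab.vectors_length > 0


def get_oov_checked(nlp, text):
    """Like get_oov2, but refuse to answer when the model has no vectors."""
    if not has_static_vectors(nlp):
        raise ValueError("this model has no word vectors; is_oov would be meaningless")
    return [token.text for token in nlp(text) if token.is_oov]
```

With a pipeline loaded through `spacy.load(...)`, you would call `get_oov_checked(nlp, text)` inside `apply` in place of get_oov2, passing the pipeline explicitly.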
