scikit-learn TF-IDF - unsure how to interpret the TF-IDF array?

Posted on 2025-01-27 21:31:45

I have a subset of a dataframe like:

<OUT>
PageNumber    Top_words_only
56            people sun flower festival 
75            sunflower sun architecture red buses festival

I want to calculate TF-IDF on the Top_words_only df column, with each row acting as a document. I have tried:

Vectorizer = TfidfVectorizer(lowercase = True, max_df = 0.8, min_df = 5, stop_words = 'english')
Vectors = Vectorizer.fit_transform(df['top_words_only'])

If I print the array it comes out as:

array([[0.        , 0.        , 0.        , ..., 0.        , 0.35588179,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

But I am a little confused by what this means - why are there so many 0 values? Does TfidfVectorizer() automatically calculate the TF-IDF value for each tag taking into account all documents (i.e. the corpus)?

Comments (1)

半仙 2025-02-03 21:31:45

Calling fit_transform calculates a vector for each supplied document. Each vector is the same size: the number of unique words across all the supplied documents. The number of zero values in a vector is therefore the vector size minus the number of unique words in that particular document.

Using your top_words as a simple example, you show 2 documents:

'people sun flower festival'
'sunflower sun architecture red buses festival'

These have a total of 8 unique words (Vectorizer.get_feature_names_out() will give you these):

'architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower'

Calling fit_transform with those 2 documents will give 2 vectors (1 for each doc), each of length 8 (the number of unique words across the documents).

The first document, 'people sun flower festival', has 4 unique words, so you get 4 nonzero values in its vector and 4 zeros. Similarly, 'sunflower sun architecture red buses festival' gives 6 nonzero values and 2 zeros.

The more documents you pass in with different words, the longer each vector gets, and the more zeros each vector will contain.
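As a side note, fit_transform returns a scipy sparse matrix rather than a plain array, so the zeros are not even stored. A quick sketch of counting the stored (nonzero) entries per row:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['people sun flower festival',
        'sunflower sun architecture red buses festival']

vectors = TfidfVectorizer().fit_transform(docs)  # scipy CSR sparse matrix

# Only the nonzero entries are actually stored, so the zero count per row
# can be derived from that row's .nnz (number of stored nonzero values).
n_docs, n_features = vectors.shape               # (2, 8) here
for i in range(n_docs):
    nonzero = vectors[i].nnz
    print(f'doc {i}: {nonzero} nonzero, {n_features - nonzero} zero')
```

With the two example documents this reports 4 nonzero / 4 zero and 6 nonzero / 2 zero, matching the counts above.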

from sklearn.feature_extraction.text import TfidfVectorizer

top_words = ['people sun flower festival',
             'sunflower sun architecture red buses festival']

Vectorizer = TfidfVectorizer()
Vectors = Vectorizer.fit_transform(top_words)   # sparse matrix, shape (2, 8)

print(f'Feature names: {Vectorizer.get_feature_names_out().tolist()}')
tfidf = Vectors.toarray()                       # densify for easy printing
print()
print(f'top_words[0] = {top_words[0]}')
print(f'tfidf[0] = {tfidf[0].tolist()}')
print()
print(f'top_words[1] = {top_words[1]}')
print(f'tfidf[1] = {tfidf[1].tolist()}')

The above code will print:

Feature names: ['architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower']

top_words[0] = people sun flower festival
tfidf[0] = [0.0, 0.0, 0.40993714596036396, 0.5761523551647353, 0.5761523551647353, 0.0, 0.40993714596036396, 0.0]

top_words[1] = sunflower sun architecture red buses festival
tfidf[1] = [0.4466561618018052, 0.4466561618018052, 0.31779953783628945, 0.0, 0.0, 0.4466561618018052, 0.31779953783628945, 0.4466561618018052]
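For completeness, the individual numbers come from scikit-learn's default TF-IDF settings (smooth_idf=True, norm='l2', raw term counts): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is how many documents contain term t, with each document vector then l2-normalised. A small sketch reproducing two of the tfidf[0] values by hand:

```python
import math

# scikit-learn's default smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
n = 2
idf_both = math.log((1 + n) / (1 + 2)) + 1  # 'festival'/'sun', in both docs -> 1.0
idf_one = math.log((1 + n) / (1 + 1)) + 1   # 'flower'/'people', in one doc

# doc 0 = 'people sun flower festival': each term appears once, so the
# unnormalised weights are just the idf values of its 4 terms.
raw = [idf_both, idf_one, idf_one, idf_both]  # festival, flower, people, sun
norm = math.sqrt(sum(w * w for w in raw))     # l2 normalisation divisor

print(raw[0] / norm)  # 'festival' -> 0.40993714596036396
print(raw[1] / norm)  # 'flower'   -> 0.5761523551647353
```

Note that 'festival' and 'sun' get a lower weight than 'flower' and 'people' precisely because they occur in both documents, so their idf is smaller.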