scikit-learn TF-IDF: unsure how to interpret the TF-IDF array
I have a subset of a dataframe like:
PageNumber  Top_words_only
56          people sun flower festival
75          sunflower sun architecture red buses festival
I want to calculate TF-IDF on the Top_words_only df column (the English tags), with each row acting as a document. I have tried:
from sklearn.feature_extraction.text import TfidfVectorizer
Vectorizer = TfidfVectorizer(lowercase=True, max_df=0.8, min_df=5, stop_words='english')
Vectors = Vectorizer.fit_transform(df['Top_words_only'])
If I print the array it comes out as:
array([[0.        , 0.        , 0.        , ..., 0.        , 0.35588179,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
But I am a little confused by what this means. Why are there so many 0 values? Does TfidfVectorizer() automatically calculate the TF-IDF value for each tag, taking into account all documents (i.e. the corpus)?
Comments (1)
Calling fit_transform calculates a vector for each supplied document. Each vector is the same size: the number of unique words across all the supplied documents (after any stop-word, min_df and max_df filtering). The number of zeros in a vector is the vector size minus the number of unique words in that document. The IDF part of each value is computed over all the documents passed to fit_transform, so yes, the whole corpus is taken into account.

Using your top_words as a simple example, you show 2 documents: 'people sun flower festival' and 'sunflower sun architecture red buses festival'. These have a total of 8 unique words (Vectorizer.get_feature_names_out() will give you these).

Calling fit_transform with those 2 documents gives 2 vectors (1 for each doc), each of length 8 (the number of unique words across the documents). The first document, 'people sun flower festival', has 4 words, so you get 4 values in the vector and 4 zeros. Similarly, 'sunflower sun architecture red buses festival' gives 6 values and 2 zeros.

The more documents you pass in with different words, the longer the vector gets and the more zeros each row contains.
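To make the counting concrete, here is a minimal sketch that rebuilds the two-document example by hand in plain Python. It is not scikit-learn's code. It assumes the commonly documented defaults (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalisation) and uses simple whitespace splitting in place of the library's tokenizer. The names tokenized, vocab, doc_freq, idf and tfidf_row are made up for this illustration.

```python
import math
from collections import Counter

docs = [
    "people sun flower festival",
    "sunflower sun architecture red buses festival",
]

# Simple whitespace tokenisation (an assumption; the library uses a regex tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every unique word across ALL documents, sorted alphabetically.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents each word appears.
n_docs = len(tokenized)
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed IDF, assumed to mirror the library default: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    """Return one L2-normalised TF-IDF vector with a slot for every vocab word."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]  # 0 for words not in this doc
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)
for doc, row in zip(docs, rows):
    zeros = sum(1 for v in row if v == 0)
    print(f"{doc!r}: length {len(row)}, {len(row) - zeros} non-zero, {zeros} zero")
```

Both rows have length 8; the first has 4 zeros and the second has 2. Words that appear in both documents (sun, festival) get a lower IDF than words that appear in only one, which is the corpus-wide part of the calculation. On a real data frame the vocabulary usually runs to hundreds or thousands of words while each row contains only a handful of them, so most entries in the printed array are 0.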