How do I convert a list of vectors into a NumPy array to train a classifier in Python?

Posted 2025-01-28 09:33:39

I have a pandas data frame that looks like this:

                          corpus             tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  60
1   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  73
2   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  61

My desired output is this:

                           corpus            tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    60
1   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    73
2   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    61

I want to unlist the column tfidf in order to create a numpy array to train a decision tree classifier.

x = df['tfidf'].values
y = df['labels'].values

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

When I tried the code above I got an error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-103-8aa769130bba> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
----> 3 classifier.fit(x_train, y_train)

What can I do to get the data frame ready for training?

少年亿悲伤 2025-02-04 09:33:39

You can explode the lists from the tfidf column into multiple rows and then cast these values to a NumPy array, reshaping it appropriately:

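import numpy as np

n_rows = df.shape[0]
# Use positional access so this also works when the index does not start at 0.
n_cols = len(df['tfidf'].iloc[0])

# explode() yields one value per row; reshape them back into (n_rows, n_cols).
X = np.array(df['tfidf'].explode().values,
             dtype='float').reshape(n_rows, n_cols)
X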

array([[0.  , 0.  , 0.  , 0.01, 0.8 ],
       [0.  , 0.  , 0.  , 0.01, 0.8 ],
       [0.  , 0.  , 0.  , 0.01, 0.8 ]])
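
The reshape above only works if every list in the column has the same length. Here is a minimal sketch of the same explode-and-reshape idea with that check made explicit. The toy frame mirrors the question's data, and the names lengths, X and y are my own choices, not anything from the original post:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question (illustrative values only).
df = pd.DataFrame({
    "corpus": ["dfnkdfnkf asdfhedfh ajdladja"] * 3,
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3,
    "labels": [60, 73, 61],
})

# Series.str.len() also works on list cells, so it gives the vector length per row.
lengths = df["tfidf"].str.len()
if lengths.nunique() != 1:
    raise ValueError("tfidf lists have different lengths; cannot reshape")

n_rows = len(df)
n_cols = int(lengths.iloc[0])

# One value per exploded row, cast to float, then folded back into a 2-D matrix.
X = df["tfidf"].explode().to_numpy(dtype=float).reshape(n_rows, n_cols)
y = df["labels"].to_numpy()
```

With ragged lists the guard raises early with a readable message, rather than failing inside reshape.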

九局 2025-02-04 09:33:39

In the first display

                          corpus             tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    [0.0, 0.0, 0.0, 0.01, 0.8]  60
 ...

the values in the tfidf column are Python lists (though strings and NumPy arrays display the same way).
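
Since the printed frame cannot tell you which of these you have, it can help to look at the cell types directly. A small sketch, assuming a made-up frame whose three columns (as_list, as_str, as_array are hypothetical names) hold the same vector in each form:

```python
import numpy as np
import pandas as pd

cell = [0.0, 0.0, 0.0, 0.01, 0.8]

# The same vector stored three ways; the column names are invented for illustration.
df = pd.DataFrame({
    "as_list": [cell, cell],
    "as_str": [str(cell), str(cell)],
    "as_array": [np.array(cell), np.array(cell)],
})

# Map every cell to its Python type and collect the distinct types per column.
kinds = {col: df[col].map(type).unique().tolist() for col in df.columns}
```

Here kinds maps each column name to the list of Python types found in it, which shows whether a column needs parsing (strings) or only stacking (lists or arrays).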

df[col].values will produce a 1-D object-dtype array containing these lists.

x = np.stack(df[col].values) will turn that into a 2-D float array, provided every list has the same length and holds only numbers.
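
To make the difference concrete, here is a rough sketch that compares the raw object array with the stacked one and then fits the classifier from the question. The four rows of data are made up for illustration, and the names raw, clf and pred are mine:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: four equal-length tf-idf vectors with labels.
df = pd.DataFrame({
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8],
              [0.1, 0.0, 0.3, 0.0, 0.2],
              [0.0, 0.5, 0.0, 0.0, 0.1],
              [0.4, 0.0, 0.0, 0.2, 0.0]],
    "labels": [60, 73, 61, 60],
})

# 1-D array of Python lists: this is what makes classifier.fit() fail.
raw = df["tfidf"].values

# Stack the lists row by row into an (n_samples, n_features) float matrix.
X = np.stack(raw)
y = df["labels"].to_numpy()

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```

In the question's code the same conversion would go before train_test_split, so that x_train and x_test are already 2-D float arrays.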

The second, "unlisted" display is not valid, unless the cells are strings and you strip the [] off them.
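
If the cells do turn out to be strings (for example after a round trip through CSV), they have to be parsed before stacking. A minimal sketch of two ways to do that, assuming every string holds a bracketed, comma-separated list of numbers. The series as_text and the other variable names are hypothetical:

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical column in which each vector was saved as text.
as_text = pd.Series([
    "[0.0, 0.0, 0.0, 0.01, 0.8]",
    "[0.1, 0.2, 0.0, 0.0, 0.5]",
])

# Option 1: parse each string back into a Python list, then stack.
parsed = as_text.apply(ast.literal_eval)
X_from_lists = np.stack(parsed.values)

# Option 2: strip the brackets and split on commas by hand.
stripped = as_text.str.strip("[]")
X_from_split = np.array(
    [[float(v) for v in row.split(",")] for row in stripped]
)
```

Both give the same 2-D float array. ast.literal_eval is the more forgiving of the two when the text contains extra whitespace or nested brackets.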

                           corpus            tfidf            labels
0   dfnkdfnkf asdfhedfh ajdladja    0.0, 0.0, 0.0, 0.01, 0.8    60

Dataframes with list or array elements are something of an anomaly, and many beginner users aren't prepared to deal with them. Frames are easiest to work with when the cell values are strings or numbers. But even strings are stored as Python objects.
