How do I convert a list of vectors into a NumPy array to train a classifier in Python?
I have a pandas data frame that looks like this:
   corpus                        tfidf                       labels
0  dfnkdfnkf asdfhedfh ajdladja  [0.0, 0.0, 0.0, 0.01, 0.8]  60
1  dfnkdfnkf asdfhedfh ajdladja  [0.0, 0.0, 0.0, 0.01, 0.8]  73
2  dfnkdfnkf asdfhedfh ajdladja  [0.0, 0.0, 0.0, 0.01, 0.8]  61
My desired output is this:
   corpus                        tfidf                     labels
0  dfnkdfnkf asdfhedfh ajdladja  0.0, 0.0, 0.0, 0.01, 0.8  60
1  dfnkdfnkf asdfhedfh ajdladja  0.0, 0.0, 0.0, 0.01, 0.8  73
2  dfnkdfnkf asdfhedfh ajdladja  0.0, 0.0, 0.0, 0.01, 0.8  61
I want to unlist the column tfidf in order to create a numpy array to train a decision tree classifier.
x = df['tfidf'].values
y = df['labels'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
When I tried the code above I got an error:
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-103-8aa769130bba> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
----> 3 classifier.fit(x_train, y_train)
What can I do to get the data frame ready for training?
Comments (2)
You can explode the lists from the tfidf column into multiple rows and then cast these values to a NumPy array, reshaping it appropriately.
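The explode-then-reshape idea might look something like the sketch below. It rebuilds the frame from the question as toy data and assumes every list in tfidf has the same length; the variable names `n_features`, `x` and `y` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the one in the question.
df = pd.DataFrame({
    "corpus": ["dfnkdfnkf asdfhedfh ajdladja"] * 3,
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3,
    "labels": [60, 73, 61],
})

# Assumption: all vectors share one length, taken from the first row.
n_features = len(df["tfidf"].iloc[0])

# explode() puts each list element on its own row (object dtype),
# so cast to float and fold the flat values back into one row per sample.
x = df["tfidf"].explode().to_numpy(dtype=float).reshape(-1, n_features)
y = df["labels"].to_numpy()

print(x.shape)  # one row per sample, one column per tf-idf feature
```

If the lists can differ in length, the reshape will fail or misalign rows, so this only suits fixed-length vectors.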
In the first display, the values in the tfidf column are Python lists (though strings and NumPy arrays display the same way). df[col].values will produce a 1-D object-dtype array containing these lists. x = np.stack(df[col].values) has a chance of turning that into a 2-D float-dtype array.

The second, "unlisted" display is not valid, unless you strip the [] off string elements.

DataFrames with list or array elements are something of an anomaly, and many beginner users aren't prepared to deal with them. Frames are easiest to work with when the cell values are strings or numbers. But even strings are stored as Python objects.
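As a rough illustration of the np.stack route, the sketch below rebuilds the question's frame as toy data and fits the same classifier on the stacked array. The helper `to_matrix` is a hypothetical name, not part of pandas or NumPy, and its string-parsing branch is an assumption about how such cells might look after a round trip through CSV.

```python
import ast

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy frame mirroring the one in the question.
df = pd.DataFrame({
    "corpus": ["dfnkdfnkf asdfhedfh ajdladja"] * 3,
    "tfidf": [[0.0, 0.0, 0.0, 0.01, 0.8]] * 3,
    "labels": [60, 73, 61],
})


def to_matrix(column):
    """Hypothetical helper: stack equal-length vectors into a 2-D float array."""
    # Assumption: string cells look like "[0.0, 0.8]" and can be parsed as literals.
    rows = [ast.literal_eval(v) if isinstance(v, str) else v for v in column]
    return np.stack(rows).astype(float)


raw = df["tfidf"].values   # 1-D object array whose elements are lists
x = to_matrix(raw)         # 2-D float array, one row per sample
y = df["labels"].values

classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(x, y)       # no TypeError now that x is numeric
```

Stacking only works when every vector has the same length; ragged lists would need padding or a different representation first.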