将 pandas 数据帧中的字符串（类别）数组转换为 int 数组

发布于 2024-12-10 15:13:31 字数 1850 浏览 0 评论 0原文

我正在尝试做一些与上一个问题但我收到错误。我有一个包含特征、标签的 pandas 数据框，我需要进行一些转换以将特征和标签变量发送到机器学习对象中：

import pandas
import milk
from scikits.statsmodels.tools import categorical

然后我有：

trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'

labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets

输出控制台首先产生：

features
[[ 0.38846334  0.97681855]
[ 3.8318634   0.5724734 ]
[ 0.67710876  1.01816444]
[ 1.12024943  0.91508699]
[ 7.51749674  1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]

我遇到以下错误：

Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'

是否可以将数据框中的类别变量“type”转换为 int 类型？ “type”可以采用值“single”、“touching”、“nuclei”、“dusts”，我需要使用 int 值进行转换，例如 0、1、2、3。

原文

I am trying to do something very similar to that previous question but I get an error.
I have a pandas dataframe containing features,label I need to do some convertion to send the features and the label variable into a machine learning object:

import pandas
import milk
from scikits.statsmodels.tools import categorical

then I have:

trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'

labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets

The output console yields first:

features
[[ 0.38846334  0.97681855]
[ 3.8318634   0.5724734 ]
[ 0.67710876  1.01816444]
[ 1.12024943  0.91508699]
[ 7.51749674  1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]

I meet then the following error:

Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'

Is it possible to convert the category variable 'type' within the dataframe into int type ? 'type' can take the values 'single', 'touching','nuclei','dusts' and I need to convert with int values such 0, 1, 2, 3.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

逆光下的微笑 2024-12-17 15:13:31

之前的答案已经过时了，所以这里有一个将字符串映射到数字的解决方案，适用于 Pandas 0.18.1 版本。

对于系列：

In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
                       'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')

对于数据框：

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei', 
                       'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]), 
        Index([u'single', u'touching', u'nuclei', u'dusts'],
        dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
         labels  labels_enc
    0    single           0
    1  touching           1
    2    nuclei           2
    3     dusts           3
    4  touching           1
    5    single           0
    6    nuclei           2

The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.

For a Series:

In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
                       'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')

For a DataFrame:

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei', 
                       'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]), 
        Index([u'single', u'touching', u'nuclei', u'dusts'],
        dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
         labels  labels_enc
    0    single           0
    1  touching           1
    2    nuclei           2
    3     dusts           3
    4  touching           1
    5    single           0
    6    nuclei           2

回复收藏 0 原文

伪装你 2024-12-17 15:13:31

如果您有一个字符串或其他对象的向量，并且想要为其提供分类标签，则可以使用 Factor 类（在 pandas 命名空间中提供）：

In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

In [2]: s
Out[2]: 
0    single
1    touching
2    nuclei
3    dusts
4    touching
5    single
6    nuclei
Name: None, Length: 7

In [4]: Factor(s)
Out[4]: 
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]

该因子具有属性 labels 和 levels：

In [7]: f = Factor(s)

In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)

In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)

这是用于一维向量的，因此不确定它是否可以立即应用于您的问题，但请看一下。

顺便说一句，我建议您在 statsmodels 和/或 scikit-learn 邮件列表上提出这些问题，因为我们大多数人都不是 SO 的频繁用户。

If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):

In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

In [2]: s
Out[2]: 
0    single
1    touching
2    nuclei
3    dusts
4    touching
5    single
6    nuclei
Name: None, Length: 7

In [4]: Factor(s)
Out[4]: 
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]

The factor has attributes labels and levels:

In [7]: f = Factor(s)

In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)

In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)

This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.

BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.

回复收藏 0 原文

熟人话多 2024-12-17 15:13:31

我正在回答 Pandas 0.10.1 的问题。 Factor.from_array 似乎可以解决问题。

>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0    a
1    b
2    a
3    c
4    a
5    b
6    a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical: 
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)

I am answering the question for Pandas 0.10.1. Factor.from_array seems to do the trick.

>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0    a
1    b
2    a
3    c
4    a
5    b
6    a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical: 
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)

回复收藏 0 原文

神也荒唐 2024-12-17 15:13:31

因为这些都不适用于维度 >1，所以我编写了一些适用于任何 numpy 数组维度的代码：

def encode_categorical(array):
    d = {key: value for (key, value) in zip(np.unique(array), np.arange(len(u)))}
    shape = array.shape
    array = array.ravel()
    new_array = np.zeros(array.shape, dtype=np.int)
    for i in range(len(array)):
        new_array[i] = d[array[i]]
    return new_array.reshape(shape)

because none of these work for dimensions>1, I made some code working for any numpy array dimensionality:

def encode_categorical(array):
    d = {key: value for (key, value) in zip(np.unique(array), np.arange(len(u)))}
    shape = array.shape
    array = array.ravel()
    new_array = np.zeros(array.shape, dtype=np.int)
    for i in range(len(array)):
        new_array[i] = d[array[i]]
    return new_array.reshape(shape)

回复收藏 0 原文

~没有更多了~