Change column types in Pandas

I created a DataFrame from a list of lists:

table = [
    ['a',  '1.2',  '4.2' ],
    ['b',  '70',   '0.03'],
    ['x',  '5',    '0'   ],
]

df = pd.DataFrame(table)

How do I convert the columns to specific types? In this case, I want to convert columns 2 and 3 into floats.

Is there a way to specify the types while converting the list to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the dtype for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns, and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.

苍景流年 2025-02-18 00:37:02

Use this:

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df

Out[16]:
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes

Out[17]:
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes

Out[19]:
one       object
two      float64
three    float64
黒涩兲箜 2025-02-18 00:37:02

The code below will change the datatype of a column.

df[['col.name1', 'col.name2', ...]] = df[['col.name1', 'col.name2', ...]].astype('data_type')

In place of 'data_type', give whatever datatype you want, such as str, float, int, etc.
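
For example, a minimal sketch with made-up column names, casting one column to int and another to float:

import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3'], 'price': ['1.2', '70', '5']})

df[['id']] = df[['id']].astype(int)          # platform default integer size
df[['price']] = df[['price']].astype(float)

df.dtypes

id         int64    (int32 on some platforms)
price    float64
dtype: object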

唯憾梦倾城 2025-02-18 00:37:02

When I've only needed to specify specific columns, and I want to be explicit, I've used (per pandas.DataFrame.astype):

dataframe = dataframe.astype({'col_name_1':'int','col_name_2':'float64', etc. ...})

So, using the original question, but providing column names to it...

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col_name_1', 'col_name_2', 'col_name_3'])
df = df.astype({'col_name_2':'float64', 'col_name_3':'float64'})
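
If there are many such columns, the same mapping can be built with a dict comprehension instead of being written out by hand; a small sketch reusing the column names above:

float_cols = ['col_name_2', 'col_name_3']
df = df.astype({col: 'float64' for col in float_cols})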
时光沙漏 2025-02-18 00:37:02

pandas >= 1.0

Here's a chart that summarises some of the most important conversions in pandas.

Conversions to string are trivial .astype(str) and are not shown in the figure.

"Hard" versus "Soft" conversions

Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes

a    object
b    object
dtype: object

# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes

a    int64
b    int64
dtype: object

# Infers better data types for object data - soft conversion
df.infer_objects().dtypes

a    object  # no change
b     int64
dtype: object

# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
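
For reference, on this frame that last call reports roughly the following (nullable extension dtypes; note that the string values in 'a' are not parsed into numbers):

a    string
b     Int64
dtype: object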
箜明 2025-02-18 00:37:02
df = df.astype({"columnname": str})

#EG-用于将列类型更改为字符串
#DF是您的数据框

df = df.astype({"columnname": str})

#e.g - for changing the column type to string
#df is your dataframe

情话难免假 2025-02-18 00:37:02

Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.

# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

So, for your example:

import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])

coerce_df_columns_to_numeric(df, ['col2','col3'])
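
A quick check afterwards, with the expected dtypes for the example data above:

df.dtypes

col1     object
col2    float64
col3    float64
dtype: object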
自我难过 2025-02-18 00:37:02

0. Preserve integer type after conversion

As can be seen in Alex Riley's answer, pd.to_numeric(..., errors='coerce') converts integers into floats.

To preserve the integers, you must use a Nullable Integer Dtype such as 'Int64'. One option for that is to call .convert_dtypes() as in Alex Riley's answer. Or just use .astype('Int64').

Another option available since pandas 2.0, is to pass the dtype_backend parameter, which allows you to convert the dtype in one function call.

df = pd.DataFrame({'A': ['#DIV/0!', 3, 5]})
df['A'] = pd.to_numeric(df['A'], errors='coerce').convert_dtypes()
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')

df['A'] = pd.to_numeric(df['A'], errors='coerce', dtype_backend='numpy_nullable')

All of the above make the same transformation: the invalid entry becomes a missing value (<NA>) and the column ends up with the nullable Int64 dtype.

1. Convert string representation of long floats to numeric values

If a column contains string representation of really long floats that need to be evaluated with precision (float would round them after 15 digits and pd.to_numeric is even more imprecise), then use Decimal from the builtin decimal library. The dtype of the column will be object but decimal.Decimal supports all arithmetic operations, so you can still perform vectorized operations such as arithmetic and comparison operators etc.

from decimal import Decimal
df = pd.DataFrame({'long_float': ["0.1234567890123456789", "0.123456789012345678", "0.1234567890123456781"]})

df['w_float'] = df['long_float'].astype(float)       # imprecise
df['w_Decimal'] = df['long_float'].map(Decimal)      # precise

In the example above, float converts all of them into the same number whereas Decimal maintains their difference:

df['w_Decimal'] == Decimal(df.loc[1, 'long_float'])   # False, True, False
df['w_float'] == float(df.loc[1, 'long_float'])       # True, True, True

2. Convert string representation of long integers to integers

By default, astype(int) converts to the platform's default integer size, which may be int32 and therefore wouldn't work (OverflowError) if a number is particularly long (such as a phone number); try 'int64' (or even float) instead:

df['long_num'] = df['long_num'].astype('int64')

On a side note, if you get SettingWithCopyWarning, then turn on copy-on-write mode (see this answer for more info) and do whatever you were doing again. For example, if you were converting col1 and col2 to float dtype, then do:

pd.set_option('mode.copy_on_write', True)
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)

# or use assign to overwrite the old columns and make a new copy
df = df.assign(**df[['col1', 'col2']].astype(float))

3. Convert integers to timedelta

Also, the long string/integer may be a datetime or timedelta, in which case use to_datetime or to_timedelta to convert to the datetime/timedelta dtype:

df = pd.DataFrame({'long_int': ['1018880886000000000', '1590305014000000000', '1101470895000000000', '1586646272000000000', '1460958607000000000']})
df['datetime'] = pd.to_datetime(df['long_int'].astype('int64'))
# or
df['datetime'] = pd.to_datetime(df['long_int'].astype(float))

df['timedelta'] = pd.to_timedelta(df['long_int'].astype('int64'))

4. Convert timedelta to numbers

To perform the reverse operation (convert datetime/timedelta to numbers), view it as 'int64'. This could be useful if you were building a machine learning model that somehow needs to include time (or datetime) as a numeric value. Just make sure that if the original data are strings, then they must be converted to timedelta or datetime before any conversion to numbers.

df = pd.DataFrame({'Time diff': ['2 days 4:00:00', '3 days', '4 days', '5 days', '6 days']})
df['Time diff in nanoseconds'] = pd.to_timedelta(df['Time diff']).view('int64')
df['Time diff in seconds'] = pd.to_timedelta(df['Time diff']).view('int64') // 10**9
df['Time diff in hours'] = pd.to_timedelta(df['Time diff']).view('int64') // (3600*10**9)
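
Note that Series.view is deprecated in recent pandas versions; as a rough alternative sketch, floor-dividing the timedeltas by a unit-sized Timedelta gives the same integer counts:

df['Time diff in seconds'] = pd.to_timedelta(df['Time diff']) // pd.Timedelta('1s')
df['Time diff in hours'] = pd.to_timedelta(df['Time diff']) // pd.Timedelta('1h')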

5. Convert datetime to numbers

For datetime, the numeric view of a datetime is the time difference between that datetime and the UNIX epoch (1970-01-01).

df = pd.DataFrame({'Date': ['2002-04-15', '2020-05-24', '2004-11-26', '2020-04-11', '2016-04-18']})
df['Time_since_unix_epoch'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').view('int64')

6. astype is faster than to_numeric

df = pd.DataFrame(np.random.default_rng().choice(1000, size=(10000, 50)).astype(str))
df = pd.concat([df, pd.DataFrame(np.random.rand(10000, 50).astype(str), columns=range(50, 100))], axis=1)

%timeit df.astype(dict.fromkeys(df.columns[:50], int) | dict.fromkeys(df.columns[50:], float))
# 488 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.apply(pd.to_numeric)
# 686 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
情深如许 2025-02-18 00:37:02

Create two dataframes, each with different data types for their columns, and then append them together:

d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = d1.append(pd.DataFrame(columns=[ 'string_column' ], dtype=str))

Results

In [8]:  d1.dtypes
Out[8]:
float_column     float64
string_column     object
dtype: object

After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
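
As a side note, DataFrame.append was removed in pandas 2.0; on newer versions, one sketch of the same idea is to declare the per-column dtypes directly at construction time:

d1 = pd.DataFrame({'float_column': pd.Series(dtype=float),
                   'string_column': pd.Series(dtype=str)})

d1.dtypes

float_column     float64
string_column     object
dtype: object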

木落 2025-02-18 00:37:02

df.info() gives us the initial datatype of temp, which is float64:

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    132 non-null    object 
 1   temp    132 non-null    float64

Now, use this code to change the datatype to int64:

df['temp'] = df['temp'].astype('int64')

If you do df.info() again, you will see:

  #   Column  Non-Null Count  Dtype 
 ---  ------  --------------  ----- 
  0   date    132 non-null    object
  1   temp    132 non-null    int64 

This shows you have successfully changed the datatype of column temp. Happy coding!
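
One caveat: if temp contained missing values, astype('int64') would raise an error, while the nullable 'Int64' dtype (capital I) accepts them. A small sketch with made-up data:

df = pd.DataFrame({'temp': [21.0, None, 19.0]})
df['temp'] = df['temp'].astype('Int64')   # the missing value becomes <NA>; plain 'int64' would fail here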

椒妓 2025-02-18 00:37:02

Starting with pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control what types to convert!

In [40]: df = pd.DataFrame(
    ...:     {
    ...:         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
    ...:         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
    ...:         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
    ...:         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
    ...:         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
    ...:         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
    ...:     }
    ...: )

In [41]: dff = df.copy()

In [42]: df 
Out[42]: 
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0

In [43]: df.dtypes
Out[43]: 
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object

In [44]: df = df.convert_dtypes()

In [45]: df.dtypes
Out[45]: 
a      Int32
b     string
c    boolean
d     string
e      Int64
f    float64
dtype: object

In [46]: dff = dff.convert_dtypes(convert_boolean = False)

In [47]: dff.dtypes
Out[47]: 
a      Int32
b     string
c     object
d     string
e      Int64
f    float64
dtype: object
冰雪梦之恋 2025-02-18 00:37:02

In case you have various object columns, like this DataFrame of 74 object columns and 2 int columns, where each value has letters representing the units:

import pandas as pd 
import numpy as np

dataurl = 'https://raw.githubusercontent.com/RubenGavidia/Pandas_Portfolio.py/main/Wes_Mckinney.py/nutrition.csv'
nutrition = pd.read_csv(dataurl,index_col=[0])
nutrition.head(3)

Output:

    name    serving_size    calories    total_fat    saturated_fat    cholesterol    sodium    choline    folate    folic_acid    ...    fat    saturated_fatty_acids    monounsaturated_fatty_acids    polyunsaturated_fatty_acids    fatty_acids_total_trans    alcohol    ash    caffeine    theobromine    water
0    Cornstarch    100 g    381    0.1g    NaN    0    9.00 mg    0.4 mg    0.00 mcg    0.00 mcg    ...    0.05 g    0.009 g    0.016 g    0.025 g    0.00 mg    0.0 g    0.09 g    0.00 mg    0.00 mg    8.32 g
1    Nuts, pecans    100 g    691    72g    6.2g    0    0.00 mg    40.5 mg    22.00 mcg    0.00 mcg    ...    71.97 g    6.180 g    40.801 g    21.614 g    0.00 mg    0.0 g    1.49 g    0.00 mg    0.00 mg    3.52 g
2    Eggplant, raw    100 g    25    0.2g    NaN    0    2.00 mg    6.9 mg    22.00 mcg    0.00 mcg    ...    0.18 g    0.034 g    0.016 g    0.076 g    0.00 mg    0.0 g    0.66 g    0.00 mg    0.00 mg    92.30 g
3 rows × 76 columns

nutrition.dtypes
name             object
serving_size     object
calories          int64
total_fat        object
saturated_fat    object
                  ...
alcohol          object
ash              object
caffeine         object
theobromine      object
water            object
Length: 76, dtype: object

nutrition.dtypes.value_counts()
object    74
int64      2
dtype: int64

A good way to convert all columns to numeric is to use regular expressions to replace the units with nothing and astype(float) to change the column data types to float:

nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
nutrition.head(3)

Output:

serving_size    calories    total_fat    saturated_fat    cholesterol    sodium    choline    folate    folic_acid    niacin    ...    fat    saturated_fatty_acids    monounsaturated_fatty_acids    polyunsaturated_fatty_acids    fatty_acids_total_trans    alcohol    ash    caffeine    theobromine    water
name
Cornstarch    100.0    381.0    0.1    NaN    0.0    9.0    0.4    0.0    0.0    0.000    ...    0.05    0.009    0.016    0.025    0.0    0.0    0.09    0.0    0.0    8.32
Nuts, pecans    100.0    691.0    72.0    6.2    0.0    0.0    40.5    22.0    0.0    1.167    ...    71.97    6.180    40.801    21.614    0.0    0.0    1.49    0.0    0.0    3.52
Eggplant, raw    100.0    25.0    0.2    NaN    0.0    2.0    6.9    22.0    0.0    0.649    ...    0.18    0.034    0.016    0.076    0.0    0.0    0.66    0.0    0.0    92.30
3 rows × 75 columns

nutrition.dtypes
serving_size     float64
calories         float64
total_fat        float64
saturated_fat    float64
cholesterol      float64
                  ...
alcohol          float64
ash              float64
caffeine         float64
theobromine      float64
water            float64
Length: 75, dtype: object

nutrition.dtypes.value_counts()
float64    75
dtype: int64

Now the dataset is clean and you are able to do numeric operations on this DataFrame using only regex and astype().

If you want to collect the units and paste them onto the headers, like cholesterol_mg, you can use this code:

nutrition.index = pd.RangeIndex(start=0, stop=8789, step=1)
nutrition.set_index('name', inplace=True)

# keep only the letters of every cell to recover the unit used in each column
units = nutrition.astype(str).replace('[^a-zA-Z]', '', regex=True)
# most frequent unit per column; drop columns that carry no unit at all
units = units.mode()
units = units.replace('', np.nan).dropna(axis=1)

# rename e.g. 'cholesterol' to 'cholesterol_mg'
mapper = {k: k + "_" + units[k].at[0] for k in units}
nutrition.rename(columns=mapper, inplace=True)

# strip the letters from the values and convert everything to float
nutrition.replace('[a-zA-Z]', '', regex=True, inplace=True)
nutrition = nutrition.astype(float)
孤千羽 2025-02-18 00:37:02

I had the same issue.

I could not find any solution that was satisfying. My solution was simply to convert those floats into str and remove the '.0' that way.

In my case, I just apply it on the first column:

firstCol = list(df.columns)[0]
df[firstCol] = df[firstCol].fillna('').astype(str).apply(lambda x: x.replace('.0', ''))
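
One caveat with this: str.replace('.0', '') matches anywhere in the string, so a value such as '1.05' would also be altered. If that matters, anchoring the pattern to the end of the string is safer (a sketch using the same column selection):

firstCol = list(df.columns)[0]
df[firstCol] = df[firstCol].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)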
三生一梦 2025-02-18 00:37:02

Is there a way to specify the types while converting to DataFrame?

Yes. The other answers convert the dtypes after creating the DataFrame, but we can specify the types at creation. Use either DataFrame.from_records or read_csv(dtype=...) depending on the input format.

The latter is sometimes necessary to avoid memory errors with big data.


1. DataFrame.from_records

Create the DataFrame from a structured array of the desired column types:

x = [['foo', '1.2', '70'], ['bar', '4.2', '5']]

df = pd.DataFrame.from_records(np.array(
    [tuple(row) for row in x], # pass a list-of-tuples (x can be a list-of-lists or 2D array)
    'object, float, int'       # define the column types
))

Output:

>>> df.dtypes
# f0     object
# f1    float64
# f2      int64
# dtype: object

2. read_csv(dtype=...)

If you're reading the data from a file, use the dtype parameter of read_csv to set the column types at load time.

For example, here we read 30M rows with rating as 8-bit integers and genre as categorical:

import io

lines = '''
foo,biography,5
bar,crime,4
baz,fantasy,3
qux,history,2
quux,horror,1
'''
columns = ['name', 'genre', 'rating']
csv = io.StringIO(lines * 6_000_000) # 30M lines

df = pd.read_csv(csv, names=columns, dtype={'rating': 'int8', 'genre': 'category'})

In this case, we halve the memory usage upon load:

>>> df.info(memory_usage='deep')
# memory usage: 1.8 GB
>>> pd.read_csv(io.StringIO(lines * 6_000_000)).info(memory_usage='deep')
# memory usage: 3.7 GB

This is one way to avoid memory errors with big data. It's not always possible to change the dtypes after loading since we might not have enough memory to load the default-typed data in the first place.

任性一次 2025-02-18 00:37:02

I thought I had the same problem, but actually I have a slight difference that makes the problem easier to solve. For others looking at this question, it's worth checking the format of your input list. In my case the numbers are initially floats, not strings as in the question:

a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]

But by processing the list too much before creating the dataframe, I lose the types and everything becomes a string.

Creating the data frame via a NumPy array:

df = pd.DataFrame(np.array(a))
df

Out[5]:
   0    1     2
0  a  1.2   4.2
1  b   70  0.03
2  x    5     0

df[1].dtype
Out[7]: dtype('O')

gives the same data frame as in the question, where the entries in columns 1 and 2 are treated as strings. However, doing

df = pd.DataFrame(a)

df
Out[10]:
   0     1     2
0  a   1.2  4.20
1  b  70.0  0.03
2  x   5.0  0.00

df[1].dtype
Out[11]: dtype('float64')

does actually give a data frame with the columns in the correct format. (A NumPy array is homogeneous, so np.array(a) coerces every element to a single common dtype, here a string type, whereas passing the nested list directly lets pandas infer a dtype per column.)

初见终念 2025-02-18 00:37:02

If you want to convert one column from string format, I suggest using this code:

import pandas as pd
#My Test Data
data = {'Product': ['A','B', 'C','D'],
          'Price': ['210','250', '320','280']}
data


# Create DataFrame from my data
df = pd.DataFrame(data)

#Convert to number
df['Price'] = pd.to_numeric(df['Price'])
df

Total = sum(df['Price'])
Total

Otherwise, if you are going to convert a number of column values to numbers, I suggest you first filter your values, save them in an empty array, and after that convert them to numbers. I hope this code solves your problem.

泅渡 2025-02-18 00:37:01

You have four main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

  4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of the columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
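
One caveat: in recent pandas releases (2.2 and later), errors='ignore' is deprecated, so the whole-DataFrame trick above may eventually stop working. A rough, hypothetical replacement is to try each column individually and keep the original wherever conversion fails:

def to_numeric_where_possible(df):
    """Convert each column to numeric where every value parses; leave the others untouched."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # column holds non-numeric values; keep it as-is
    return out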

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', 'float'. Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2⁸ - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
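
For what it's worth, downcasting is only applied when it can be done without changing the values, so with a negative number present the series should simply stay int64 rather than wrapping around:

>>> pd.to_numeric(s, downcast='unsigned')
0    1
1    2
2   -7
dtype: int64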


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.


4. convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type, if all of the values are integers (or missing values): an object column of Python integer objects are converted to Int64, a column of NumPy int32 values, will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

>>> df.convert_dtypes().dtypes                                             
a     Int64
b    string
dtype: object

Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column 'b' contained string objects, so was changed to pandas' string dtype.

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

>>> df.convert_dtypes(infer_objects=False).dtypes                          
a    object
b    string
dtype: object

Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
