I created a DataFrame from a list of lists:
import pandas as pd

table = [
    ['a', '1.2', '4.2'],
    ['b', '70', '0.03'],
    ['x', '5', '0'],
]
df = pd.DataFrame(table)
How do I convert the columns to specific types? In this case, I want to convert columns 2 and 3 into floats.
Is there a way to specify the types while converting the list to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the dtype for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns, and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
Use this:
The code below changes the data type of a column.
In place of the data type shown, you can use whichever type you want, such as str, float, or int.
When I've only needed to specify specific columns, and I want to be explicit, I've used (per pandas.DataFrame.astype):
So, using the original question, but providing column names to it...
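The answer's own snippet is not reproduced on this page. A minimal sketch of the idea, assuming the dict form of DataFrame.astype and hypothetical column names ('one', 'two', 'three'), since the question's DataFrame has none:

```python
import pandas as pd

table = [
    ['a', '1.2', '4.2'],
    ['b', '70', '0.03'],
    ['x', '5', '0'],
]
# Column names are hypothetical; the original DataFrame is unnamed.
df = pd.DataFrame(table, columns=['one', 'two', 'three'])

# Convert only the named columns; column 'one' is left untouched.
df = df.astype({'two': float, 'three': float})
print(df.dtypes)
```

Passing a dict keeps the conversion explicit: any column not listed keeps its current dtype.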
pandas >= 1.0
Here's a chart that summarises some of the most important conversions in pandas.
Conversions to string are trivial (.astype(str)) and are not shown in the figure.

"Hard" versus "Soft" conversions
Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at
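The example this sentence pointed to is missing from the page. As a rough illustration of the distinction only (the data here is made up, and this is not the answer's original example), a hard conversion parses text while a soft conversion just tightens an overly generic object dtype:

```python
import pandas as pd

# Hard conversion: the values are text and have to be parsed.
text = pd.Series(['1', '2', '3'])
hard = pd.to_numeric(text)

# Soft conversion: the values are already Python ints; only the
# column's dtype (object) is too generic.
boxed = pd.Series([1, 2, 3], dtype='object')
soft = boxed.infer_objects()

print(hard.dtype, boxed.dtype, soft.dtype)
```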
#e.g - for changing the column type to string
#df is your dataframe
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
So, for your example:
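The function itself did not survive on this page. A sketch of what such a helper could look like, assuming it is built on pd.to_numeric with errors='coerce'; the function name, its copy-returning behaviour, and the column names are my own choices, not the answer's:

```python
import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    """Return a copy of df with the listed columns parsed as numbers.

    Values that cannot be parsed become NaN (errors='coerce').
    """
    out = df.copy()
    out[column_list] = out[column_list].apply(pd.to_numeric, errors='coerce')
    return out

table = [
    ['a', '1.2', '4.2'],
    ['b', '70', '0.03'],
    ['x', '5', '0'],
]
# Hypothetical column names for the question's data.
df = pd.DataFrame(table, columns=['col1', 'col2', 'col3'])
converted = coerce_df_columns_to_numeric(df, ['col2', 'col3'])
print(converted.dtypes)
```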
0. Preserve integer type after conversion

As can be seen in Alex Riley's answer, pd.to_numeric(..., errors='coerce') converts integers into floats whenever a value has to be coerced to NaN. To preserve the integers, you must use a nullable integer dtype such as 'Int64'. One option is to call .convert_dtypes() as in Alex Riley's answer, or just use .astype('Int64').

Another option, available since pandas 2.0, is to pass the dtype_backend parameter, which allows you to convert the dtype in one function call.

All of the above make the following transformation:
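The original snippet is not shown here. A small sketch of the first two options on made-up data; the dtype_backend variant is left out so that the sketch also runs on pandas older than 2.0:

```python
import pandas as pd

s = pd.Series(['1', '2', 'not a number'])

# The bad value becomes NaN, which forces the result to float64.
coerced = pd.to_numeric(s, errors='coerce')

# Either of these brings the integers back, with <NA> for the gap.
nullable = coerced.astype('Int64')
inferred = coerced.convert_dtypes()

print(coerced.dtype, nullable.dtype, inferred.dtype)
```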
1. Convert string representation of long floats to numeric values

If a column contains string representations of really long floats that need to be evaluated with precision (float would round them after 15 digits and pd.to_numeric is even more imprecise), then use Decimal from the builtin decimal library. The dtype of the column will be object, but decimal.Decimal supports all arithmetic operations, so you can still perform vectorized operations such as arithmetic and comparisons.

In the example above, float converts all of them into the same number whereas Decimal maintains their difference:

2. Convert string representation of long integers to integers

On platforms where the default integer is 32-bit, astype(int) converts to int32, which wouldn't work (OverflowError) if a number is particularly long (such as a phone number); try 'int64' (or even float) instead:

On a side note, if you get SettingWithCopyWarning, then turn on copy-on-write mode (see this answer for more info) and do whatever you were doing again. For example, if you were converting col1 and col2 to float dtype, then do:

3. Convert integers to timedelta

Also, the long string/integer may be a datetime or timedelta, in which case use to_datetime or to_timedelta to convert to datetime/timedelta dtype:

4. Convert timedelta to numbers

To perform the reverse operation (convert datetime/timedelta to numbers), view it as 'int64'. This could be useful if you were building a machine learning model that somehow needs to include time (or datetime) as a numeric value. Just make sure that if the original data are strings, they are converted to timedelta or datetime before any conversion to numbers.

5. Convert datetime to numbers

For datetime, the numeric view of a datetime is the time difference between that datetime and the UNIX epoch (1970-01-01).
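The snippets for sections 3 to 5 are missing from this page. A hedged sketch on made-up data: it pins the resolution to nanoseconds explicitly, because the integer you get depends on the unit of the datetime/timedelta dtype, and it uses astype('int64') rather than a view, since my understanding is that Series.view is deprecated in recent pandas releases:

```python
import pandas as pd

# Hypothetical data: durations stored as integer seconds.
seconds = pd.Series([30, 90, 3600])

# Integers -> timedelta, pinned to nanosecond resolution.
td = pd.to_timedelta(seconds, unit='s').astype('timedelta64[ns]')

# Timedelta -> numbers: nanoseconds as int64.
td_as_int = td.astype('int64')

# Strings -> datetime -> nanoseconds since the UNIX epoch.
dates = pd.to_datetime(pd.Series(['1970-01-01', '1970-01-02']))
dates_as_int = dates.astype('datetime64[ns]').astype('int64')

print(td_as_int.tolist())
print(dates_as_int.tolist())
```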
6. astype is faster than to_numeric
Create two DataFrames, each with different data types for their columns, and then append them together:
Results
After the DataFrame is created, you can populate it with floating-point values in the 1st column, and strings (or any data type you desire) in the 2nd column.
df.info() gives us the initial datatype of temp, which is float64.
Now, use this code to change the datatype to int64:
If you do df.info() again, you will see:
This shows you have successfully changed the datatype of column temp. Happy coding!
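The code and the df.info() output are not shown on this page. The steps described above can be sketched like this, with made-up values for the temp column:

```python
import pandas as pd

# Hypothetical data; only the column name 'temp' comes from the answer.
df = pd.DataFrame({'temp': [21.0, 25.0, 19.0]})
before = df['temp'].dtype

# Assign the converted column back, otherwise df is unchanged.
df['temp'] = df['temp'].astype('int64')
after = df['temp'].dtype

print(before, '->', after)
```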
Starting with pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control which types to convert!

In case you have various object columns, like this DataFrame of 74 object columns and 2 int columns, where each value has letters representing units:
Output:
A good way to convert all columns to numeric is to use a regular expression to replace the units with nothing, and then astype(float) to change the columns' data type to float:
Output:
Now the dataset is clean and you can do numeric operations on this DataFrame, using only regex and astype().
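The answer's data and code are not reproduced here. A minimal sketch of the regex-then-astype idea on a tiny made-up frame; the column names, values, and exact pattern are assumptions:

```python
import pandas as pd

# Hypothetical stand-in for the nutrition-style data described above.
df = pd.DataFrame({
    'cholesterol': ['0.0 mg', '2.5 mg'],
    'protein': ['1.2 g', '3 g'],
})

# Strip optional whitespace plus the trailing unit letters, then cast.
cleaned = df.replace(r'\s*[a-zA-Z]+$', '', regex=True).astype(float)
print(cleaned.dtypes)
```

The pattern only removes letters at the end of each value, so it would need adjusting for units that appear elsewhere in the string.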
If you want to collect the units and paste them onto the headers, like cholesterol_mg, you can use this code:

I had the same issue.
I could not find a satisfying solution. My solution was simply to convert those floats into str and remove the '.0' this way.
In my case, I just apply it on the first column:
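The snippet is missing, so this is only a guess at the approach: cast to str and strip a trailing '.0' with a regex. The column name and data are made up:

```python
import pandas as pd

# Hypothetical frame whose first column holds whole-number floats.
df = pd.DataFrame({'id': [1.0, 2.0, 10.0], 'value': [0.5, 0.7, 0.9]})

first = df.columns[0]
# '1.0' -> '1'; the anchor keeps values like '10.05' intact.
df[first] = df[first].astype(str).str.replace(r'\.0$', '', regex=True)
print(df[first].tolist())
```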
Yes. The other answers convert the dtypes after creating the DataFrame, but we can specify the types at creation. Use either DataFrame.from_records or read_csv(dtype=...) depending on the input format.

The latter is sometimes necessary to avoid memory errors with big data.

1. DataFrame.from_records

Create the DataFrame from a structured array of the desired column types:
Output:
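Neither the snippet nor its output survived on this page. A sketch of the structured-array route using the question's data; the field names ('label', 'x', 'y') are hypothetical, and it relies on NumPy parsing the numeric strings when filling the float fields:

```python
import numpy as np
import pandas as pd

table = [
    ['a', '1.2', '4.2'],
    ['b', '70', '0.03'],
    ['x', '5', '0'],
]

# Structured arrays are built from tuples, one per row.
records = np.array(
    [tuple(row) for row in table],
    dtype=[('label', 'O'), ('x', 'f8'), ('y', 'f8')],  # one type per column
)
df = pd.DataFrame.from_records(records)
print(df.dtypes)
```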
2. read_csv(dtype=...)

If you're reading the data from a file, use the dtype parameter of read_csv to set the column types at load time.

For example, here we read 30M rows with rating as 8-bit integers and genre as categorical:

In this case, we halve the memory usage upon load:
This is one way to avoid memory errors with big data. It's not always possible to change the dtypes after loading since we might not have enough memory to load the default-typed data in the first place.
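The 30M-row file is obviously not available here, so the following is just a toy version of the same call, reading a two-row in-memory CSV (the contents are made up; only the column names rating and genre come from the answer):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the large file.
csv_file = io.StringIO("rating,genre\n5,drama\n3,comedy\n")

# Types are applied while parsing, so the default-typed data
# never has to fit in memory first.
df = pd.read_csv(csv_file, dtype={'rating': 'int8', 'genre': 'category'})
print(df.dtypes)
```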
I thought I had the same problem, but actually I have a slight difference that makes the problem easier to solve. For others looking at this question, it's worth checking the format of your input list. In my case the numbers are initially floats, not strings as in the question:
But by processing the list too much before creating the dataframe, I lose the types and everything becomes a string.
Creating the data frame via a NumPy array:
gives the same data frame as in the question, where the entries in columns 1 and 2 are considered as strings. However doing
does actually give a data frame with the columns in the correct format.
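The three snippets are missing from the page. A reconstruction of the point being made, assuming the input list looks like the question's table but with real numbers instead of strings:

```python
import numpy as np
import pandas as pd

# Numbers are real floats/ints here, not strings as in the question.
table = [
    ['a', 1.2, 4.2],
    ['b', 70, 0.03],
    ['x', 5, 0],
]

# A NumPy array has a single dtype, so everything becomes a string.
via_numpy = pd.DataFrame(np.array(table))

# Building the DataFrame directly keeps a dtype per column.
direct = pd.DataFrame(table)

print(type(via_numpy.iloc[0, 1]), direct.dtypes.tolist())
```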
If you want to convert one column from string format, I suggest using this code:
Otherwise, if you are going to convert a number of column values to numbers, I suggest that you first filter your values and save them in an empty array, and after that convert them to numbers. I hope this code solves your problem.
You have four main options for converting types in pandas:

to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.
1.
to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

You can also use it to convert multiple columns of a DataFrame via the apply() method:

As long as your values can all be converted, that's probably all you need.
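The snippets for this subsection are not shown on this page. A short sketch of the basic usage just described, with a made-up Series and the question's table under hypothetical column names:

```python
import pandas as pd

# A single Series with mixed strings and numbers.
s = pd.Series(['8', 6, '7.5', 3, '0.9'])
converted = pd.to_numeric(s)  # returns a new Series; s is unchanged

# Several DataFrame columns at once via apply().
df = pd.DataFrame(
    [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']],
    columns=['a', 'b', 'c'],  # hypothetical names
)
df[['b', 'c']] = df[['b', 'c']].apply(pd.to_numeric)
print(converted.dtype, df.dtypes.tolist())
```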
Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

The third option for errors is just to ignore the operation if an invalid value is encountered:

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of its columns can be converted reliably to a numeric type. In that case, just write:

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
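The original examples are missing, so here is a sketch of the raise and coerce behaviours on a made-up Series. It deliberately leaves out errors='ignore': my understanding is that recent pandas releases deprecate that option, so check the version you are running before relying on it:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

# Default (errors='raise'): the unparseable string raises ValueError.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# errors='coerce': the unparseable string becomes NaN instead.
coerced = pd.to_numeric(s, errors='coerce')
print(raised, coerced.tolist())
```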
Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?

to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple series s of integer type:

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

Downcasting to 'float' similarly picks a smaller than normal floating type:

2.
astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!
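The snippets for this warning are not shown here. A sketch of the wrap-around on a made-up Series, next to what to_numeric's downcast option does with the same values (it only downcasts when every value fits):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -7])

# astype() does what it is told: -7 wraps around in uint8.
wrapped = s.astype(np.uint8)

# downcast='unsigned' declines because of the negative value,
# while downcast='integer' finds the smallest signed type that fits.
safe = pd.to_numeric(s, downcast='unsigned')
small = pd.to_numeric(s, downcast='integer')

print(wrapped.tolist(), safe.dtype, small.dtype)
```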
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.

3.
infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

Using infer_objects(), you can change the type of column 'a' to int64:

Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.

4.
convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column 'b' contained string objects, so was changed to pandas' string dtype.

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have, so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.