How to replace NaN values in a dataframe column

Posted on 2025-02-10 13:35:20 | 1010 words | 2 views | 0 comments

I have a Pandas Dataframe as below:

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

When I try to apply a function to the Amount column, I get the following error:

ValueError: cannot convert float NaN to integer

I have tried applying a function using math.isnan, the pandas .replace method, the .sparse data attribute from pandas 0.9, and an if NaN == NaN statement inside a function; I have also looked at this Q/A. None of them works.

How do I do it?

Comments (16)

梨涡 2025-02-17 13:35:20

DataFrame.fillna() or Series.fillna() will do this for you.

Example:

In [7]: df
Out[7]: 
          0         1
0       NaN       NaN
1 -0.494375  0.570994
2       NaN       NaN
3  1.876360 -0.229738
4       NaN       NaN

In [8]: df.fillna(0)
Out[8]: 
          0         1
0  0.000000  0.000000
1 -0.494375  0.570994
2  0.000000  0.000000
3  1.876360 -0.229738
4  0.000000  0.000000

To fill the NaNs in only one column, select just that column.

In [12]: df[1] = df[1].fillna(0)

In [13]: df
Out[13]: 
          0         1
0       NaN  0.000000
1 -0.494375  0.570994
2       NaN  0.000000
3  1.876360 -0.229738
4       NaN  0.000000

Or you can use the built-in column-specific functionality:

df = df.fillna({1: 0})
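
As a minimal, self-contained sketch of the three variants above (the frame contents are made up for illustration; the integer column labels 0 and 1 simply mirror the example):

```python
import numpy as np
import pandas as pd

# Hypothetical data; integer column labels 0 and 1 mirror the example above.
df = pd.DataFrame({0: [np.nan, -0.5, np.nan], 1: [np.nan, 0.6, np.nan]})

all_filled = df.fillna(0)                 # every NaN in the frame becomes 0

one_column = df.copy()
one_column[1] = one_column[1].fillna(0)   # only column 1 is filled

by_dict = df.fillna({1: 0})               # column-specific dict form, also fills only column 1

print(all_filled)
print(one_column)
print(by_dict)
```

In all three cases fillna returns a new object, so the result has to be assigned to a name (or back to the column) to be kept.
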
深海不蓝 2025-02-17 13:35:20

Slicing is not guaranteed to return a view rather than a copy, so an in-place fill on a slice may not reach the original. You can assign the result back instead:

df['column'] = df['column'].fillna(value)
掩耳倾听 2025-02-17 13:35:20

You could use replace to change NaN to 0:

import pandas as pd
import numpy as np

# for column
df['column'] = df['column'].replace(np.nan, 0)

# for whole dataframe
df = df.replace(np.nan, 0)

# inplace
df.replace(np.nan, 0, inplace=True)
阿楠 2025-02-17 13:35:20

The below code worked for me.

import pandas

df = pandas.read_csv('somefile.txt')

df = df.fillna(0)
你曾走过我的故事 2025-02-17 13:35:20

I just wanted to provide a special case. If you're using a multi-index or otherwise using an index-slicer, the inplace=True option may not be enough to update the slice you've chosen. For example in a 2x2 level multi-index this will not change any values (as of pandas 0.15):

idx = pd.IndexSlice
df.loc[idx[:,mask_1], idx[mask_2,:]].fillna(value=0, inplace=True)

The "problem" is that the chaining breaks the fillna ability to update the original dataframe. I put "problem" in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.

The solution is DataFrame.update:

df.update(df.loc[idx[:,mask_1], idx[[mask_2],:]].fillna(value=0))

It's one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!

If anybody can find places this doesn't work please post in the comments, I've been messing with it and looking at the source and it seems to solve at least my multi-index slice problems.
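
A minimal sketch of the same fill-then-update pattern, using a flat index and made-up column names to keep it short (the multi-index case in the answer follows the same idea, but this sketch does not exercise it):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a flat index; names "a" and "b" are illustrative only.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})
rows = df["a"] > 1          # boolean mask that selects only the last row

# A chained call such as
#     df.loc[rows, ["b"]].fillna(0, inplace=True)
# fills a temporary object, so df is not reliably modified.

# Fill the selection, then write the non-NaN results back by label.
df.update(df.loc[rows, ["b"]].fillna(0))
print(df)
```

DataFrame.update aligns on index and column labels and only copies non-NA values, which is why cells outside the selection keep their NaN.
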

生死何惧 2025-02-17 13:35:20

You can also pass a dictionary to fill NaN values in specific columns of the DataFrame, rather than filling the whole DataFrame with a single value.

import pandas as pd

df = pd.read_excel('example.xlsx')
# Map each column name to the value that should replace NaN in that column.
df.fillna({
        'column1': 'Write your values here',
        'column2': 'Write your values here',
        'column3': 'Write your values here',
        'column4': 'Write your values here',
        # ... one entry per column ...
        'column-n': 'Write your values here'}, inplace=True)
樱花落人离去 2025-02-17 13:35:20

An easy way to fill missing values:

Filling string columns: when a string column has missing (NaN) values, fill them with the most frequent value.

df['string column name'] = df['string column name'].fillna(df['string column name'].mode().values[0])

Filling numeric columns: when a numeric column has missing (NaN) values, fill them with the column mean.

df['numeric column name'] = df['numeric column name'].fillna(df['numeric column name'].mean())

Filling NaN with zero:

df['column name'] = df['column name'].fillna(0)

Assigning the result back is more reliable than calling fillna(..., inplace=True) on a selected column, because the selection may be a copy.
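
A small runnable sketch of the mode and mean fills, assuming a made-up frame with one string column ("city") and one numeric column ("price"):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column and one numeric column, each with a gap.
df = pd.DataFrame({
    "city": ["Oslo", None, "Oslo", "Rome"],
    "price": [10.0, np.nan, 30.0, 20.0],
})

# Most frequent value for the string column, mean for the numeric column.
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

Whether the mode or mean is a sensible replacement depends on the data; this only shows the mechanics.
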
你丑哭了我 2025-02-17 13:35:20

To replace NA values in pandas:

df['column_name'].fillna(value_to_be_replaced, inplace=True)

If inplace=False (the default), fillna does not update df (the DataFrame); it returns the modified values instead.

南汐寒笙箫 2025-02-17 13:35:20

Considering the particular column Amount in the above table is of integer type, the following would be a solution:

df['Amount'] = df['Amount'].fillna(0).astype(int)

Similarly, you can fill it with various data types like float, str and so on.

In particular, I would pay attention to the data type when comparing values within the same column.
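
A short sketch of the cast, using three values from the Amount column in the question. The second variant, the nullable "Int64" dtype, is not part of the answer above; it is shown as an alternative for cases where the gap should stay missing rather than become 0.

```python
import numpy as np
import pandas as pd

# A few values taken from the Amount column in the question.
amount = pd.Series([71781, np.nan, 11072], name="Amount")

as_int = amount.fillna(0).astype(int)   # NaN replaced by 0, then cast to integer
nullable = amount.astype("Int64")       # nullable integer dtype keeps the gap as <NA>

print(as_int.tolist())
print(nullable.isna().tolist())
```
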

那伤。 2025-02-17 13:35:20

To replace NaN in different columns with different values:

replacement = {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df.fillna(value=replacement)
孤千羽 2025-02-17 13:35:20

This works for me, but no one has mentioned it. Could there be something wrong with it?

df.loc[df['column_name'].isnull(), 'column_name'] = 0
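
As a quick check on a made-up three-row column, the boolean-mask assignment gives the same values as fillna(0). This is only a sketch on toy data, not a general proof:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": [1.0, np.nan, 3.0]})
expected = df["column_name"].fillna(0)   # reference result, computed before modifying df

# Boolean-mask assignment through .loc writes into df directly.
df.loc[df["column_name"].isnull(), "column_name"] = 0

print(df["column_name"].tolist())
print(df["column_name"].equals(expected))
```
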
海风掠过北极光 2025-02-17 13:35:20

There are primarily two options. For imputing or filling missing values (NaN / np.nan) with numerical replacements only, across one or more columns:

df['Amount'].fillna(value=None, method= ,axis=1,) is sufficient:

From the Documentation:

value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). (values not
in the dict/Series/DataFrame will not be filled). This value cannot
be a list.

Which means 'strings' or 'constants' are no longer permissible to be imputed.

For more specialized imputations use SimpleImputer():

import numpy as np
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['C-1', 'C-2']])
就像说晚安 2025-02-17 13:35:20

If you want to fill NaN for a specific column, you can use loc:

import numpy as np
import pandas as pd

d1 = {"Col1": ['A', 'B', 'C'],
      "fruits": ['Avocado', 'Banana', np.nan]}
d1 = pd.DataFrame(d1)

output:

  Col1   fruits
0    A  Avocado
1    B   Banana
2    C      NaN

# Assign to the 'fruits' cell of every row where Col1 is 'C'
d1.loc[d1.Col1=='C', 'fruits'] = 'Carrot'

output:

  Col1   fruits
0    A  Avocado
1    B   Banana
2    C   Carrot
嘿嘿嘿 2025-02-17 13:35:20

I think it is also worth mentioning and explaining
the parameters of fillna(),
such as method, axis, limit, etc.

From the documentation we have:

Series.fillna(value=None, method=None, axis=None, 
                 inplace=False, limit=None, downcast=None)
Fill NA/NaN values using the specified method.

Parameters

value [scalar, dict, Series, or DataFrame] Value to use to 
 fill holes (e.g. 0), alternately a dict/Series/DataFrame 
 of values specifying which value to use for each index 
 (for a Series) or column (for a DataFrame). Values not in 
 the dict/Series/DataFrame will not be filled. This 
 value cannot be a list.

method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, 
 default None] Method to use for filling holes in 
 reindexed Series. pad / ffill: propagate last valid 
 observation forward to next valid. backfill / bfill: 
 use next valid observation to fill gap.

axis [{0 or ‘index’}] Axis along which to fill missing values.

inplace [bool, default False] If True, fill 
 in-place. Note: this will modify any other views
 on this object (e.g., a no-copy slice for a 
 column in a DataFrame).

limit [int, default None] If method is specified, 
 this is the maximum number of consecutive NaN 
 values to forward/backward fill. In other words, 
 if there is a gap with more than this number of 
 consecutive NaNs, it will only be partially filled. 
 If method is not specified, this is the maximum 
 number of entries along the entire axis where NaNs
 will be filled. Must be greater than 0 if not None.

downcast [dict, default is None] A dict of item->dtype 
 of what to downcast if possible, or the string ‘infer’ 
 which will try to downcast to an appropriate equal 
 type (e.g. float64 to int64 if possible).

OK. Let's start with the method= parameter, which offers
forward fill (ffill) and backward fill (bfill).
ffill copies the previous non-missing value forward;
bfill copies the next non-missing value backward.

For example:

import pandas as pd
import numpy as np
inp = [{'c1':10, 'c2':np.nan, 'c3':200}, {'c1':np.nan,'c2':110, 'c3':210}, {'c1':12,'c2':np.nan, 'c3':220},{'c1':12,'c2':130, 'c3':np.nan},{'c1':12,'c2':np.nan, 'c3':240}]
df = pd.DataFrame(inp)

     c1     c2     c3
0  10.0    NaN  200.0
1   NaN  110.0  210.0
2  12.0    NaN  220.0
3  12.0  130.0    NaN
4  12.0    NaN  240.0

Forward fill:

df.fillna(method="ffill")

    c1     c2      c3
0   10.0      NaN   200.0
1   10.0    110.0   210.0
2   12.0    110.0   220.0
3   12.0    130.0   220.0
4   12.0    130.0   240.0

Backward fill:

df.fillna(method="bfill")

    c1      c2     c3
0   10.0    110.0   200.0
1   12.0    110.0   210.0
2   12.0    130.0   220.0
3   12.0    130.0   240.0
4   12.0      NaN   240.0
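
Newer pandas releases (2.1 and later, to my knowledge) deprecate the method= argument of fillna in favor of the dedicated ffill() and bfill() methods. A sketch of the same two fills on the example frame, using those methods:

```python
import numpy as np
import pandas as pd

# Same data as the example frame above.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

forward = df.ffill()    # propagate the previous valid value down each column
backward = df.bfill()   # pull the next valid value up each column

print(forward)
print(backward)
```

A leading NaN survives ffill and a trailing NaN survives bfill, because there is no earlier or later value to copy.
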

The axis parameter helps us choose the direction of the fill:

Fill directions:

ffill:

Axis = 1 
Method = 'ffill'
----------->
  direction 

df.fillna(method="ffill", axis=1)

     c1     c2     c3
0  10.0   10.0  200.0
1   NaN  110.0  210.0
2  12.0   12.0  220.0
3  12.0  130.0  130.0
4  12.0   12.0  240.0

Axis = 0 # by default 
Method = 'ffill'
|
|       # direction 
|
V
e.g: # This is the ffill default
df.fillna(method="ffill", axis=0)

    c1     c2      c3
0   10.0      NaN   200.0
1   10.0    110.0   210.0
2   12.0    110.0   220.0
3   12.0    130.0   220.0
4   12.0    130.0   240.0

bfill:

axis= 0
method = 'bfill'
^
|
|
|
df.fillna(method="bfill", axis=0)

    c1     c2      c3
0   10.0    110.0   200.0
1   12.0    110.0   210.0
2   12.0    130.0   220.0
3   12.0    130.0   240.0
4   12.0      NaN   240.0

axis = 1
method = 'bfill'
<-----------
df.fillna(method="bfill", axis=1)
        c1     c2       c3
0    10.0   200.0   200.0
1   110.0   110.0   210.0
2    12.0   220.0   220.0
3    12.0   130.0     NaN
4    12.0   240.0   240.0

# alias:
#  'ffill' == 'pad'
#  'bfill' == 'backfill'
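
The row-wise direction can be checked with a short sketch on the same example frame. It uses ffill(axis=1) and bfill(axis=1), which I am assuming behave like the fillna(method=..., axis=1) calls shown above:

```python
import numpy as np
import pandas as pd

# Same data as the example frame above.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

forward_rows = df.ffill(axis=1)    # fill left to right within each row
backward_rows = df.bfill(axis=1)   # fill right to left within each row

print(forward_rows)
print(backward_rows)
```
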

limit parameter:

df
    c1     c2      c3
0   10.0      NaN   200.0
1    NaN    110.0   210.0
2   12.0      NaN   220.0
3   12.0    130.0     NaN
4   12.0      NaN   240.0

Replace only the first NaN in each column:

df.fillna(value = 'Unavailable', limit=1)
            c1           c2          c3
0          10.0 Unavailable       200.0
1   Unavailable       110.0       210.0
2          12.0         NaN       220.0
3          12.0       130.0 Unavailable
4          12.0         NaN       240.0

df.fillna(value = 'Unavailable', limit=2)

           c1            c2          c3
0          10.0 Unavailable       200.0
1   Unavailable       110.0       210.0
2          12.0 Unavailable       220.0
3          12.0       130.0 Unavailable
4          12.0         NaN       240.0
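
A sketch of the limit behaviour on the same frame. It fills with the number 0 instead of the string 'Unavailable' so that the columns stay numeric; that substitution is mine, not the answer's.

```python
import numpy as np
import pandas as pd

# Same data as the example frame above.
df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

filled_one = df.fillna(0, limit=1)   # at most one NaN filled per column
filled_two = df.fillna(0, limit=2)   # at most two NaNs filled per column

print(filled_one)
print(filled_two)
```

Column c2 has three gaps, so it is the only column where the two limits give different results.
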

downcast parameter:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      4 non-null      float64
 1   c2      2 non-null      float64
 2   c3      4 non-null      float64
dtypes: float64(3)
memory usage: 248.0 bytes

df.fillna(method="ffill",downcast='infer').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      5 non-null      int64  
 1   c2      4 non-null      float64
 2   c3      5 non-null      int64  
dtypes: float64(1), int64(2)
memory usage: 248.0 bytes
阳光下的泡沫是彩色的 2025-02-17 13:35:20

If you're reading data with missing values from a file using read_csv etc., then you can pass keep_default_na=False to read missing values as empty strings (""). In specific cases, this is useful because it achieves what fillna or replace does in one function call (with one less copy in memory).

df = pd.read_csv(filepath, keep_default_na=False)

# the above is same as
df = pd.read_csv(filepath).fillna("")
# or
df = pd.read_csv(filepath).replace(np.nan, "")

If the dataframe contains numbers, then you can pass dtypes to read_csv to construct a dataframe with the desired dtype columns.

df = pd.read_csv(filepath, keep_default_na=False, dtype={"col1": "Int64", "col2": "string", "col3": "Float64"})

Another way to replace NaN is via the mask()/where() methods. They are similar: mask replaces values that satisfy the condition, whereas where replaces values that do not satisfy it. So to use them, we just have to select the NaN values and replace them with the desired value.

import pandas as pd

df = pd.DataFrame({'a': [1, float('nan'), float('nan')], 'b': [float('nan'), 'a', 'b']})

df = df.where(df.notna(), 10)                 # for the entire dataframe
df['a'] = df['a'].where(df['a'].notna(), 10)  # for a single column

The advantage of this method is that we can conditionally replace NaN values with it. The following is an example where NaN values in df are replaced by 10 if the condition cond is satisfied.

cond = pd.DataFrame({'a': [True, True, False], 'b':[False, True, True]})
df = df.mask(df.isna() & cond, 10)
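
A self-contained sketch of both calls on an all-numeric frame. The answer's frame mixes numbers and strings; numeric data is assumed here only to keep the results easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame and condition frame.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [np.nan, 2.0, 3.0]})
cond = pd.DataFrame({"a": [True, True, False], "b": [False, True, True]})

everywhere = df.where(df.notna(), 10)         # keep non-NaN values, replace the rest with 10
conditional = df.mask(df.isna() & cond, 10)   # replace NaN only where cond is also True

print(everywhere)
print(conditional)
```

In the conditional result, the NaN cells where cond is False are left untouched.
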

Under the hood, fillna() calls where() (source), which in turn calls numpy.where() if the dataframe is small and numexpr.evaluate if it's large (source). So fillna/mask/where are essentially the same method for the purposes of replacing NaN values. On the other hand, replace() (another method given on this page) is a numpy.putmask operation (source). Because numexpr is faster than numpy for large arrays, replace may be outperformed by the other methods on very large dataframes.


On a tangential note, it's common for a dataframe to have a literal string 'NaN' instead of an actual NaN value. To make sure that a dataframe indeed has NaN values, check with df.isna().any(). If it returns False, when it should contain NaN, then you probably have 'NaN' strings, in which case, use replace to convert them into NaN or, even better, replace with the value you're meant to replace it with. For example:

df = pd.DataFrame({'a': ['a', 'b', 'NaN']})
df = df.replace('NaN', 'c')
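
A brief sketch of the check described above, on the same one-column frame: isna() does not flag the literal string, and replace can turn it into a real NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", "NaN"]})
has_real_nan = bool(df["a"].isna().any())   # 'NaN' here is an ordinary string, not a missing value

# Convert the literal string into a real missing value.
as_missing = df.replace("NaN", np.nan)

print(has_real_nan)
print(as_missing["a"].isna().tolist())
```
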
满栀 2025-02-17 13:35:20

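Using a lambda expression, it is also possible to replace NaN with 0.

Below is an example:

import pandas as pd

# Test each element with pd.isnull(x); dss2['Score'].isnull without parentheses is a bound method and is always truthy.
dss3 = dss2['Score'].apply(lambda x: 0 if pd.isnull(x) else x)
print(dss3)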