Replacing an existing column in dask map_partitions gives SettingWithCopyWarning

Posted on 2025-01-30 17:10:41

I'm replacing the column id2 in a dask dataframe using map_partitions. The values are replaced, but pandas emits a warning.

What is this warning, and how do I apply the .loc suggestion in the example below?

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'dummy2': [10, 10, 10, 20, 20, 15, 10, 30, 20, 26],
    'id2': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    'balance2': [150, 140, 130, 280, 260, 150, 140, 130, 280, 260]
})

ddf = dd.from_pandas(pdf, npartitions=3) 

def func2(df):
    df['id2'] = df['balance2'] + 1
    return df

ddf = ddf.map_partitions(func2)

ddf.compute()

C:\Users\xxxxxx\AppData\Local\Temp\ipykernel_30076\248155462.py:2:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['id2'] = df['balance2'] + 1

Comments (1)

夏九 2025-02-06 17:10:41

A quick fix is to make a copy of the dataframe inside the function:

def func2(df):
    df = df.copy() # will make a copy of the dataframe
    df['id2'] = df['balance2'] + 1
    return df

However, as I understand it, copying the dataframe is not required: because dask dataframes are evaluated lazily, the changes are not propagated back to the partitions of the original dask dataframe.

Update: there is a related question that explains the reason for using .copy in pandas. In the snippet below, applying the function modifies the original pandas dataframe, which might be undesirable:

from pandas import DataFrame

def addcol(df):
    df['a'] = 1
    return df

df = DataFrame()

df1 = addcol(df)
# without .copy, df is also modified, which might be undesirable
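
To make the difference concrete, here is a minimal pandas-only sketch that runs both variants side by side. The helper name addcol_copy and the sample column x are made up for illustration and are not part of the original snippet.

```python
import pandas as pd

def addcol(df):
    # Mutates the frame that was passed in
    df['a'] = 1
    return df

def addcol_copy(df):
    # Works on a private copy, so the caller's frame stays as it was
    df = df.copy()
    df['a'] = 1
    return df

untouched = pd.DataFrame({'x': [1, 2, 3]})
copied = addcol_copy(untouched)

mutated = pd.DataFrame({'x': [1, 2, 3]})
same = addcol(mutated)

print(list(untouched.columns))  # ['x']
print(list(copied.columns))     # ['x', 'a']
print(list(mutated.columns))    # ['x', 'a']
print(same is mutated)          # True
```

Without the copy, the function returns the very same object it received, which is why the caller's dataframe gains the new column.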

In the context of dask, this warning is just that, a warning, so .copy is not needed.

from dask.dataframe import from_pandas
ddf = from_pandas(df, npartitions=1)
ddf1 = ddf.map_partitions(addcol)
# will show warning, but original ddf is not modified
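
On the .loc part of the question: as far as I can tell, rewriting the assignment with .loc inside func2 would not silence the warning, because pandas is complaining that the partition was derived from another dataframe, not about the indexing syntax. If you would rather avoid both the warning and an explicit .copy, one option is DataFrame.assign, which returns a new dataframe. The sketch below is pandas-only; the name func2_assign is made up, and the row slice is only an assumed stand-in for what a dask partition looks like.

```python
import pandas as pd

pdf = pd.DataFrame({
    'dummy2': [10, 10, 10, 20, 20, 15, 10, 30, 20, 26],
    'id2': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    'balance2': [150, 140, 130, 280, 260, 150, 140, 130, 280, 260],
})

def func2_assign(df):
    # assign() builds and returns a new DataFrame; the input is not modified
    return df.assign(id2=df['balance2'] + 1)

# Assumption: a row slice stands in for a single dask partition
partition = pdf.iloc[0:4]
out = func2_assign(partition)

print(out['id2'].tolist())       # [151, 141, 131, 281]
print(pdf['id2'].tolist()[:4])   # [1, 1, 1, 2]
```

With dask, the same function can be passed to ddf.map_partitions(func2_assign). For a simple column expression like this one, ddf.assign(id2=ddf['balance2'] + 1) on the dask dataframe itself should also work, without map_partitions at all.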