Pandas - MemoryError: Unable to allocate 220. MiB

Posted on 2025-01-19 15:18:29

I have a DataFrame of orders indexed by order date, which I set like this:

df = df.set_index('ORDER_ENTRY_DATE', drop=False)

In the code below I create a new feature containing the total amount that a given customer has successfully paid in the last 8 weeks (excluding the current order):

df["LAST_8_WEEKS_SUCCESSFUL"] = (
    df["PAYMENT_SUCCESSFUL"].mul(df["TOTAL_AMOUNT"])
    .groupby(df["CUST_NO"])
    .transform(lambda x: x.rolling(window='56D', min_periods=1).sum().shift())
    .fillna(0)
)
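One thing worth checking in the snippet above, independent of memory: `.shift()` moves each row's result down by one row, so the 56-day window ends at the previous order's date rather than at the current one. A minimal sketch on made-up data (the values are hypothetical, only the column names come from the question) compares that pattern with a window that is closed on the left, which is one possible way to exclude the current order:

```python
import pandas as pd

# Toy data; the column names follow the question, the values are made up.
orders = pd.DataFrame({
    "ORDER_ENTRY_DATE": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-10", "2024-06-01"]),
    "CUST_NO": [1, 2, 1, 1],
    "PAYMENT_SUCCESSFUL": [1, 1, 0, 1],
    "TOTAL_AMOUNT": [100.0, 70.0, 40.0, 25.0],
}).set_index("ORDER_ENTRY_DATE", drop=False)

paid = orders["PAYMENT_SUCCESSFUL"].mul(orders["TOTAL_AMOUNT"])

# Pattern from the question: rolling sum, then shift down by one row.
shifted = (
    paid.groupby(orders["CUST_NO"])
    .transform(lambda x: x.rolling("56D", min_periods=1).sum().shift())
    .fillna(0)
)

# Variant: window [t - 56D, t), anchored at the current order's own date.
left_closed = (
    paid.groupby(orders["CUST_NO"])
    .transform(lambda x: x.rolling("56D", min_periods=1, closed="left").sum())
    .fillna(0)
)

print(pd.DataFrame({"shifted": shifted, "left_closed": left_closed}))
```

On this toy data the 2024-06-01 order of customer 1 gets 100 from the shifted version, although that payment is five months old, and 0 from the left-closed version. Whether that difference matters depends on what the feature is supposed to mean.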

I have tested this code on a smaller version of my dataset and it works fine, but when I run it on the full 28-million-row dataset, I get a memory error:

MemoryError: Unable to allocate 220. MiB for an array with shape (28879273,) and data type int64
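For context, the size in the message matches one 8-byte value per row, so it looks like a single full-length int64 array. The quick check below only reproduces that arithmetic; it does not show which step inside pandas asked for the array.

```python
import numpy as np

n_rows = 28_879_273                       # shape reported in the error
bytes_needed = n_rows * np.dtype("int64").itemsize
mib = bytes_needed / 2**20                # 1 MiB = 2**20 bytes

print(f"{bytes_needed} bytes = {mib:.1f} MiB")
```

Read this way, the error probably means the process could not get one more block of that size on top of what was already allocated, rather than that 220 MiB is large by itself.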

Is there another way to accomplish this without needing the extra 220 MiB of RAM? Is my code simply too inefficient?
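One direction that might help, sketched under the assumption that the numeric columns hold values that fit into narrower dtypes: shrink the frame before computing the feature so that each temporary array is smaller. The helper `downcast_numeric` and the toy frame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> None:
    """Narrow integer and float columns in place where the values fit."""
    for col in df.select_dtypes(include=[np.integer]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=[np.floating]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

# Toy frame with made-up values, all stored in 8-byte dtypes to begin with.
toy = pd.DataFrame({
    "CUST_NO": np.arange(1_000, dtype="int64"),
    "PAYMENT_SUCCESSFUL": np.tile([1, 0], 500).astype("int64"),
    "TOTAL_AMOUNT": np.arange(1_000) * 0.5,
})

before = toy.memory_usage(deep=True).sum()
downcast_numeric(toy)
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Two caveats: float32 keeps only about seven significant digits, which may be too coarse for monetary amounts, and `set_index(..., drop=False)` keeps the dates both as the index and as a column, which at 8 bytes per row is roughly another 220 MiB on a frame of this size.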
