DataFrame to Panel via a non-unique column index in Pandas
The following code should do what I want, but it uses 10 GB of RAM by the time the loop is 20% done.
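import pandas
from numpy import unique

# In [4]: type(pd)
# Out[4]: pandas.sparse.frame.SparseDataFrame
# Note: pd here is the SparseDataFrame itself, not the pandas module.
memid = unique(pd.Member)
pan = {}
for mem in memid:
    # select the rows that belong to this member
    pan[mem] = pd[pd.Member == mem]
goal = pandas.Panel(pan)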
Comments (1)
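I created a GitHub issue here:
https://github.com/wesm/pandas/issues/663
I'm pretty sure I identified a circular reference between NumPy ndarray views that was causing a memory leak. I just committed a fix:
https://github.com/wesm/pandas/commit/4c3916310a86c3e4dab6d30858a984a6f4a64103
Can you install from source and let me know whether that fixes your problem?
By the way, you might try using SparsePanel instead of Panel, because Panel converts all of the sub-DataFrames to dense form.
Lastly, you might consider using groupby as an alternative to the O(N * M) chopping-up of the SparseDataFrame. It's even shorter:
pan = dict(pd.groupby('Member'))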