Dealing with a huge pandas dataframe
I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39,705,210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to csv in 20 partitions like this:
import dask.dataframe as dd
dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20) # Dask DataFrame has 20 partitions
dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
#merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})
However, when I try to drop some of the rows, e.g. by doing:
merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)
the kernel suddenly stops. So my questions are:
- is there a way to drop the desired rows using Dask (or another way that prevents the kernel from crashing)?
- do you know a way to lighten the dataset or deal with it in Python (e.g. doing some basic descriptive statistics in parallel) other than dropping observations?
- do you know a way to export the pandas dataframe as a csv in parallel without saving the n partitions separately (as done by Dask)?
Thank you
1 Answer
Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in pandas, it's possible to assign the result to the same dataframe:
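A minimal sketch of that assignment, reusing the Dask dataframe and column names from the question (dask_merge_bodytextknown5, confscore):

# boolean mask marking the rows to drop
mask = dask_merge_bodytextknown5['confscore'] == 3
# Dask has no inplace=True: assign the filtered (still lazy) result back to the same name
dask_merge_bodytextknown5 = dask_merge_bodytextknown5[~mask]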
If there are multiple conditions, mask can be redefined, for example to test two values:
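For instance, a sketch that drops two confscore values at once; the second value (4) is purely a made-up illustration:

# drop rows whose confscore is either of two values (4 is hypothetical)
mask = (dask_merge_bodytextknown5['confscore'] == 3) | (dask_merge_bodytextknown5['confscore'] == 4)
dask_merge_bodytextknown5 = dask_merge_bodytextknown5[~mask]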
Dask will keep track of the operations but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:
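A sketch matching the export the question already uses; note that none of the lazy operations above actually run until this call:

# to_csv triggers the (so far lazy) computation and writes one csv per partition
dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
# if a single output file is required, to_csv also accepts single_file=True,
# at the cost of writing sequentially:
# dask_merge_bodytextknown5.to_csv('df_complete_emakg.csv', single_file=True)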