Handling a huge pandas DataFrame

Posted 2025-02-11 12:43:20


I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to CSV in 20 partitions like this:

import dask.dataframe as dd
dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20)  # Dask DataFrame has 20 partitions

dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
#merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})

However, when I try to drop some of the rows, e.g. by doing:

merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)

the kernel suddenly stops. So my questions are:

  1. is there a way to drop the desired rows using Dask (or another way that prevents the kernel from crashing)?
  2. do you know a way to lighten the dataset, or to work with it in Python (e.g. computing some basic descriptive statistics in parallel), other than dropping observations?
  3. do you know a way to export the pandas DataFrame as a CSV in parallel without saving the n partitions separately (as Dask does)?

Thank you

Comments (1)

迷荒 2025-02-18 12:43:20

Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in Pandas, it's possible to assign the result to the same dataframe:

df = merge_bodytextknown5  # this line is for easier readability
mask = df['confscore'] != 3  # note the inversion of the requirement

df = df[mask]

If there are multiple conditions, mask can be redefined, for example to test two values:

mask = ~df['confscore'].isin([3,4])
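
If the conditions span different columns, boolean masks can also be combined element-wise with & and |, just as in pandas. A minimal sketch (the second column name, pubyear, is purely hypothetical):

mask = (df['confscore'] != 3) & (df['pubyear'] >= 2000)  # 'pubyear' is a hypothetical column, for illustration only
df = df[mask]  # still lazy; no computation is triggered yet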

Dask will keep track of the operations, but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:

df.to_csv('test.csv', index=False, single_file=True)  # this saves to a single file

df.to_csv('test_*.csv', index=False)  # this saves one file per Dask DataFrame partition
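
Regarding the second question, the dataset does not have to pass through pandas at all: the partitioned CSVs can be read back lazily and summarised in parallel, with nothing materialised until .compute() is called. A sketch only, assuming the data is available as CSV files matching the pattern from the question:

import dask.dataframe as dd

df = dd.read_csv('df_complete_emakg_*.csv')  # lazy; reads the partitions in parallel
df = df[df['confscore'] != 3]  # lazy filter, as above

summary = df.describe().compute()  # basic descriptive statistics, computed in parallel
counts = df['confscore'].value_counts().compute()  # only here does Dask actually run the work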