Dask runs out of memory (16GB) when using apply

Posted on 2025-01-15 21:39:22

I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5GB+ in total CSV size.

Total CSV size: 3.5GB+
Total RAM size: 16GB
Library used: Dask
Shape of combined df: 6 million rows and 57 columns
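
The post doesn't show how the six CSVs are combined into one Dask DataFrame; a minimal sketch of one common way to do it (the file pattern and blocksize below are assumptions, not taken from the question):

import dask.dataframe as dd

# Hypothetical file pattern; blocksize caps how much of each CSV goes into a single partition,
# and reading everything as strings avoids mixed-type columns during string cleanup.
df = dd.read_csv("data/source_*.csv", blocksize="64MB", dtype=str)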

I have a method that just eliminates unwanted characters from essential columns like:

import re
import pandas as pd

def stripper(x):
    try:
        # Skip floats (NaN) and pandas NA values; otherwise drop non-word characters and uppercase.
        if type(x) != float and type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x

And I am applying the above method to certain columns as:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)

I am also filling null values of one column with values from another column:

df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])

These are the two operations I need to perform, and afterwards I just call .head() to get values (since Dask uses lazy evaluation).

temp_df = df.head(10000)

But when I do this, it keeps eating RAM until my total 16GB of RAM goes to zero and the kernel dies.

How can I solve this issue? Any help would be appreciated.

Comments (2)

七度光 2025-01-22 21:39:22

I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)

只是一片海 2025-01-22 21:39:22

To expand on @richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)

ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()

It would also be good to know how many partitions you've split the Dask DataFrame into: each partition should fit comfortably in memory (i.e. < 1GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
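
A minimal sketch of how one might inspect and adjust the partitioning; the DataFrame here is a toy stand-in, and with real CSVs the blocksize argument of dd.read_csv plays the same role:

import dask.dataframe as dd
import pandas as pd

# Toy DataFrame split into 4 partitions; with real data, dd.read_csv(..., blocksize="64MB")
# would control how large each partition is at read time.
ddf = dd.from_pandas(pd.DataFrame({"a": range(1000)}), npartitions=4)

print(ddf.npartitions)                     # how many partitions the DataFrame currently has
print(ddf.map_partitions(len).compute())   # number of rows in each partition

# Rebalance toward a target in-memory size per partition if they are too large or too small.
ddf = ddf.repartition(partition_size="100MB")
print(ddf.npartitions)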
