Dask runs out of memory (16GB) when using apply

Posted on 2025-01-15 21:39:22

I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5GB+ in total CSV size.

Total CSV size: 3.5GB+
Total RAM size: 16GB
Library used: Dask
Shape of combined df: 6 million rows and 57 columns
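
The post doesn't show how the six CSVs are combined into one Dask DataFrame; a minimal sketch of one common way to do it (the file pattern and blocksize below are assumptions, not taken from the question):

import dask.dataframe as dd

# Hypothetical file pattern; blocksize caps how much of each CSV goes into a single partition,
# and reading everything as strings avoids mixed-type columns during string cleanup.
df = dd.read_csv("data/source_*.csv", blocksize="64MB", dtype=str)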

I have a method that just eliminates unwanted characters from essential columns like:

import re
import pandas as pd

def stripper(x):
    try:
        # Skip floats (NaN) and pandas NA values; otherwise drop non-word characters and uppercase.
        if type(x) != float and type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x

And I am applying the above method to certain columns as:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)

I am also filling null values of one column with values from another column:

df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])

These are the two operations I need to perform, and afterwards I just call .head() to get values (since Dask uses lazy evaluation).

temp_df = df.head(10000)

But when I do this, it keeps eating RAM until my total 16GB of RAM goes to zero and the kernel dies.

How can I solve this issue? Any help would be appreciated.

Comments (2)

七度光 2025-01-22 21:39:22

I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)

只是一片海 2025-01-22 21:39:22

To expand on @richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)

ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()

It would also be good to know how many partitions you've split the Dask DataFrame into: each partition should fit comfortably in memory (i.e. < 1GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
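
A minimal sketch of how one might inspect and adjust the partitioning; the DataFrame here is a toy stand-in, and with real CSVs the blocksize argument of dd.read_csv plays the same role:

import dask.dataframe as dd
import pandas as pd

# Toy DataFrame split into 4 partitions; with real data, dd.read_csv(..., blocksize="64MB")
# would control how large each partition is at read time.
ddf = dd.from_pandas(pd.DataFrame({"a": range(1000)}), npartitions=4)

print(ddf.npartitions)                     # how many partitions the DataFrame currently has
print(ddf.map_partitions(len).compute())   # number of rows in each partition

# Rebalance toward a target in-memory size per partition if they are too large or too small.
ddf = ddf.repartition(partition_size="100MB")
print(ddf.npartitions)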
