Dask running out of memory (16GB) when using apply
I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5GB+ in total (combined CSV size).
**Total CSV size:** 3.5GB+
**Total RAM size:** 16GB
**Library used:** Dask
**Shape of combined DataFrame:** 6 million rows and 57 columns
I have a method that just eliminates unwanted characters from essential columns like:
import re
import pandas as pd

def stripper(x):
    try:
        # Only clean string-like values; skip floats and pandas NA markers
        if type(x) != float and type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()
    except Exception as ex:
        pass
    return x
And I am applying the above method to certain columns as:
df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)
And I am also filling null values of a column with the values from another column as:
df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])
These are the two operations I need to perform, and after these I am just doing .head() to get the values (as Dask works on lazy evaluation).
temp_df = df.head(10000)
But when I do this, it keeps eating RAM, my total 16 GB of RAM goes to zero, and the kernel dies.
How can I solve this issue? Any help would be appreciated.
2 Answers
I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:
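A rough sketch of that idea, using the column names from the question (this assumes pandas-style .str methods, which Dask largely mirrors; the regex and upper-casing are copied from the original stripper, and missing values pass through the .str accessor as NaN):

# Sketch: vectorized per-column cleanup instead of a row-wise apply
# (assumes the .str accessor behaves as in pandas; non-string values come back as NaN)
df["MatchCounty"] = df["County"].str.replace(r"[^\w]+", "", regex=True).str.upper()
df["MatchZip"] = df["Zip"].str.replace(r"[^\w]+", "", regex=True).str.upper()

The same pattern would apply to the SourceOwnerId and SourceKey columns.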
To expand on @richardec's solution, in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example, see the sketch below.

It would also be good to know how many partitions you've split the Dask DataFrame into -- each partition should fit comfortably in memory (i.e. < 1GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
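A minimal sketch of what that could look like, assuming df is the combined Dask DataFrame from the question and that Dask's replace and .str accessors follow the pandas signatures (column names are taken from the question; the repartition size is only an illustrative value, not from the answer):

# Sketch only: regex-replace non-word characters, then upper-case, evaluated lazily
# (non-string values come back as NaN from the .str accessor)
cols = {"County": "MatchCounty", "Zip": "MatchZip",
        "SourceOwnerId": "SourceOwnerId", "SourceKey": "SourceKey"}
for src, dst in cols.items():
    df[dst] = df[src].replace(r"[^\w]+", "", regex=True).str.upper()

# Partition check mentioned above: keep each partition comfortably below ~1GB
print(df.npartitions)
# df = df.repartition(partition_size="100MB")  # illustrative target size only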