Dask memory leak workaround

When using the Dask DataFrame where() call, I get a "distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS" warning. This repeats until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I'm reading can be found at https://lcb.app.box.com/s/e89t59s0yb558tjoncjsid710oirqbgd?page=1. You have to read it in with pandas and save it as a Parquet file for Dask to read it.
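
For context, that pandas-to-Parquet conversion is a one-time step along these lines (a minimal sketch: the raw file name and CSV format are assumptions, since the actual download format isn't shown here):

import pandas as pd
from pathlib import Path

file_path = Path('../cannabis_data_science/wa/nov_2021')

# One-time conversion: load the raw download with pandas, then write Parquet
# (requires pyarrow or fastparquet) so Dask can read it.
# 'SaleItems_1.csv' is a hypothetical name; adjust to the actual download.
raw_df = pd.read_csv(file_path / 'SaleItems_1.csv', low_memory=False)
raw_df.to_parquet(file_path / 'SaleItems_1.parquet')

The Dask code that then produces the warning: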

from pathlib import Path

from dask import dataframe as dd
from dask.distributed import Client

file_path = Path('../cannabis_data_science/wa/nov_2021')

# Two workers with two threads each, capped at 15 GB of memory per worker.
client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')

sale_items_df = dd.read_parquet(file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()

# This triggers the warning: keep `description` where it is NaN, otherwise
# take `name`, then pull the whole result into local memory.
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()
