Workaround for a Dask memory leak
When using the Dask dataframe where clause, I get a "distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS" warning. This happens until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I'm reading can be found at https://lcb.app.box.com/s/e89t59s0yb558tjoncjsid710oirqbgd?page=1. You have to read it in with pandas and save it as a Parquet file for Dask to read it.
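For reference, the pandas-to-Parquet conversion step mentioned above looks roughly like this (a minimal sketch; the CSV filename, and the assumption that the Box download is a CSV, are mine):

import pandas as pd

# One-time conversion: read the Box download with pandas and write it out as
# Parquet so Dask can read it. 'SaleItems_1.csv' is a placeholder filename.
raw_df = pd.read_csv('SaleItems_1.csv', low_memory=False)
raw_df.to_parquet('../cannabis_data_science/wa/nov_2021/SaleItems_1.parquet')

The Dask code that reproduces the warning is then: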
from dask import dataframe as dd
import dask.array as da
from dask.distributed import Client
from pathlib import Path
import os

file_path = Path('../cannabis_data_science/wa/nov_2021')

# Local cluster: two workers, two threads each, 15 GB memory limit per worker
client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')
client

# Read the Parquet file written by pandas and reset the index
sale_items_df = dd.read_parquet(path=file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()

# This line triggers the unmanaged-memory warning
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()