Dask memory leak workaround

When using the Dask DataFrame where() call, I get a "distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS" warning. This repeats until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I'm reading can be found at https://lcb.app.box.com/s/e89t59s0yb558tjoncjsid710oirqbgd?page=1. You have to read it in with pandas and save it as a Parquet file for Dask to read it.
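
For context, that pandas-to-Parquet conversion is a one-time step along these lines (a minimal sketch: the raw file name and CSV format are assumptions, since the actual download format isn't shown here):

import pandas as pd
from pathlib import Path

file_path = Path('../cannabis_data_science/wa/nov_2021')

# One-time conversion: load the raw download with pandas, then write Parquet
# (requires pyarrow or fastparquet) so Dask can read it.
# 'SaleItems_1.csv' is a hypothetical name; adjust to the actual download.
raw_df = pd.read_csv(file_path / 'SaleItems_1.csv', low_memory=False)
raw_df.to_parquet(file_path / 'SaleItems_1.parquet')

The Dask code that then produces the warning: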

from pathlib import Path

from dask import dataframe as dd
from dask.distributed import Client

file_path = Path('../cannabis_data_science/wa/nov_2021')

# Two workers with two threads each, capped at 15 GB of memory per worker.
client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')

sale_items_df = dd.read_parquet(file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()

# This triggers the warning: keep `description` where it is NaN, otherwise
# take `name`, then pull the whole result into local memory.
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()
