Out of memory in dask_cudf

Posted on 2025-02-10 07:43:01


I've been trying for quite some time to solve memory management issues with dask_cudf in my recent project, but it seems I'm missing something and I need your help. I am working on a Tesla T4 GPU with 15 GiB of memory. I have several ETL steps, but the GPU recently seems to be failing on most of them (most are just filtering or transformation steps, but a few involve shuffling). My data consists of around 20 parquet files of roughly 500 MB each. For this specific question I will provide the piece of code I use for filtering, which makes the GPU fail due to lack of memory.

I start by setting up a CUDA cluster:

import os

from dask.distributed import Client
from dask.utils import parse_bytes
from dask_cuda import LocalCUDACluster

CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0")

cluster = LocalCUDACluster(
    # rmm_pool_size=get_rmm_size(0.6 * device_mem_size()),
    CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
    local_directory=os.path.join(WORKING_DIR, "dask-space"),
    device_memory_limit=parse_bytes("12GB"),
)
client = Client(cluster)
client

Depending on whether I provide the rmm_pool_size parameter, the error differs. When the parameter is provided I get that the maximum pool limit is exceeded; otherwise I get the following error:
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
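
For reference, here is a minimal sketch of an alternative cluster setup with an explicit RMM pool and host spilling enabled. The sizes and the jit_unspill flag are assumptions on my part for a 15 GiB T4 (and depend on the dask_cuda version), not values I have validated:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Sketch only: the sizes below are guesses for a 15 GiB T4.
cluster = LocalCUDACluster(
    rmm_pool_size="10GB",          # explicit RMM pool, with headroom below 15 GiB
    device_memory_limit="8GB",     # start spilling device buffers to host early
    jit_unspill=True,              # unspill buffers lazily, only when accessed
    local_directory="dask-space",  # host-side spill directory (placeholder path)
)
client = Client(cluster)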

Next, I create a filtering operation I intend to perform on the data (which involves checking whether a value in a column appears in a set containing around 80,000 values):

def remove_invalid_values_filter_factory(valid_value_set_or_series):
    # Returns a filter that keeps only the rows whose 'col' value
    # appears in the given collection of valid values
    def f(df):
        mask = df['col'].isin(valid_value_set_or_series)
        return df.loc[mask]
    return f

import pandas as pd

# Load valid values from another file
valid_values_info_df = pd.read_csv(...)
# The series is around 1 MiB in size
keep_known_values_only = remove_invalid_values_filter_factory(valid_values_info_df['values'])
# Tried both and both cause the error
# keep_known_values_only = remove_invalid_values_filter_factory(set(valid_values_info_df['values']))
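
One variant I can think of (not shown in the runs above; the path below is a placeholder) is to move the valid values to the GPU once as a cudf.Series, so each partition's isin compares against device-resident data instead of a host pandas Series serialized into every task:

import cudf
import pandas as pd

# Sketch: put the ~80,000 valid values on the GPU once; the path and
# the column name 'values' are placeholders from the snippet above.
valid_values_info_df = pd.read_csv("valid_values.csv")
valid_values_gpu = cudf.Series(valid_values_info_df['values'].unique())

keep_known_values_only = remove_invalid_values_filter_factory(valid_values_gpu)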

Finally I apply this filter operation on the data and get the error:

%%time
# Error occurs during this processing step
keep_known_values_only(
    dask_cudf.read_parquet(...)
).to_parquet(...)
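
For completeness, the read step itself may matter too. Here is a sketch of the same step with smaller input partitions to lower the peak per-task device memory (whether read_parquet accepts blocksize, and its exact spelling, depends on the dask version, so treat this as an assumption):

import dask_cudf

# Placeholder paths; smaller partitions mean each task holds less
# decompressed data on the GPU at once.
ddf = dask_cudf.read_parquet("input/*.parquet", blocksize="128MiB")
keep_known_values_only(ddf).to_parquet("output/")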

I feel totally lost. Most sources I came across hit this error as a result of using cuDF without Dask, or without setting up a CUDA cluster, but I have both. Moreover, the filtering operation intuitively shouldn't be memory-expensive, so I have no clue what to do. I assume there is something wrong with how I set up the cluster, and that fixing it would hopefully make the rest of the more memory-expensive operations work as well.

I would be grateful for your help, thanks!


Comments (1)

桜花祭 2025-02-17 07:43:01


I'd use dask-sql for this, to take advantage of its ability to do out-of-core processing.
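
As a rough illustration (the table names, paths, and the quoted "values" column below are placeholders, not taken from your code), your filter could be expressed as a join in dask-sql:

from dask_sql import Context
import dask_cudf

# Placeholder names and paths; assumes the valid values are unique,
# otherwise the inner join would duplicate matching rows.
c = Context()
c.create_table("events", dask_cudf.read_parquet("input/*.parquet"))
c.create_table("valid", dask_cudf.read_csv("valid_values.csv"))

result = c.sql("""
    SELECT e.*
    FROM events AS e
    INNER JOIN valid AS v
        ON e.col = v."values"
""")
result.to_parquet("output/")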

As for the dask_cudf functions failing, please open an issue in the cuDF repo with a minimal reproducible example! We'd appreciate it! :)

You may not want to combine dask_cudf and RMM unless you really have to and know what you're doing (that's RAPIDS power-user territory, for when you need to truly maximize the GPU memory available to an algorithm). If your use case calls for it, it can really help, but it doesn't seem to here, since you're reading parquet files, which is why I'm not deep-diving into it.
