XGBoost on GPU gives a memory allocation error for a tiny memory request

Posted 2025-02-10 23:20:44

I'm attempting to run XGBoost 1.6.1 through the Python scikit-learn interface on an NVIDIA Tesla K80 GPU using CUDA Toolkit 11.3. The following example code throws a memory allocation error, even though the GPU has plenty of memory available.

from xgboost import XGBClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=0)
classifier = XGBClassifier(tree_method='gpu_hist')
model = classifier.fit(X_train, y_train)

And the traceback (relative paths given for privacy reasons):

Traceback (most recent call last):
  File "~/.config/JetBrains/PyCharmCE2022.1/scratches/scratch.py", line 11, in <module>
    model = classifier.fit(X_train, y_train)
  File "~/lib/python3.8/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "~/lib/python3.8/site-packages/xgboost/sklearn.py", line 1400, in fit
    self._Booster = train(
  File "~/lib/python3.8/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "~/lib/python3.8/site-packages/xgboost/training.py", line 181, in train
    bst.update(dtrain, i, obj)
  File "~/lib/python3.8/site-packages/xgboost/core.py", line 1733, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
  File "~/lib/python3.8/site-packages/xgboost/core.py", line 203, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [07:31:30] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: [07:31:30] ../src/c_api/../data/../common/common.h:46: ../src/common/device_helpers.cuh: 447: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
Stack trace:
  [bt] (0) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x38f939) [0x7f363eb54939]
  [bt] (1) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3938d3) [0x7f363eb588d3]
  [bt] (2) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3d39ae) [0x7f363eb989ae]
  [bt] (3) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3e7914) [0x7f363ebac914]
  [bt] (4) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3e9790) [0x7f363ebae790]
  [bt] (5) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x57d309) [0x7f363ed42309]
  [bt] (6) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x20fca8) [0x7f363e9d4ca8]
  [bt] (7) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7f363e86e688]
  [bt] (8) /lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f36797b08ee]


- Free memory: 11841830912
- Requested memory: 496

Stack trace:
  [bt] (0) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x38f939) [0x7f363eb54939]
  [bt] (1) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x393d4b) [0x7f363eb58d4b]
  [bt] (2) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3d3ae9) [0x7f363eb98ae9]
  [bt] (3) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3e7914) [0x7f363ebac914]
  [bt] (4) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x3e9790) [0x7f363ebae790]
  [bt] (5) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x57d309) [0x7f363ed42309]
  [bt] (6) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x20fca8) [0x7f363e9d4ca8]
  [bt] (7) ~/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7f363e86e688]
  [bt] (8) /lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f36797b08ee]



Process finished with exit code 1

As you can see, the free memory far exceeds the requested memory. Running the script under compute-sanitizer produces a huge amount of output, but I think the relevant part is extracted here:

> cat sanitizer.out | grep Program
========= Program hit invalid device function (error 98) on CUDA API call to cudaFuncGetAttributes.
========= Program hit invalid device function (error 98) on CUDA API call to cudaGetLastError.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaLaunchKernel.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaPeekAtLastError.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaPeekAtLastError.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaGetLastError.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaLaunchKernel.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaGetLastError.

Any clues please?
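
Editor's note: the grep output above repeats only two distinct CUDA errors, 98 (cudaErrorInvalidDeviceFunction) and 209 (cudaErrorNoKernelImageForDevice, the same name that appears in the traceback). Neither is an out-of-memory code. A minimal sketch for collapsing such output into per-error counts follows; the function name `summarize_hits` and the regular expression are illustrative assumptions, not part of compute-sanitizer, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches lines of the form:
# ========= Program hit <message> (error <code>) on CUDA API call to <api>.
HIT = re.compile(
    r"Program hit (?P<msg>.+?) \(error (?P<code>\d+)\) "
    r"on CUDA API call to (?P<api>\w+)"
)


def summarize_hits(lines):
    """Count 'Program hit' lines per (error code, message) pair."""
    counts = Counter()
    for line in lines:
        match = HIT.search(line)
        if match:
            counts[(int(match["code"]), match["msg"])] += 1
    return counts


# Three lines taken from the sanitizer output in the question.
SAMPLE = """\
========= Program hit invalid device function (error 98) on CUDA API call to cudaFuncGetAttributes.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaLaunchKernel.
========= Program hit no kernel image is available for execution on the device (error 209) on CUDA API call to cudaPeekAtLastError.
"""

for (code, msg), n in sorted(summarize_hits(SAMPLE.splitlines()).items()):
    print(f"error {code}: {msg} (x{n})")
```

To run it on the real log, pass the lines of sanitizer.out instead of SAMPLE.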

Comments (1)

爱你不解释 2025-02-17 23:20:44

From the XGBoost discussion list

@hcho3
You will need to install XGBoost from source if you intend to use an old card. The usual installation method (pip) only supports compute capability 5.2 and up. Let me update the install doc to indicate this.
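
Editor's note: the Tesla K80 has compute capability 3.7, which is below the 5.2 threshold quoted above. That would explain why the prebuilt library has no kernel image for the device, and why the failure surfaces inside the allocation helper even though memory is plentiful. The following is a rough sketch, not an official XGBoost check, for reading the compute capability through the CUDA driver API with ctypes. It assumes a Linux system where the driver library is named libcuda.so.1; the helper names `query_compute_capability` and `wheel_supports` are hypothetical, and the 5.2 minimum is taken from the answer and may differ between XGBoost releases.

```python
import ctypes

# Minimum compute capability of the prebuilt pip wheels according to
# the answer above. Assumption: this may change between releases.
MIN_PREBUILT_CC = (5, 2)

# Attribute IDs from the CUDA driver API (CUdevice_attribute enum).
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76


def query_compute_capability(device_index=0):
    """Return (major, minor) for a GPU, or None if it cannot be queried."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")  # Linux driver library name
    except OSError:
        return None  # no NVIDIA driver found
    if libcuda.cuInit(0) != 0:
        return None
    device = ctypes.c_int()
    if libcuda.cuDeviceGet(ctypes.byref(device), device_index) != 0:
        return None
    values = []
    for attr in (CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                 CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR):
        out = ctypes.c_int()
        if libcuda.cuDeviceGetAttribute(ctypes.byref(out), attr, device) != 0:
            return None
        values.append(out.value)
    return tuple(values)


def wheel_supports(cc, minimum=MIN_PREBUILT_CC):
    """True if compute capability cc meets the assumed wheel minimum."""
    return tuple(cc) >= tuple(minimum)


cc = query_compute_capability()
if cc is None:
    print("Could not query a CUDA device.")
elif wheel_supports(cc):
    print(f"Compute capability {cc[0]}.{cc[1]}: prebuilt wheel should work.")
else:
    print(f"Compute capability {cc[0]}.{cc[1]}: build XGBoost from source.")
```

On a K80 the query should report 3.7, so the last branch applies. If building from source is not an option, training on the CPU with tree_method='hist' avoids the GPU kernels entirely.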
