NVIDIA driver not found after stopping and starting a Deep Learning VM

Posted on 2025-01-17 00:03:42

[TL;DR] First, wait a few minutes and check whether the NVIDIA driver starts working. If it does not, stop and start the VM instance again.

I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, running nvidia-smi gives the following error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

But if I run which nvidia-smi, I get

/usr/bin/nvidia-smi

It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? The first time I created and started the instance, the driver was installed automatically.

The system information (from uname -m && cat /etc/*release) is:

x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

I tried the GPU driver installation script from GCP. First I ran

curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py

and then ran

sudo python3 install_gpu_driver.py

which gives the following message:

Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
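
The output above suggests the script only checks that the nvidia-smi binary is on the PATH, which says nothing about whether the kernel module is actually loaded. As a rough way to tell the two apart, here is a minimal Python sketch. It assumes a Linux system where /proc/modules lists the loaded kernel modules; the function names loaded_modules and diagnose are hypothetical and not part of any NVIDIA or GCP tool.

```python
import shutil
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def diagnose(proc_modules_path="/proc/modules"):
    """Report whether nvidia-smi is on the PATH and whether the nvidia module is loaded."""
    try:
        text = Path(proc_modules_path).read_text()
    except OSError:
        text = ""  # /proc is unavailable (for example, not running on Linux)
    return {
        "binary_on_path": shutil.which("nvidia-smi") is not None,
        "kernel_module_loaded": "nvidia" in loaded_modules(text),
    }


if __name__ == "__main__":
    print(diagnose())
```

If binary_on_path is True but kernel_module_loaded is False, that would match the symptom described here: the user-space tools are installed, but no driver is loaded for them to talk to.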

Comments (3)

简单气质女生网名 2025-01-24 00:03:42

After I posted the question, the NVIDIA driver started working properly after a few minutes of waiting.

Over the following days, I stopped and started the VM instance several times. Sometimes nvidia-smi works right away; sometimes it still fails after more than 20 minutes of waiting. My current best answer is to wait several minutes first. If nvidia-smi still does not work, stop and start the instance again.
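
The "wait, then check again" routine can be automated. The following is a minimal Python sketch, not an official tool: it assumes nvidia-smi exits with a non-zero status while it cannot reach the driver, and the 10-minute timeout and 30-second interval are arbitrary values chosen for illustration. The names nvidia_smi_ok and wait_for_driver are hypothetical.

```python
import subprocess
import time


def nvidia_smi_ok():
    """Return True if nvidia-smi runs and exits with status 0."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # binary missing, not executable, or hung
    return result.returncode == 0


def wait_for_driver(check=nvidia_smi_ok, timeout_s=600, interval_s=30, sleep=time.sleep):
    """Poll `check` until it succeeds or roughly `timeout_s` seconds have been spent waiting."""
    waited = 0
    while True:
        if check():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    if wait_for_driver():
        print("nvidia-smi is working")
    else:
        print("still failing; consider stopping and starting the instance again")
```

The check and sleep parameters are injectable only so the loop can be exercised without a GPU; on the VM itself the defaults are enough.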

征棹 2025-01-24 00:03:42

I also ran into this issue. In case it helps someone, running the following command [1] fixed it for us:

$ sudo apt-get install linux-headers-`uname -r`

This was on Debian 11.
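
One plausible reading of why this helps (an assumption, not something confirmed in this thread) is that the NVIDIA kernel module has to be built against the running kernel, so it cannot be rebuilt when the headers for that kernel are missing. Below is a minimal Python sketch for checking whether the headers are present before installing them. The two paths follow the usual Debian layout and are assumptions, and header_dirs and headers_installed are hypothetical helper names.

```python
import os
import platform


def header_dirs(release, root="/"):
    """Candidate header locations for `release` (assumed Debian-style layout)."""
    return [
        os.path.join(root, "usr", "src", f"linux-headers-{release}"),
        os.path.join(root, "lib", "modules", release, "build"),
    ]


def headers_installed(release=None, root="/"):
    """Return True if a header directory exists for the given kernel release."""
    release = release or platform.release()  # same value as `uname -r`
    return any(os.path.isdir(path) for path in header_dirs(release, root))


if __name__ == "__main__":
    release = platform.release()
    if headers_installed(release):
        print(f"headers for {release} appear to be installed")
    else:
        print(f"no headers found; try: sudo apt-get install linux-headers-{release}")
```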


失与倦" 2025-01-24 00:03:42

What worked for me (I am not sure it will survive the next restart) was to remove all the drivers with sudo apt remove --purge '*nvidia*' and then force the installation with sudo python3 install_gpu_driver.py.

To force it, edit install_gpu_driver.py and change line 230, inside the check_driver_installed function, to return False. Then run the script.

Anyone using Docker may then hit the error docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] and have to reinstall Docker as well. This thread helped me.
