NVIDIA driver not found after stopping and starting a Deep Learning VM
[TL;DR] First, wait a couple of minutes and check whether the NVIDIA driver starts working properly. If it does not, stop and start the VM instance again.
I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, running nvidia-smi gives the following error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But if I type which nvidia-smi, I get
/usr/bin/nvidia-smi
It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? The first time I created and opened the instance, the driver was installed automatically.
The system information (from uname -m && cat /etc/*release) is:
x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I tried the installation script from GCP. First I ran
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
and then
sudo python3 install_gpu_driver.py
which gave the following message:
Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
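One way to read these symptoms: which nvidia-smi only shows that the user-space tool is on the PATH, and judging by its output the install script relies on the same test, so neither says whether the kernel module is actually loaded. A minimal Python sketch that separates the two checks follows. The helper names are made up for illustration, and it assumes the driver's kernel module is listed under the name nvidia in /proc/modules.

```python
import shutil
from pathlib import Path


def nvidia_module_loaded(proc_modules_text: str) -> bool:
    """Return True if a module named exactly 'nvidia' is listed in the text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == "nvidia":
            return True
    return False


def diagnose() -> str:
    """Report whether the tool and the kernel module are both present."""
    has_tool = shutil.which("nvidia-smi") is not None
    modules_file = Path("/proc/modules")  # Linux only; treated as empty elsewhere
    text = modules_file.read_text() if modules_file.exists() else ""
    has_module = nvidia_module_loaded(text)
    if has_tool and not has_module:
        return "nvidia-smi is installed, but the nvidia kernel module is not loaded"
    if has_tool and has_module:
        return "nvidia-smi is installed and the nvidia kernel module is loaded"
    return "nvidia-smi was not found on PATH"
```

Calling print(diagnose()) on the VM right after a restart would show which of the two pieces is missing.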
Comments (3)
After I posted the question, the NVIDIA driver started working properly after a couple of minutes of waiting.
In the following days, I stopped and started the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still does not work after more than 20 minutes of waiting. My current best answer is to wait several minutes first. If nvidia-smi still does not work, stop and start the instance again.
I also ran into this issue. In case it helps someone, running the following command [1] fixed it for us:
This was on Debian 11.
log
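The wait-then-restart advice from the first comment can be sketched as a small polling loop in Python. This is an illustration, not something posted in the thread: the function names, the 5-minute timeout, and the 15-second interval are assumptions, and it treats a zero exit status from nvidia-smi as "the driver is responding".

```python
import subprocess
import time
from typing import Callable


def driver_responds() -> bool:
    """Return True if nvidia-smi runs and exits with status 0."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The tool itself is missing from PATH.
        return False
    return result.returncode == 0


def wait_for_driver(
    check: Callable[[], bool] = driver_responds,
    timeout_s: float = 300.0,   # assumed budget, not from the thread
    interval_s: float = 15.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

If wait_for_driver() returns False, the next step suggested above is to stop and start the instance again.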
What worked for me (I am not sure whether it will survive the next restarts) was to remove all drivers with
sudo apt remove --purge '*nvidia*'
and then force the installation with
sudo python3 install_gpu_driver.py
To force it, edit install_gpu_driver.py and change line 230, inside the check_driver_installed function, to return False. Then run the script. Anyone using Docker may hit this error
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
and have to reinstall Docker too. This thread helped me.
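Editing by line number is fragile, since line 230 may move if the script changes upstream. A hedged alternative is to patch by function name. The Python sketch below is hypothetical: the helper force_reinstall_patch is not part of the GCP script, and it assumes check_driver_installed is declared on a single line with 4-space indentation.

```python
import re


def force_reinstall_patch(source: str, func_name: str = "check_driver_installed") -> str:
    """Insert 'return False' as the first statement of func_name in source."""
    lines = source.splitlines(keepends=True)
    # Matches a one-line "def name(...):" header and captures its indentation.
    header = re.compile(
        r"^(\s*)def\s+" + re.escape(func_name) + r"\s*\(.*\).*:\s*$"
    )
    for i, line in enumerate(lines):
        match = header.match(line.rstrip("\n"))
        if match:
            body_indent = match.group(1) + "    "  # assumes 4-space indentation
            lines.insert(i + 1, body_indent + "return False  # forced reinstall\n")
            return "".join(lines)
    raise ValueError(f"function {func_name!r} not found")
```

To use it, read install_gpu_driver.py into a string, pass it through force_reinstall_patch, write the result to a copy, and run that copy with sudo python3 after purging the old packages.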