
NVIDIA driver not found after stopping and starting a Deep Learning VM

Posted on 2025-01-17 00:03:42

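[TL;DR] First, wait a couple of minutes and check whether the NVIDIA driver starts working properly. If it does not, stop and start the VM instance again.

I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, running nvidia-smi gave the following error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

But if I type which nvidia-smi, I get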

/usr/bin/nvidia-smi

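It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? The first time I created and opened the instance, the driver was installed automatically.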

The system information is (using uname -m && cat /etc/*release):

x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

I tried the installation script from GCP. First run

curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py

And then run

sudo python3 install_gpu_driver.py

which gives the following message:

Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
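
The installer output above suggests that its check only asks whether the nvidia-smi binary can be found, which says nothing about whether the kernel module is actually loaded. A minimal sketch for telling the two cases apart, assuming the module is named nvidia and that loaded modules are listed in /proc/modules. The optional second argument is a hypothetical override added here for testing; it is not part of the original post:

```shell
# Succeed if a kernel module with the given name appears in the module list.
# Usage: module_loaded NAME [MODULES_FILE]
# MODULES_FILE defaults to /proc/modules; the override is an illustrative
# assumption, not something taken from the original post.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Report which pieces are present: the userspace tool, the kernel module, or both.
driver_status() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found"
    elif module_loaded nvidia; then
        echo "nvidia-smi found, kernel module loaded"
    else
        echo "nvidia-smi found, kernel module not loaded"
    fi
}
```

Calling driver_status right after a restart should show whether the situation matches the one described in the question: the binary is present but the module is missing.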



简单气质女生网名 2025-01-24 00:03:42

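After I posted the question, the NVIDIA driver started working properly after a few minutes of waiting.

Over the following days I stopped and started the VM instance several times. Sometimes nvidia-smi worked right away; sometimes it still did not work after more than 20 minutes. My best answer so far is to wait a few minutes first. If nvidia-smi still does not work, stop and start the instance again.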
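
The "wait, then restart" advice can be scripted as a simple polling loop. This is only a sketch: the function name, the timeout, and the polling interval are illustrative choices, and it relies on nvidia-smi exiting with a non-zero status while it cannot talk to the driver.

```shell
# Poll a check command until it succeeds or the timeout elapses.
# Usage: wait_for_driver TIMEOUT_SECONDS INTERVAL_SECONDS [CHECK_COMMAND...]
# The check defaults to nvidia-smi; a different command can be passed instead.
wait_for_driver() {
    local timeout interval start now
    timeout="$1"
    interval="$2"
    shift 2
    [ "$#" -gt 0 ] || set -- nvidia-smi
    start=$(date +%s)
    # Keep retrying while the check command fails.
    while ! "$@" >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout" ]; then
            return 1    # gave up: the check never succeeded in time
        fi
        sleep "$interval"
    done
    return 0
}
```

For example, wait_for_driver 600 15 waits up to ten minutes, checking every 15 seconds. A non-zero return is the signal to stop and start the instance again, as suggested above.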


征棹 2025-01-24 00:03:42

I also ran into this issue. In case it helps someone, running the following command [1] fixed it for us:

$ sudo apt-get install linux-headers-`uname -r`

This was on Debian 11.

[1] log
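
One plausible reading of this fix is that the driver's kernel module could not be built for the running kernel because the matching headers were missing. A small sketch for checking that before installing anything, assuming the Debian headers package provides /lib/modules/<release>/build. The helper names and the optional root argument are hypothetical and only there for illustration:

```shell
# Print the name of the headers package matching a kernel release
# (defaults to the running kernel).
headers_package() {
    printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}

# Succeed if build files for the given kernel release are present.
# Assumption: the headers package provides <root>/<release>/build.
# Usage: headers_present [RELEASE] [MODULES_ROOT]
headers_present() {
    local release root
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"   # hypothetical override, useful for testing
    [ -e "$root/$release/build" ]
}
```

With these, headers_present || sudo apt-get install "$(headers_package)" runs the install from the answer only when the headers appear to be missing.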


失与倦＂ 2025-01-24 00:03:42

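What worked for me (not sure whether it will survive the next restart) was to remove all the drivers with sudo apt remove --purge '*nvidia*' and then force the installation with sudo python3 install_gpu_driver.py.

In install_gpu_driver.py, change line 230, inside the check_driver_installed function, to return False. Then run the script.

Those who use Docker may hit the error docker: Error response from daemon: Could not select device driver "" with features: [[gpu]] and will also have to reinstall Docker. This thread helped me.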
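
Because purging drivers is destructive, it may be worth previewing the two commands before running them. The sketch below wraps them in a dry-run helper; the run and reinstall_nvidia_driver names and the DRY_RUN variable are illustrative assumptions, and the sketch does not perform the line 230 edit described above.

```shell
# Print a command instead of executing it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# The two steps from the answer: purge the NVIDIA packages, then rerun the
# installer script (assumed to be in the current directory).
reinstall_nvidia_driver() {
    run sudo apt remove --purge '*nvidia*'
    run sudo python3 install_gpu_driver.py
}
```

Calling reinstall_nvidia_driver as-is only prints the commands. Setting DRY_RUN=0 before the call executes them.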


About the author

满地尘埃落定
