"docker:19.03-dind" could not select device driver "nvidia" with capabilities: [[gpu]]

Posted on 2025-01-20 21:22:20

I ran into a K8S + DinD issue:

  • launch a Kubernetes cluster
  • start a main Docker image and a DinD image inside this cluster
  • when running a job that requests a GPU, I get the error: could not select device driver "nvidia" with capabilities: [[gpu]]

Full error:

http://localhost:2375/v1.40/containers/long-hash-string/start: Internal Server Error ("could not select device driver "nvidia" with capabilities: [[gpu]]")

When I exec into the DinD container inside the K8S pod, nvidia-smi is not available.

After some debugging, it seems the cause is that the DinD image is missing the NVIDIA Docker toolkit. I got the same error when I ran the same job directly on Docker on my local laptop, and fixed it there by installing nvidia-docker2: sudo apt-get install -y nvidia-docker2

I'm thinking I could try installing nvidia-docker2 into the DinD 19.03 image (docker:19.03-dind), but I'm not sure how to do it. With a multi-stage Docker build?

Thank you very much!


update:

pod spec:

spec:
    containers:
      - name: dind-daemon
        image: docker:19.03-dind

Comments (1)

那伤。 2025-01-27 21:22:20

I got it working myself.

Referring to

First, I modified the ubuntu-dind image (https://github.com/billyteves/ubuntu-dind) to install nvidia-docker (i.e. I added the instructions from the nvidia-docker site to the Dockerfile) and changed it to be based on nvidia/cuda:9.2-runtime-ubuntu16.04.

Then I created a pod with two containers: a frontend ubuntu container and a privileged Docker daemon container as a sidecar. The sidecar's image is the modified one mentioned above.
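
The two-container pod described above could look roughly like the following. This is only a sketch, not the exact manifest used here: the pod name, the frontend container name, the volume name, and the DOCKER_HOST value are hypothetical, and the nvidia.com/gpu limit assumes the NVIDIA device plugin is installed on the cluster. It also assumes the sidecar's entrypoint makes dockerd listen on tcp port 2375, which matches the URL in the error message from the question.

```shell
#!/bin/sh
# Sketch: write a pod manifest with a frontend container and a privileged
# DinD sidecar. All names except the two images are hypothetical.
manifest="gpu-dind-pod.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-dind-pod
spec:
  containers:
    # Frontend: any Docker client started here would talk to the sidecar daemon.
    - name: frontend
      image: ubuntu:20.04
      command: ["sleep", "infinity"]
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375
    # Sidecar: the modified DinD image with nvidia-docker2 installed.
    - name: dind-daemon
      image: brandsight/dind:nvidia-docker
      securityContext:
        privileged: true
      resources:
        limits:
          nvidia.com/gpu: 1   # assumption: NVIDIA device plugin is available
      volumeMounts:
        - name: dind-storage
          mountPath: /var/lib/docker
  volumes:
    - name: dind-storage
      emptyDir: {}
EOF

echo "wrote $manifest"
# Apply it with: kubectl apply -f gpu-dind-pod.yaml
```

Both containers share the pod's network namespace, which is why the frontend can reach the daemon on localhost.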

But since that post is now 3 years old, I spent quite some time matching up dependency versions, dealing with repos that had migrated over those 3 years, etc.

My modified version of the Dockerfile to build it:

ARG CUDA_IMAGE=nvidia/cuda:11.0.3-runtime-ubuntu20.04
FROM ${CUDA_IMAGE}

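# NOTE: DOCKER_CE_VERSION is not referenced below; the apt-get install step
# installs whatever docker-ce version the repository currently offers.
ARG DOCKER_CE_VERSION=5:18.09.1~3-0~ubuntu-xenial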


RUN apt-get update -q && \
    apt-get install -yq \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
    add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"  && \
    apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io

# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
    apt-get update -q && \
    apt-get install -yq \
        btrfs-progs \
        e2fsprogs \
        iptables \
        xfsprogs \
        xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
        pigz \
#        zfs \
        wget


# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
    && addgroup --system dockremap \
    && adduser --system --ingroup dockremap dockremap \
    && echo 'dockremap:165536:65536' >> /etc/subuid \
    && echo 'dockremap:165536:65536' >> /etc/subgid

# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT=37498f009d8bf25fbb6199e8ccd34bed84f2874b

RUN set -eux; \
    wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
    chmod +x /usr/local/bin/dind


##### Install nvidia docker #####
# Add the package repositories
RUN curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add --no-tty -

RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
    echo $distribution &&  \
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
      tee /etc/apt/sources.list.d/nvidia-docker.list

RUN apt-get update -qq --fix-missing

RUN apt-get install -yq nvidia-docker2

RUN sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json

RUN mkdir -p /usr/local/bin/
COPY dockerd-entrypoint.sh /usr/local/bin/
RUN chmod 777 /usr/local/bin/dockerd-entrypoint.sh
RUN ln -s /usr/local/bin/dockerd-entrypoint.sh /

VOLUME /var/lib/docker
EXPOSE 2375

ENTRYPOINT ["dockerd-entrypoint.sh"]
#ENTRYPOINT ["/bin/sh", "/shared/dockerd-entrypoint.sh"]
CMD []
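
The sed line in the Dockerfile above is easy to misread, so here is a small sketch that replays it on a scratch file. The starting file content is an assumption about what the nvidia-docker2 package writes to /etc/docker/daemon.json, and the scratch file name daemon.sample.json is made up for illustration. It relies on GNU sed, as shipped in the Ubuntu base image.

```shell
#!/bin/sh
# Sketch: replay the Dockerfile's sed edit on a scratch copy of daemon.json.
# The content below is an ASSUMED version of the file nvidia-docker2 installs.
daemon_json="daemon.sample.json"

cat > "$daemon_json" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Same command as in the Dockerfile: insert a new line before line 2,
# i.e. directly after the opening brace.
sed -i '2i \ \ \ \ "default-runtime": "nvidia",' "$daemon_json"

# Show the patched file; line 2 now carries the default-runtime key.
cat "$daemon_json"
```

With "default-runtime" set to "nvidia", containers started by this daemon should use the NVIDIA runtime even when no runtime is requested explicitly.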

When I exec into the Docker-in-Docker container, I can now run nvidia-smi successfully. Previously it returned a "not found" error, and no GPU-related docker run would work.

You're welcome to pull my image at brandsight/dind:nvidia-docker
