rootless docker: nvidia-device-plugin is failing with crashloopbackoff · kubernetes/minikube#18952

Labels: area/gpu, help wanted, kind/bug, kind/improvement, kind/support

Description

What Happened?

I am trying to create a minikube cluster with an NVIDIA GPU using the docker driver. I have followed all the instructions in the docs. Using the GPU from a Docker container directly works, as shown below:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Thu May 23 19:32:58 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               1W /  50W |      6MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

But when I create a minikube cluster with GPU support, the nvidia-device-plugin-daemonset pod fails with the error below:

failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown

The command I am using to create the cluster, and its output:

minikube start --docker-opt="default-ulimit=nofile=102400:102400" --profile gputest --driver docker --container-runtime docker --gpus all --cpus=4 --memory='20g' 
😄  [gputest] minikube v1.33.1 on Ubuntu 22.04
✨  Using the docker driver based on user configuration
📌  Using rootless Docker driver
👍  Starting "gputest" primary control-plane node in "gputest" cluster
🚜  Pulling base image v0.0.44 ...
🔥  Creating docker container (CPUs=4, Memory=20480MB) ...
🐳  Preparing Kubernetes v1.30.0 on Docker 26.1.1 ...
    ▪ opt default-ulimit=nofile=102400:102400
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring bridge CNI (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: nvidia-device-plugin, storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "gputest" cluster and "default" namespace by default

Attach the log file

minikube_logs.txt

Operating System

Ubuntu

Driver

Docker

Contributor guide

Research direction: Investigate the NVIDIA device plugin error with rootless Docker: check cgroup permissions, try setting the Docker option cgroupns=host, or switch to the non-rootless Docker driver.
Tech stack: Go, Docker, Kubernetes
Domain: backend, infrastructure
Issue type: Bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Docker, Kubernetes, NVIDIA GPU drivers
Newbie friendliness: 60
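
As a starting point for the "check cgroup permissions" step in the research direction above, the following sketch reports whether the local Docker daemon is running rootless and which cgroup version the host mounts. The helper names `docker_mode` and `cgroup_version` are hypothetical and not part of minikube or Docker. The sketch assumes a Docker version whose `docker info` exposes `SecurityOptions`, and GNU `stat`. It only gathers facts and does not attempt a fix.

```shell
#!/usr/bin/env bash
# Diagnostic sketch (hypothetical helpers, not part of minikube or Docker).

# Classify the SecurityOptions string printed by `docker info`.
# Rootless daemons list "name=rootless" among their security options.
docker_mode() {
  case "$1" in
    "")              echo "unknown"  ;;  # docker info returned nothing
    *name=rootless*) echo "rootless" ;;
    *)               echo "rootful"  ;;
  esac
}

# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs means the unified (v2) hierarchy; tmpfs means v1 or hybrid.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1 or hybrid" ;;
    *)         echo "unknown" ;;
  esac
}

# Only query Docker when the CLI is installed; tolerate a stopped daemon.
if command -v docker >/dev/null 2>&1; then
  opts="$(docker info --format '{{.SecurityOptions}}' 2>/dev/null || true)"
  echo "Docker mode: $(docker_mode "$opts")"
fi

# Only inspect the cgroup mount when it exists (Linux hosts).
if [ -d /sys/fs/cgroup ]; then
  fstype="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || true)"
  echo "Host cgroup: $(cgroup_version "$fstype")"
fi
```

On cgroup v2 the device controller is implemented with eBPF programs, so a result of "rootless" together with "v2" would be consistent with the `bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted` message in the report, although the issue itself does not confirm the root cause.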