rootless docker: nvidia-device-plugin is failing with crashloopbackoff
#18,952 opened on May 23, 2024
Description
What Happened?
I am trying to create a minikube cluster with NVIDIA GPU support using the docker driver. I have followed all the instructions in the docs. Using the GPU directly from a Docker container works, as shown below:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Thu May 23 19:32:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               1W /  50W |      6MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
But when I create a minikube cluster with GPU support, the nvidia-device-plugin-daemonset pod fails with the error below:
failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
The command I am using to create the cluster, and its output:
minikube start --docker-opt="default-ulimit=nofile=102400:102400" --profile gputest --driver docker --container-runtime docker --gpus all --cpus=4 --memory='20g'
😄 [gputest] minikube v1.33.1 on Ubuntu 22.04
✨ Using the docker driver based on user configuration
📌 Using rootless Docker driver
👍 Starting "gputest" primary control-plane node in "gputest" cluster
🚜 Pulling base image v0.0.44 ...
🔥 Creating docker container (CPUs=4, Memory=20480MB) ...
🐳 Preparing Kubernetes v1.30.0 on Docker 26.1.1 ...
▪ opt default-ulimit=nofile=102400:102400
▪ Generating certificates and keys ...
▪ Booting up control plane ...
▪ Configuring RBAC rules ...
🔗 Configuring bridge CNI (Container Networking Interface) ...
🔎 Verifying Kubernetes components...
▪ Using image nvcr.io/nvidia/k8s-device-plugin:v0.15.0
▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟 Enabled addons: nvidia-device-plugin, storage-provisioner, default-storageclass
🏄 Done! kubectl is now configured to use "gputest" cluster and "default" namespace by default
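The "Using rootless Docker driver" line in the output above can be cross-checked against the Docker daemon itself, since the device plugin error (bpf_prog_query failing with "operation not permitted") is a cgroup permission problem. Below is a minimal sketch of such a check, not something taken from the minikube docs. It assumes `docker info` lists a `name=rootless` entry in its SecurityOptions when the daemon is rootless; `is_rootless` is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/sh
# is_rootless is a hypothetical helper, not part of Docker or minikube.
# It succeeds when a SecurityOptions string contains "name=rootless".
is_rootless() {
  case "$1" in
    *name=rootless*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the daemon when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  opts=$(docker info --format '{{.SecurityOptions}}' 2>/dev/null) || true
  if is_rootless "$opts"; then
    echo "Docker daemon is rootless"
  else
    echo "Docker daemon is not reported as rootless"
  fi
  # Cgroup version as seen by the daemon (1 or 2).
  docker info --format 'cgroup version: {{.CgroupVersion}}' 2>/dev/null || true
fi
```

If this reports a rootless daemon, that matches the minikube output and would be consistent with the container runtime hook not being allowed to query or modify device cgroup filters.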
Attach the log file
Operating System
Ubuntu
Driver
Docker