kubernetes-sigs/kubespray

cilium_operator_image_repo references wrong image name for offline registry sync (`cilium/operator` vs chart-required `cilium/operator-generic`)

Open

#13252 opened on May 12, 2026

Labels: Ubuntu 24, help wanted, kind/bug, triage/accepted

Description

What happened?

Environment

  • Kubespray: master branch, commit bdbfcaae8 (v2.30.0-94), bug also present on origin/master HEAD
  • Cilium: 1.19.2 chart and image
  • cilium-cli: v0.18.9
  • K8s: 1.34.4 with containerd
  • Deployment mode: offline registry (dockerhub_image_repo set)

Summary

Kubespray syncs the cilium/operator image to offline registries, but the Cilium Helm chart requires cilium/operator-generic for non-cloud deployments. This mismatch causes broken deployments for offline-registry users.

Root cause

Cilium chart image naming convention

In cilium/templates/cilium-operator/_helpers.tpl, the chart constructs the operator image name as:

{repository}-{cloud}{suffix}{tag}{digest}

where {cloud} is determined by the cilium.operator.cloud define:

  • aws if eni.enabled
  • azure if azure.enabled
  • alibabacloud if alibabacloud.enabled
  • generic otherwise (default for non-cloud, including bare-metal)

So the rendered image for a non-cloud deployment with default values is:

quay.io/cilium/operator + "-" + "generic" + "" + ":v1.19.2"
= quay.io/cilium/operator-generic:v1.19.2
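The naming rule above can be sketched in a few lines of Python. This is only a simplified model of the helper as described in this issue, not the chart's actual template code; the function names and the handling of the digest field are assumptions made for illustration.

```python
def operator_cloud(values: dict) -> str:
    """Pick the operator flavour following the order listed in this issue."""
    if values.get("eni", {}).get("enabled"):
        return "aws"
    if values.get("azure", {}).get("enabled"):
        return "azure"
    if values.get("alibabacloud", {}).get("enabled"):
        return "alibabacloud"
    return "generic"  # default for non-cloud, including bare-metal


def operator_image(values: dict) -> str:
    """Build {repository}-{cloud}{suffix}:{tag}{digest} (simplified model)."""
    image = values["operator"]["image"]
    cloud = operator_cloud(values)
    suffix = image.get("suffix", "")
    # Assumption: the digest is appended only when useDigest is set.
    digest = ""
    if image.get("useDigest") and image.get("digest"):
        digest = "@" + image["digest"]
    return f"{image['repository']}-{cloud}{suffix}:{image['tag']}{digest}"


defaults = {
    "operator": {
        "image": {"repository": "quay.io/cilium/operator", "tag": "v1.19.2"}
    }
}
print(operator_image(defaults))  # quay.io/cilium/operator-generic:v1.19.2
```

The point of the sketch is that the repository value is treated as a base name: whatever is passed in always gets a -{cloud} suffix appended.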

The same {cloud} variable is also used in the deployment's command:

command:
- cilium-operator-{{ include "cilium.operator.cloud" . }}

Kubespray mismatch

roles/kubespray_defaults/defaults/main/download.yml:237:

cilium_operator_image_repo: "{{ quay_image_repo }}/cilium/operator"

This value is used in two places:

  1. download.yml:599 (image sync entry) — Kubespray pulls quay.io/cilium/operator:vX.Y.Z and pushes it to the offline registry as <registry>/cilium/operator:vX.Y.Z.
  2. roles/network_plugin/cilium/templates/values.yaml.j2:154-157 — rendered to chart values as:
    operator:
      image:
        repository: <registry>/cilium/operator
        tag: vX.Y.Z
    

The chart then applies its helper logic and ends up requesting image <registry>/cilium/operator-generic:vX.Y.Z — a name not synced to the offline registry.
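A small sketch of the mismatch, assuming a hypothetical registry prefix (registry.example/kubespray) and a non-cloud deployment. The two functions only mirror the behaviour described in this issue; they are not Kubespray or chart code.

```python
REGISTRY = "registry.example/kubespray"  # hypothetical offline registry prefix
TAG = "v1.19.2"


def kubespray_synced_images(registry: str, tag: str) -> set:
    """Names pushed to the offline registry, derived from cilium_operator_image_repo."""
    repo = f"{registry}/cilium/operator"
    return {f"{repo}:{tag}"}


def chart_requested_image(registry: str, tag: str, cloud: str = "generic") -> str:
    """Name the chart requests after its helper appends -{cloud}."""
    repo = f"{registry}/cilium/operator"
    return f"{repo}-{cloud}:{tag}"


synced = kubespray_synced_images(REGISTRY, TAG)
requested = chart_requested_image(REGISTRY, TAG)
missing = requested not in synced

print("synced:   ", sorted(synced))
print("requested:", requested)
print("missing from offline registry:", missing)
```

Both names derive from the same variable, but only one of them passes through the chart helper, which is why they diverge.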

Why online-registry users don't hit this

Online users pull quay.io/cilium/operator-generic:vX.Y.Z directly from upstream. With the default quay_image_repo, Kubespray's repository override resolves to quay.io/cilium/operator, so the chart helper still renders quay.io/cilium/operator-generic, a name that exists upstream.

Why this hasn't been reported

Offline-registry users typically have an image sync workflow that inadvertently masks the bug:

  • The sync sees cilium/operator:vX.Y.Z in Kubespray's list
  • The pull from upstream succeeds (this name exists upstream as the build base for the cloud variants)
  • Some sync scripts auto-retag the pulled image to additional aliases including cilium/operator-generic
  • Cluster pulls cilium/operator-generic successfully — but gets the wrong image content (contains cilium-operator binary, not cilium-operator-generic)
  • For some Cilium versions/cilium-cli combinations, the resulting deployment's command happens to match the wrong binary, and the pod runs, appearing as a working deployment

This silent failure mode can persist until a chart upgrade synchronizes the deployment's command field with the chart's expectation (cilium-operator-generic), at which point new pods CrashLoopBackOff:

exec: "cilium-operator-generic": executable file not found in $PATH
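The masking effect can be illustrated with a toy model. It assumes, as stated above, that the cilium/operator image contains only the cilium-operator binary; the dictionary layout and function names are hypothetical.

```python
def retag(registry: dict, source: str, alias: str) -> None:
    """A retag gives the same image content a second name."""
    registry[alias] = registry[source]


def pod_outcome(registry: dict, image: str, command: str) -> str:
    """Rough model of what happens when a pod is scheduled."""
    binaries = registry.get(image)
    if binaries is None:
        return "ErrImagePull"      # name not present in the registry
    if command not in binaries:
        return "CrashLoopBackOff"  # exec: executable file not found in $PATH
    return "Running"


# Toy registry: image name -> binaries inside the image (per this issue).
registry = {"cilium/operator:v1.19.2": {"cilium-operator"}}

before = pod_outcome(registry, "cilium/operator-generic:v1.19.2", "cilium-operator-generic")
retag(registry, "cilium/operator:v1.19.2", "cilium/operator-generic:v1.19.2")
after = pod_outcome(registry, "cilium/operator-generic:v1.19.2", "cilium-operator-generic")
lucky = pod_outcome(registry, "cilium/operator:v1.19.2", "cilium-operator")

print(before, after, lucky)  # ErrImagePull CrashLoopBackOff Running
```

In this model, retagging turns a visible pull error into an exec failure, and only the accidental image/command pairing runs.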

Reproduction

  1. Configure Kubespray for offline registry:
    dockerhub_image_repo: "<registry>/kubespray"
    
  2. Use Kubespray's image sync (without any extra retagging) to populate the offline registry from cilium_image_list.
  3. Verify only cilium/operator (not cilium/operator-generic) ends up in the offline registry:
    curl -s "https://<registry>/v2/kubespray/cilium/operator-generic/tags/list"
    # 404 or empty
    
  4. Deploy cilium:
    ansible-playbook cluster.yml -i inventory/... --tags cilium
    
  5. Observe the operator pods failing to start (ErrImagePull / ImagePullBackOff) with:
    Failed to pull image "<registry>/kubespray/cilium/operator-generic:vX.Y.Z": 
    manifest unknown
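
The registry check in step 3 can also be scripted. The sketch below uses only the Python standard library and the tags/list endpoint shown in the curl command; it assumes an unauthenticated HTTPS registry, and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request


def tags_url(registry_host: str, repository: str) -> str:
    """Build the tags/list URL used in the curl commands of this issue."""
    return f"https://{registry_host}/v2/{repository}/tags/list"


def parse_tags(body: str) -> list:
    """Return the tag list from a tags/list response, or [] on an error body."""
    data = json.loads(body)
    if "errors" in data:
        return []
    return data.get("tags") or []


def list_tags(registry_host: str, repository: str, timeout: float = 10.0) -> list:
    """Query the registry; a 404 is treated as 'repository not present'."""
    try:
        with urllib.request.urlopen(tags_url(registry_host, repository), timeout=timeout) as resp:
            return parse_tags(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []
        raise


# Example with canned responses (no network access needed):
print(parse_tags('{"name":"kubespray/cilium/operator","tags":["v1.19.2"]}'))
print(parse_tags('{"errors":[{"code":"NAME_UNKNOWN"}]}'))
```

Calling list_tags for both kubespray/cilium/operator and kubespray/cilium/operator-generic should show the tag under the first name and an empty list under the second when the bug is present.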
    

Evidence

Available on request. Key data points:

  • helm template rendering with default values produces quay.io/cilium/operator-generic:v1.19.2
  • helm template with --set operator.image.repository=<custom>/cilium/operator still produces <custom>/cilium/operator-generic:v1.19.2 (chart helper unconditionally adds -{cloud} suffix)
  • A real offline-registry deployment ended up with three different ReplicaSet specs over several upgrades; the only Ready one used image=operator + command=cilium-operator (matching by luck), while the deterministic chart output (image=operator-generic + command=cilium-operator-generic) consistently failed

What did you expect to happen?

cilium-operator deployment should be Running with image and command fields that match what the offline registry contains. Specifically:

  • Kubespray should sync the upstream image under the same name the chart's helper will compute (cilium/operator-generic for non-cloud deployments), or
  • Kubespray should pass operator.image.override to the chart so that the chart skips the helper's suffix logic and uses the explicit image name Kubespray has synced.

In either case, the final deployed image and command should be consistent and point to a valid binary inside the image.

How can we reproduce it (as minimally and precisely as possible)?

  1. Configure Kubespray for offline registry deployment:

    # inventory/<cluster>/group_vars/all/offline.yml
    dockerhub_image_repo: "<your-registry>/kubespray"
    quay_image_repo: "{{ dockerhub_image_repo }}"
    
  2. Sync images to your offline registry using whatever workflow you have (typically pulling from quay.io and pushing to your private registry). Do not auto-retag — only push the exact image name in cilium_image_list.

  3. Verify the offline registry only contains cilium/operator (the name Kubespray syncs):

    curl -s "https://<registry>/v2/kubespray/cilium/operator/tags/list"
    # {"name":"kubespray/cilium/operator","tags":["v1.19.2"]}
    
    curl -s "https://<registry>/v2/kubespray/cilium/operator-generic/tags/list"  
    # {"errors":[{"code":"NAME_UNKNOWN", ...}]}
    
  4. Deploy with:

    kube_network_plugin: cilium
    cilium_version: 1.19.2
    
    ansible-playbook -i inventory/<cluster>/hosts.yaml cluster.yml --tags cilium
    
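  5. Observe the cilium-operator pods failing: ErrImagePull / ImagePullBackOff when operator-generic is missing from the registry, or CrashLoopBackOff with exec: "cilium-operator-generic": executable file not found when a retagged cilium/operator image is pulled instead.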

  6. Verify root cause with helm template:

    helm template cilium <cilium-1.19.2-chart-path> \
      --namespace kube-system \
      --set operator.image.repository=<registry>/kubespray/cilium/operator \
      --set operator.image.tag=v1.19.2 \
      --set operator.image.useDigest=false \
      | grep -A2 "name: cilium-operator$"
    # Shows image: <registry>/kubespray/cilium/operator-generic:v1.19.2 (chart added -generic)
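
As an alternative to grep, the rendered manifest can be scanned for the operator image with a short script. This is a sketch that relies on plain text matching rather than YAML parsing; the sample manifest fragment and the registry name in it are made up for illustration.

```python
import re

# Matches lines such as:   image: "registry/path/name:tag"
IMAGE_LINE = re.compile(r'^\s*(?:-\s+)?image:\s*"?([^"\s]+)"?\s*$', re.MULTILINE)


def operator_images(rendered: str) -> list:
    """Return image references from helm template output that mention the operator."""
    return [ref for ref in IMAGE_LINE.findall(rendered) if "/operator" in ref]


# Hypothetical fragment of `helm template` output.
sample = '''
      containers:
      - name: cilium-operator
        image: "registry.example/kubespray/cilium/operator-generic:v1.19.2"
        imagePullPolicy: IfNotPresent
'''

print(operator_images(sample))
```

Piping the real helm template output into such a script should list the -generic name even though only cilium/operator was passed as the repository.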
    

OS

Ubuntu 24

Version of Ansible

ansible [core 2.18.12]

Version of Python

python version = 3.12.4 (main, Jul 5 2024, 11:37:28) [GCC 9.4.0] (/usr/local/python3.12/bin/python3.12)

Version of Kubespray (commit)

commit bdbfcaae847ab0f2adcb3b420b12b1c5b4baffce tag: v2.30.0-94-gbdbfcaae8 (master branch, 94 commits after v2.30.0)

Network plugin used

cilium

Full inventory with variables

https://gist.github.com/Feelings0220/e531c5a94af04ecbc279314086cdfd45

Command used to invoke ansible

ansible-playbook -i inventory//hosts.yaml \
  cluster.yml \
  -b \
  --become-user=root \
  -e kube_version=v1.34.4 \
  -e cilium_version=1.19.2

Output of ansible run

Welcome to Ubuntu 24.04.4 LTS (GNU/Linux 6.8.0-100-generic x86_64)

System information as of Tue May 12 02:57:30 PM CST 2026

System load: 0.45
Processes: 840
Usage of /home: 1.9% of 10.00TB
Users logged in: 1
Memory usage: 2%
IPv4 address for ens1f0: 10.8.9.168
Swap usage: 0%
IPv4 address for ens1f0: 10.8.9.150
Temperature: 70.0 C

Expanded Security Maintenance for Applications is not enabled.

127 updates can be applied immediately. To see these additional updates run: apt list --upgradable

Enable ESM Apps to receive additional future security updates. See https://ubuntu.com/esm or run: sudo pro status

Failed to connect to https://changelogs.ubuntu.com/meta-release-lts. Check your Internet connection or proxy settings

=== cilium status (after upgrade) ===
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           2 errors
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       1 errors, 2 warnings
    \__/       ClusterMesh:        disabled

DaemonSet              cilium                   Desired: 8, Ready: 8/8, Available: 8/8
DaemonSet              cilium-envoy             Desired: 8, Ready: 8/8, Available: 8/8
Deployment             cilium-operator          Desired: 3, Ready: 1/3, Available: 1/3, Unavailable: 2/3
Deployment             hubble-relay             Desired: 1, Unavailable: 1/1
Containers:            cilium                   Running: 8
                       cilium-envoy             Running: 8
                       cilium-operator          Running: 3
                       clustermesh-apiserver
                       hubble-relay             Pending: 1
Cluster Pods:          45/45 managed by Cilium
Helm chart version:    1.19.2
Image versions         cilium             dockerhub.kubekey.local/kubernetes-kubespray/cilium/cilium:v1.19.2: 8
                       cilium-envoy       dockerhub.kubekey.local/kubernetes-kubespray/cilium/cilium-envoy:v1.34.10-1762597008-ff7ae7d623be00078865cff1b0672cc5d9bfc6d5: 8
                       cilium-operator    dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator:v1.19.2: 3
                       hubble-relay       dockerhub.kubekey.local/kubernetes-kubespray/cilium/hubble-relay:v1.19.2@sha256:9987c73bad48c987fd065185535fd15a6717cbe8a8caf7fc7ef0413532cf490e: 1
Errors:                cilium-operator    cilium-operator                  2 pods of Deployment cilium-operator are not ready
                       cilium-operator    cilium-operator                  deployment cilium-operator is rolling out - 2 out of 3 pods updated
                       hubble-relay       hubble-relay                     1 pods of Deployment hubble-relay are not ready
Warnings:              hubble-relay       hubble-relay-755c6b7747-xj4rw    pod is pending
                       hubble-relay       hubble-relay-755c6b7747-xj4rw    pod is pending

=== cilium-operator pods ===
NAME                               READY   STATUS             RESTARTS         AGE   IP           NODE          NOMINATED NODE   READINESS GATES
cilium-operator-55bb5d64cc-j2kz6   1/1     Running            1 (30h ago)      30h   10.8.9.95    worker-a-03
cilium-operator-5d5878f8fb-pxbhn   0/1     CrashLoopBackOff   329 (28s ago)    27h   10.8.9.169   master-03
cilium-operator-5d5878f8fb-s4qrn   0/1     CrashLoopBackOff   327 (115s ago)   27h   10.8.9.94    worker-a-02

=== Most recent crash pod logs (last 30 lines) ===
Crash pod: pod/cilium-operator-5d5878f8fb-pxbhn pod/cilium-operator-5d5878f8fb-s4qrn
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h8m79 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  cilium-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  kube-api-access-h8m79:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.cilium.io/agent-not-ready op=Exists
Events:
  Type     Reason   Age                     From     Message
  Normal   Created  53m (x318 over 27h)     kubelet  Created container: cilium-operator
  Warning  Failed   53m (x318 over 27h)     kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "cilium-operator-generic": executable file not found in $PATH
  Warning  BackOff  3m15s (x8043 over 27h)  kubelet  Back-off restarting failed container cilium-operator in pod cilium-operator-5d5878f8fb-s4qrn_kube-system(ffd8a1c5-6ede-4218-a84e-1edf44318473)
  Normal   Pulled   117s (x328 over 27h)    kubelet  Container image "dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator:v1.19.2" already present on machine

=== Image present in offline registry ===
dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator-generic   v1.19.1   f1b5c176c6ee8   33.4MB
dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator-generic   v1.19.2   63ae62180908e   45.7MB
dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator           v1.19.2   63ae62180908e   45.7MB
dockerhub.kubekey.local/kubernetes-kubespray/cilium/operator           v1.19.1   e5091458a7e48   45.6MB
user1@sz-bianyi-112:~/mao.wei11/kubespray-deploy/cilium$

Anything else we need to know

Additional context: proposed fix

If maintainers confirm this is a real issue, I can submit a PR with the following change:

File 1: roles/kubespray_defaults/defaults/main/download.yml:237

    - cilium_operator_image_repo: "{{ quay_image_repo }}/cilium/operator"
    + cilium_operator_image_repo: "{{ quay_image_repo }}/cilium/operator-generic"

File 2: roles/network_plugin/cilium/templates/values.yaml.j2:154-157

    operator:
      image:
    -   repository: {{ cilium_operator_image_repo }}
    +   override: "{{ cilium_operator_image_repo }}:{{ cilium_operator_image_tag }}"
        tag: {{ cilium_operator_image_tag }}

Using operator.image.override prevents the chart helper from adding another -generic suffix (since cilium_operator_image_repo already ends in -generic).

Cloud variant considerations

The current default targets only non-cloud (generic) deployments, matching the most common Kubespray scenario. For users deploying with eni.enabled, azure.enabled, or alibabacloud.enabled, the fix would need to be conditionalized. Happy to extend the PR if maintainers prefer.
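
To illustrate why the override matters, here is a simplified model of the proposed behaviour. It assumes the chart returns operator.image.override verbatim when it is set, as this issue proposes to rely on; the function names and the cloud parameter are hypothetical and not part of Kubespray or the chart.

```python
def operator_image(image: dict, cloud: str = "generic") -> str:
    """Simplified chart behaviour: an explicit override bypasses the suffix logic."""
    if image.get("override"):
        return image["override"]
    return f"{image['repository']}-{cloud}{image.get('suffix', '')}:{image['tag']}"


def kubespray_operator_values(quay_image_repo: str, tag: str, cloud: str = "generic") -> dict:
    """Values Kubespray could render once the repo name already carries the flavour."""
    repo = f"{quay_image_repo}/cilium/operator-{cloud}"
    return {"override": f"{repo}:{tag}", "tag": tag}


base = "registry.example/kubespray"  # hypothetical offline registry prefix

with_override = operator_image(kubespray_operator_values(base, "v1.19.2"))
# Without the override, a repository that already ends in -generic is suffixed again.
without_override = operator_image(
    {"repository": f"{base}/cilium/operator-generic", "tag": "v1.19.2"}
)

print(with_override)     # registry.example/kubespray/cilium/operator-generic:v1.19.2
print(without_override)  # registry.example/kubespray/cilium/operator-generic-generic:v1.19.2
```

Passing cloud="aws" (or another flavour) to kubespray_operator_values shows how a conditionalized variant of the fix could keep the synced name and the deployed name aligned for cloud deployments as well.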
