Description

What is not working as documented?

TensorFlow GPU acceleration does not appear to work in PhotoPrism Plus 260523, even though the container has working NVIDIA runtime access and CUDA libraries available.

According to the TensorFlow GPU setup documentation and the recent TensorFlow 2 integration work, TensorFlow should initialize CUDA devices when models are executed. In practice, TensorFlow inference in PhotoPrism always runs on the CPU and never creates or registers a GPU device.

The following works correctly:

NVIDIA Container Toolkit
CUDA device access inside the container
NVENC hardware transcoding with FFmpeg
TensorFlow model loading and inference itself

However, the following expected GPU behavior never occurs:

no cuInit
no Created device /device:GPU:0
no CUDA loader logs
no GPU utilization during TensorFlow inference

This happens even with maximum TensorFlow CUDA debug logging enabled.

Relevant implementation work:

TensorFlow GPU initialization via PHOTOPRISM_INIT=tensorflow-gpu

How can we reproduce it?

Install Docker, NVIDIA drivers, and NVIDIA Container Toolkit on a Linux host with an NVIDIA GPU.
Verify that the NVIDIA runtime works outside PhotoPrism:

docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

Start PhotoPrism with TensorFlow GPU support enabled:

environment:
  PHOTOPRISM_INIT: "tensorflow-gpu"
  PHOTOPRISM_FFMPEG_ENCODER: "nvidia"
  NVIDIA_VISIBLE_DEVICES: "all"
  NVIDIA_DRIVER_CAPABILITIES: "all"

Verify that PhotoPrism sees TensorFlow vision models:

docker exec -it photoprism sh -c 'photoprism vision ls'

Expected output includes:

nasnet   │ labels │ tensorflow
nsfw     │ nsfw   │ tensorflow
facenet  │ face   │ tensorflow

Run TensorFlow inference with CUDA debug logging enabled:

docker exec -it photoprism sh -c '
TF_CPP_MIN_LOG_LEVEL=0 \
TF_CPP_VMODULE=dso_loader=5,dlopen_checker=5,cuda_gpu_executor=5,gpu_device=5 \
photoprism vision run -m labels --force --count 10 public:true
'

Observe that TensorFlow loads and runs the model, but no GPU initialization happens.

Actual result:

tensorflow: loading nasnet
Reading SavedModel from: /opt/photoprism/assets/models/nasnet
SavedModel load for tags { photoprism }; Status: success

But there are no logs for:

cuInit
Created device /device:GPU:0
Successfully opened dynamic library libcuda.so.1

Also, nvidia-smi shows no GPU activity during inference.
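
To make the check for the missing log lines repeatable, the inference output can be piped through a small filter. This is only a sketch: `report_gpu_markers` is a hypothetical helper name, and it assumes the TensorFlow messages reach the container's stdout or stderr. It reports the three markers listed above and returns a non-zero status if any of them is absent.

```shell
# report_gpu_markers: read a log on stdin and report, for each expected
# TensorFlow GPU marker, whether it appears anywhere in the log.
# Returns 0 only if every marker was found.
report_gpu_markers() {
  log=$(cat)
  missing=0
  for marker in "libcuda.so.1" "cuInit" "Created device /device:GPU:0"; do
    if printf '%s\n' "$log" | grep -qF -- "$marker"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: same command as in the reproduction steps, without -it so the
# output can be piped, and with stderr merged into stdout:
# docker exec photoprism sh -c '
# TF_CPP_MIN_LOG_LEVEL=0 \
# TF_CPP_VMODULE=dso_loader=5,dlopen_checker=5,cuda_gpu_executor=5,gpu_device=5 \
# photoprism vision run -m labels --force --count 10 public:true
# ' 2>&1 | report_gpu_markers
```

With the behavior described in this report, all three markers would be printed as missing.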

Have you verified that no similar reports exist?

This is a new bug that has not yet been reported or documented

What behavior do you expect?

I expect TensorFlow inference to initialize and use the NVIDIA GPU when PhotoPrism is started with:

PHOTOPRISM_INIT: "tensorflow-gpu"
NVIDIA_VISIBLE_DEVICES: "all"
NVIDIA_DRIVER_CAPABILITIES: "all"

Specifically, during photoprism vision run -m labels, TensorFlow should:

load CUDA libraries such as libcuda.so.1
call cuInit
register a GPU device, for example /device:GPU:0
show CUDA/GPU initialization messages in the TensorFlow logs
use the GPU during TensorFlow inference

Expected log examples:

Successfully opened dynamic library libcuda.so.1
Created device /device:GPU:0

GPU utilization should also be visible in nvidia-smi while TensorFlow vision models are running.
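
One way to turn "GPU utilization should be visible" into a number is to sample `nvidia-smi` on the host while inference runs and look at the peak. The sketch below assumes one utilization percentage per line, as produced by the `nvidia-smi` query shown in the usage comment; `max_gpu_util` and `gpu_util.log` are hypothetical names.

```shell
# max_gpu_util: read one GPU utilization percentage per line on stdin and
# print the highest value seen (0 if there is no input).
max_gpu_util() {
  awk 'BEGIN { max = 0 } { v = $1 + 0; if (v > max) max = v } END { print max }'
}

# Usage on the host: sample once per second while
# `photoprism vision run -m labels ...` runs in another terminal,
# stop sampling with Ctrl-C, then print the peak utilization.
# nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1 > gpu_util.log
# max_gpu_util < gpu_util.log
```

A peak that stays at or near 0 for the whole run would be consistent with inference running on the CPU only.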

What could be the cause?

Based on the investigation, this does not appear to be caused by Docker GPU passthrough, NVIDIA device permissions, missing CUDA libraries, or unsupported GPU architecture.

The NVIDIA runtime works, /dev/nvidia* devices are available in the PhotoPrism container, CUDA/cuDNN/cuBLAS libraries are visible, and the shipped TensorFlow library contains GPU/CUDA symbols, including sm_61 support for the Tesla P4.

The likely cause seems to be that PhotoPrism's TensorFlow integration loads and runs the TensorFlow models, but does not trigger TensorFlow GPU device discovery or registration. Even with:

TF_CPP_MIN_LOG_LEVEL=0
TF_CPP_VMODULE=dso_loader=5,dlopen_checker=5,cuda_gpu_executor=5,gpu_device=5

there are no CUDA loader logs, no cuInit, and no Created device /device:GPU:0.

This suggests one of the following:

the TensorFlow C API / Go wrapper session is initialized in a way that only uses CPU devices;
the PhotoPrism TensorFlow 2.18 runtime package contains GPU support, but CUDA platform registration is not active at runtime;
or PHOTOPRISM_INIT=tensorflow-gpu installs GPU-capable TensorFlow libraries, but the current PhotoPrism vision pipeline does not actually initialize TensorFlow GPU devices.

In short: the issue appears to be in the PhotoPrism TensorFlow runtime integration or initialization path, rather than in the host NVIDIA setup.

Additional Findings

`libtensorflow_framework.so` contains CUDA loader strings

strings /usr/lib/libtensorflow_framework.so.2.18.0 | grep -Ei "cuInit|libcuda|Created device"

Includes:

Failed call to cuInit
Cannot dlopen some GPU libraries
Skipping registering GPU devices
Created device
libcuda.so.1

So GPU support appears compiled into the binary.

TensorFlow RUNPATH

readelf -d /usr/lib/libtensorflow.so.2.18.0

Shows CUDA-related RUNPATH entries:

.../nvidia/cudnn/lib
.../nvidia/cublas/lib
.../nvidia/cuda_runtime/lib
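
A possible follow-up check is whether these RUNPATH directories actually exist inside the container, since a RUNPATH entry only helps the dynamic loader if the directory is present. The sketch below is an assumption-laden helper, not part of PhotoPrism: `list_runpath_dirs` and `check_runpath_dirs` are hypothetical names, and the code assumes the entries may use `$ORIGIN`, which cannot be confirmed here because the full paths are elided above.

```shell
# list_runpath_dirs: read `readelf -d` output on stdin and print one
# RUNPATH/RPATH directory per line.
list_runpath_dirs() {
  sed -n 's/.*(R\(UN\)\{0,1\}PATH).*\[\(.*\)\].*/\2/p' | tr ':' '\n'
}

# check_runpath_dirs ORIGIN: read RUNPATH entries on stdin, expand $ORIGIN
# to the given directory, and report whether each directory exists.
check_runpath_dirs() {
  origin=$1
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue
    resolved=$(printf '%s\n' "$dir" | sed "s|[$]ORIGIN|$origin|g")
    if [ -d "$resolved" ]; then
      echo "exists:  $resolved"
    else
      echo "missing: $resolved"
    fi
  done
}

# Usage inside the container (library path as given in this report):
# lib=/usr/lib/libtensorflow.so.2.18.0
# readelf -d "$lib" | list_runpath_dirs | check_runpath_dirs "$(dirname "$lib")"
```

If the directories are reported as missing, the CUDA libraries would have to be found through other means such as `LD_LIBRARY_PATH` or the loader cache; this alone would not explain the absence of CUDA loader logs, but it narrows down where the libraries are expected to come from.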

Additional Checks

PhotoPrism vision models

docker exec -it photoprism sh -c 'photoprism vision ls'

Output:

nasnet   │ labels │ tensorflow
nsfw     │ nsfw   │ tensorflow
facenet  │ face   │ tensorflow

Environment variables inside container

docker exec -it photoprism sh -c 'env | grep -Ei "CUDA|NVIDIA|TF|LD_LIBRARY"'