GPU numbering on Windows possibly in wrong order · mozilla-ai/llamafile#26

Repository metrics

Stars: 24,439
PR merge metrics: average merge time 2d 20h; 14 merged PRs in 30d

Description

I have multiple NVIDIA GPUs and originally thought llamafile was reporting usage of the wrong one. Now I'm not sure it's using either of them. Is there a way to check for sure, or to pass in a preferred device?

Windows 11 session here, using the x64 Native Tools Command Prompt.

https://gist.github.com/danbri/d8a387321642b14336701dedf166527f (excerpts only below)

It correctly finds 2 NVIDIA CUDA GPU devices:

ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6

[...]

Later it reports:

ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 8801.76 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB

In the Web UI on :8082, when I start a task, the supposed "main device" GPU (a 3090 in an external USB box; not the most efficient use of it, but hey) sits at 0% utilization in Task Manager. The built-in NVIDIA GPU appears to be in low-level use (4% max), but that seems to be background Desktop Window Manager usage. CPU usage goes to 45 or 50% while generating response tokens. Given the "offloading nothing to GPU" log messages, I guess it isn't actually using either NVIDIA GPU, despite detecting them?
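One way to check this without relying on Task Manager is to read the offload counters straight out of the log. The following is only a minimal sketch: the regular expressions are written against the exact log wording quoted above, and the helper name `summarize_offload` is hypothetical, not part of llamafile.

```python
import re

# Matches lines such as "llm_load_tensors: offloaded 0/43 layers to GPU".
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")
# Matches lines such as "llm_load_tensors: VRAM used: 0.00 MB".
VRAM_RE = re.compile(r"VRAM used: ([\d.]+) MB")


def summarize_offload(log_text):
    """Return (offloaded, total, vram_mb); each item is None if not found."""
    offload = OFFLOAD_RE.search(log_text)
    vram = VRAM_RE.search(log_text)
    offloaded = int(offload.group(1)) if offload else None
    total = int(offload.group(2)) if offload else None
    vram_mb = float(vram.group(1)) if vram else None
    return offloaded, total, vram_mb


# Log excerpt copied from this issue.
sample = """\
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 8801.76 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
"""

offloaded, total, vram_mb = summarize_offload(sample)
if offloaded == 0:
    print(f"No layers on the GPU ({offloaded}/{total}); inference runs on the CPU.")
```

With zero layers offloaded and 0.00 MB of VRAM in use, the "main device" line only records which GPU was selected, not that any work was placed on it.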

If we disconnect the external 3090 and re-run llamafile, it recognises the remaining internal NVIDIA GPU, and things seem similar, except the log now says just:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 8801.76 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU

The only processes Task Manager reports for what it calls GPU 1 (the NVIDIA) are Desktop Window Manager and Client Server Runtime Process.

I started out thinking it was using the wrong GPU; now I'm not convinced that either GPU is being used.
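On the "preferred device" question, the CUDA runtime itself honours two environment variables: `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES`. The sketch below builds a launch command and environment that pin a process to one device. It rests on assumptions: the binary name `llamafile.exe` and the helper `build_launch` are hypothetical, and `-ngl` is llama.cpp's layer-offload option, which llamafile is assumed to accept here. It is not a confirmed fix for this issue.

```python
import os


def build_launch(binary, device_index, gpu_layers):
    """Return (argv, env) for running `binary` pinned to one CUDA device."""
    env = dict(os.environ)
    # Enumerate devices by PCI bus position instead of CUDA's default
    # "fastest first" ordering, so indices are stable across runs.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the chosen device; the process then sees it as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    # ASSUMPTION: -ngl (llama.cpp's --n-gpu-layers) is accepted by the binary.
    argv = [binary, "-ngl", str(gpu_layers)]
    return argv, env


# Hypothetical values: binary name, device index, and layer count.
argv, env = build_launch("llamafile.exe", device_index=1, gpu_layers=35)
print(argv)
print(env["CUDA_DEVICE_ORDER"], env["CUDA_VISIBLE_DEVICES"])
```

The resulting pair could be passed to `subprocess.run(argv, env=env)`. Note that the GPU numbers Task Manager shows may not match CUDA's device indices, so the log line naming the device is the more reliable reference.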

Contributor guide

Research direction: Investigate the GPU device enumeration order in the CUDA backend and trace how the main device is selected and why layers are not offloaded to GPU.
Tech stack: cpp
Domain: backend, performance
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Mostly clear
Prerequisites: C++, CUDA
Newbie friendliness: 60
