[Bug]: amdgpu freeze resulting in GPU reset on large workloads
#2,656 opened on March 27, 2024
Description
Checklist
- The issue has not been resolved by following the troubleshooting guide
- The issue exists on a clean installation of Fooocus
- The issue exists in the current version of Fooocus
- The issue has not been reported before recently
- The issue has been reported before but has not been fixed yet
What happened?
I understand that amdgpu support is experimental, but I want to document this issue to help others who run into it. My system specs:
- GPU: Sapphire Nitro+ AMD Radeon RX 7800 XT
- RAM: 64GB
- Swap: 16GB
- GLX version: Mesa 24.0.2-1
Steps to reproduce the problem
When I ask Fooocus to do "too much", my display freezes, including keyboard and mouse input, and at first it appeared that I had to reboot the system. I later found that this is not necessary: I can log in via SSH and restart the display server, e.g. systemctl restart lightdm. I observe this in dmesg:
[Mar27 23:05] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6436600, emitted seq=6436602
[ +0.000160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 321623 thread Xorg:cs0 pid 321627
[ +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ +0.001546] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled, skipping HW reset
[ +0.000011] [drm] Skip scheduling IBs!
[ +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
[ +0.000005] [drm] Skip scheduling IBs!
[ +0.000004] [drm] Skip scheduling IBs!
[ +0.000002] [drm] Skip scheduling IBs!
[ +0.000003] [drm] Skip scheduling IBs!
[ +0.000002] [drm] Skip scheduling IBs!
There is also something known online as the "AMD GPU reset bug", but my GPU does not seem to be affected by it. I can trigger this bug many times: the screen freezes, dmesg shows GPU reset(n) succeeded! with n going up by 2 each time, I restart the display server via systemctl restart lightdm, and everything is fine afterwards, so I can start Fooocus again and carry on. In other words, this bug is not that bug.
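For anyone who wants to track how often this happens, here is a minimal Python sketch that pulls the reset counter and the timed-out ring name out of dmesg text. The function name and the regular expressions are my own and are written only against the log lines quoted above; other kernel versions may word these messages differently.

```python
import re

# Patterns written against the dmesg lines quoted in this report (an assumption:
# other kernels/drivers may phrase these messages differently).
RESET_RE = re.compile(r"amdgpu: GPU reset\((\d+)\) succeeded!")
TIMEOUT_RE = re.compile(r"\*ERROR\* ring (\S+) timeout")


def summarize_amdgpu_resets(dmesg_text):
    """Return (reset counters seen, rings that timed out) from dmesg text."""
    resets = [int(m.group(1)) for m in RESET_RE.finditer(dmesg_text)]
    rings = [m.group(1) for m in TIMEOUT_RE.finditer(dmesg_text)]
    return resets, rings


sample = """\
[Mar27 23:05] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6436600, emitted seq=6436602
[ +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
"""
print(summarize_amdgpu_resets(sample))  # ([10], ['gfx_0.0.0'])
```

To run it on a live system, the output of dmesg could be passed in as a string instead of the sample; reading dmesg may require elevated privileges depending on the distribution's settings.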
What is "too much"? For me, with 64 GB of RAM, it normally means something like running a Windows VM, watching an HD video, generating Upscale 2x with Performance = Quality in Fooocus, and running Upscayl, all at the same time. This is easy to avoid manually; I can just be careful when running Fooocus.
HOWEVER, you can also easily trigger it by giving Fooocus an input image that is quite big, even if the computer is doing nothing else. For example this one, 12 megapixels:
This is more annoying to avoid because sometimes you just want to drag and drop random shit from online into Fooocus and not have to worry about how big it is.
What should have happened?
Ideally, Fooocus should throw an exception in these cases, with something like "Out Of Memory" (or whatever the real reason is), rather than letting the GPU freeze up and reset. I'm not sure how feasible this is, however.
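To illustrate the kind of guard I have in mind, here is a rough Python sketch that checks an input image's dimensions against a pixel budget before any GPU work starts. Nothing here is Fooocus code: the names check_pixel_budget, fit_to_budget and InputTooLargeError are hypothetical, and the 4 MP default is an arbitrary placeholder, not a measured limit for this GPU. It also assumes that input size is the real trigger, which I have not confirmed.

```python
import math

# Hypothetical budget, chosen only for illustration (not a measured limit).
MAX_MEGAPIXELS = 4.0


class InputTooLargeError(ValueError):
    """Raised when an input image exceeds the configured pixel budget."""


def check_pixel_budget(width, height, max_megapixels=MAX_MEGAPIXELS):
    """Raise InputTooLargeError if width x height exceeds the budget."""
    megapixels = width * height / 1_000_000
    if megapixels > max_megapixels:
        raise InputTooLargeError(
            f"input is {megapixels:.1f} MP, budget is {max_megapixels:.1f} MP"
        )


def fit_to_budget(width, height, max_megapixels=MAX_MEGAPIXELS):
    """Return (width, height) scaled down to fit the budget, keeping aspect ratio."""
    budget = max_megapixels * 1_000_000
    pixels = width * height
    if pixels <= budget:
        return width, height
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(budget / pixels)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 12 MP input (4000 x 3000) is rejected by the check and shrunk by the fit.
try:
    check_pixel_budget(4000, 3000)
except InputTooLargeError as exc:
    print("rejected:", exc)
print(fit_to_budget(4000, 3000))
```

Either behaviour would be an improvement over a silent freeze: failing early gives a visible error, while downscaling lets drag-and-drop keep working at the cost of input resolution.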
What browsers do you use to access Fooocus?
Google Chrome
Where are you running Fooocus?
Locally
What operating system are you using?
Debian GNU/Linux
Console logs
The dmesg logs are above. As for Fooocus logs, Fooocus itself does not notice the problem, so there are none. The screen freezes, but you can run Fooocus inside a tmux session and attach to it over SSH to confirm that there are no logs and no errors: nothing is output on the Fooocus tmux console, even though dmesg says the GPU has already been reset. You can even tell Fooocus to quit with Ctrl-C after this; it reports that it is trying to exit, but it does not succeed and hangs until you restart the display server.
Additional information
No response