AMD GPUs are cursed for me

aksdb@lemmy.world · 10 months ago

AMD GPUs are cursed for me

bazsy@lemmy.world · 10 months ago

Did you check the system logs to see what caused it?

Many things can result in seemingly random crashes. Any overclock (including XMP and EXPO), an undervolt, or even a particular BIOS version can be problematic.

I would first check whether it’s stable on Windows.

Captain Janeway@lemmy.world · 10 months ago

It’s not stable on Windows either. But I haven’t looked at logs because I didn’t really know what - or how - to check.

bazsy@lemmy.world · 10 months ago

Most distros use systemd and its logging solution, journald. You can use journalctl to read the logs around the time of the crash, for example:

journalctl -S -5m shows the last 5 minutes. Use this when a game crashes but the system keeps working and does not reboot.
journalctl -b -1 -S -10m shows entries from the previous boot within the last 10 minutes. Use this if the crash froze the whole system and forced a reboot. The time window is relative to now, so run it soon after rebooting.

Look for red lines (errors) and note what wrote them. AMD GPU faults usually mention ‘amdgpu’, while memory errors can appear as ‘protection fault’.
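
If scrolling through the output by eye gets tedious, the same filtering can be roughly sketched in Python. This is only an illustration: the names `suspicious_lines` and `read_journal` and the keyword list are made up here, and the only journalctl options assumed are `--since` (the long form of `-S`), `-b` and `--no-pager`.

```python
import subprocess

# Keywords from the advice above. Matching is case-insensitive,
# which is an assumption made for this sketch.
KEYWORDS = ("amdgpu", "protection fault")


def suspicious_lines(lines, keywords=KEYWORDS):
    """Return the log lines that mention any of the keywords."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line.rstrip("\n"))
    return hits


def read_journal(since="-5m", previous_boot=False):
    """Run journalctl and return its output as a list of lines.

    Returns an empty list if journalctl is not installed.
    """
    cmd = ["journalctl", "--no-pager", "--since=" + since]
    if previous_boot:
        # Same as `journalctl -b -1`: restrict to the previous boot.
        cmd += ["-b", "-1"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()
```

Calling `suspicious_lines(read_journal())` right after a game crash, or `suspicious_lines(read_journal("-10m", previous_boot=True))` after a forced reboot, should leave only the lines worth reading closely.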

Captain Janeway@lemmy.world · 10 months ago

journalctl -S -5m

Looks like these are the errors I’m seeing. I know it’s not helpful to just drop this in the chat, but I’m doing it for posterity (and to let you know your comment did in fact help me)!

Feb 04 16:47:40 computer kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Feb 04 16:47:40 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=17063130, emitted seq=17063132
Feb 04 16:47:40 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 161654 thread redDispatcher9 pid 161668
Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx

bazsy@lemmy.world · edit-2 10 months ago

Happy to help! Though you are right: this is a rather generic error that doesn’t help much; it just confirms that the GPU is the issue.

At this point it could be a driver issue since there are similar open bug reports. A hardware problem is still possible since you previously said that it’s unstable on Windows too, and power-related issues can also lead to this error message.

Captain Janeway@lemmy.world · edit-2 9 months ago

EDIT: Tentative solution: CoreCtrl

CoreCtrl allowed me to underclock my Radeon 5600XT GPU (currently set values to GPU 800MHz and memory set to 500MHz). I say “tentative” because this problem has been persistent for years, but I’ve been running Cyberpunk for 1 hour at 60FPS on High settings (and mostly 60FPS on Ultra, but I had some FPS drops). Even if this solution isn’t 100% perfect, I think some combination of changing the GPU values is probably going to make my rig much more functional.

I found CoreCtrl based on a Reddit thread last night but didn’t have time to test it until this evening after work. Seems to have made a world of a difference.

Yeah, I’ve tried just about every feasible kernel parameter for the amdgpu module, updated my kernel to 6.2 on Linux Mint, and tried several different BIOS settings. My system runs everything reasonably well. Even Cyberpunk 2077 generally runs at 60FPS. But after about 5 minutes of gaming in Cyberpunk 2077, it crashes. Other games last longer, which is why I use Cyberpunk 2077 to stress test my system.

These are my system specs:

PSU: 850 Watt 80 PLUS Gold Fully Modular ATX
CPU: AMD Ryzen 7 2700 Eight-Core Processor × 8
GPU: Radeon 5600XT
RAM: G-SKill DDR4-3600 CL16-19-19-39 1.35V (2x16GB = 32GB total system memory)
SSD: Samsung (MZ-V7E500BW) 970 EVO SSD 500GB - M.2 NVMe
MOBO: Asus x470 Pro
Other: TP-Link AC1200 PCIe WiFi Card for PC (Archer T5E) - Bluetooth 4.2, Dual Band Wireless Network Card installed in PCIEx1_3 which seems like it could be a variable I should remove, but I’ve tried removing it and didn’t see any changes in behavior. I’ve tried various PCIEx1_* slots with similar results.

I don’t really see where I might be going wrong here. I bought this all ~4 years ago and I’ve always had these intermittent crashes. It’s admittedly worse on Linux, but it still occurred on Windows.

Anyways, I spent about 5 hours last night reading bug forums, testing various amdgpu module parameters and BIOS settings, and even re-configuring my fans to provide (potentially) better cooling. None of this really made a difference. I run two 1080p monitors (not exactly breaking the bank here). I had a lot of hope regarding one forum thread about ring gfx_1.0.0 errors related to how AMD reads the GPU in Linux. My graphics card is detected as: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] and apparently some machines used to accidentally use the total allocated memory for the 5700XT instead of the 5600XT. This resulted in some form of corrupt memory allocation. That sort of behavior would make sense for my system since it runs well, but just fails suddenly.

Other errors I’ve seen are:

Feb 04 20:17:01 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=116669, emitted seq=116671
Feb 04 20:17:01 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3668 thread redDispatcher12 pid 3684
...
Feb 04 20:26:16 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=34068, emitted seq=34071
Feb 04 20:26:16 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4208 thread redDispatcher13 pid 4232
Feb 04 20:26:17 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
...
Feb 04 21:00:43 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.3.0 timeout, signaled seq=3085, emitted seq=3086
Feb 04 21:00:43 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3771 thread redDispatcher8 pid 3783
...
Feb 04 22:28:50 computer kernel: [drm:amdgpu_device_ip_early_init [amdgpu]] *ERROR* early_init of IP block  failed -19
Feb 04 22:28:50 computer kernel: [drm:amdgpu_device_ip_early_init [amdgpu]] *ERROR* early_init of IP block  failed -19
Feb 04 22:36:57 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=171774, emitted seq=171776
Feb 04 22:36:57 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4122 thread redDispatcher5 pid 4131
...
Feb 04 22:45:46 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
Feb 04 22:45:56 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:80:crtc-1] hw_done or flip_done timed out
Feb 04 22:46:19 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=123, emitted seq=124
Feb 04 22:46:19 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4187 thread redDispatcher8 pid 4202
...
Feb 04 23:49:45 computer kernel: [drm:gfx_v10_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Feb 04 23:49:45 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=435155, emitted seq=435157
Feb 04 23:49:45 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3668 thread redDispatcher12 pid 3690
...
Feb 04 23:58:58 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=66268, emitted seq=66270
Feb 04 23:58:58 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4180 thread redDispatcher11 pid 4196
Feb 04 23:58:58 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out

^ These are all errors which occurred from various tests of amdgpu module settings and/or BIOS settings. The common thread is some form of ring XXXX timeout.
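
To see which rings time out most often, the amdgpu_job_timedout lines can be tallied with a few lines of Python. This is a rough sketch, not anything from the driver or its tooling: the function name `count_ring_timeouts` is hypothetical, and the regular expression simply assumes the "ring ... timeout, signaled seq=..., emitted seq=..." wording seen in the logs above.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... *ERROR* ring gfx_0.0.0 timeout, signaled seq=116669, emitted seq=116671
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def count_ring_timeouts(lines):
    """Count timeout messages per ring name (e.g. gfx_0.0.0, comp_1.3.0)."""
    counts = Counter()
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            counts[match.group("ring")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=116669, emitted seq=116671",
        "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.3.0 timeout, signaled seq=3085, emitted seq=3086",
    ]
    for ring, count in count_ring_timeouts(sample).most_common():
        print(ring, count)
```

Feeding it the saved journalctl output would show whether the timeouts cluster on the graphics ring or are spread across the compute rings as well.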

These two threads seemed like my best chance, but their proposed solutions didn’t help:

Captain Janeway@lemmy.world · 10 months ago

Thank you! This is super helpful.