Nonstop system freezes/crashes since switching to AMD
brokeassben Jun 6, 2021
Hey all,
I'm absolutely no expert, but I've been using Linux since 2003, can generally troubleshoot on my own, and do IT for a living. That said, I am at my damn wits' end trying to solve this issue. Since building my PC last year and moving from Intel + Nvidia, I have had very frequent system freezes that require hard shutdowns. I'd generally figure this was defective hardware if I hadn't seen a number of people with similar or roughly the same components going through the same thing and posting about it on Reddit and various other places. If anyone has experienced this same thing and figured out a solution, I'd really appreciate any help.

What triggers the freeze:
Happens randomly--sometimes as early as the login screen, sometimes with only a browser open, sometimes not until after many hours of use.
Can usually trigger a freeze faster by running a game through Proton, but that can also be very inconsistent.

So things I've tried:
Updating BIOS (I've updated the BIOS more times than I can recall at this point)
Disabling C-states in BIOS (this was suggested to be the solution by several people on Reddit)
Updating to a newer kernel (tried up to 5.12 with no success)
Updating to a newer mesa
SSH-ing into the computer to capture logs (nothing helpful in them)
Installing a different Linux OS (Fedora, Arch, newest Ubuntu)
Installing Windows 10 (runs consistently without issue, but I obviously don't want to run Win)

System info:
Ryzen 9 3900x
Radeon RX 5700 XT
ASUS TUF Gaming B550M-Plus (Wi-Fi)

Thanks!
Ben
Liam Dawe Jun 6, 2021
I would seriously suggest checking the RAM. Most of the time this is a RAM problem.
HyperRealisticRock Jun 7, 2021
In my experience it's almost always power: either a loose connector, a pin not seated, or the PSU itself.
Xpander Jun 7, 2021
Yeah, I think it's most likely either a PSU or a RAM issue.
Though you say Windows runs without issues, which is a bit weird.
What Aquabat suggested is worth a try though.

Last edited by Xpander on 7 June 2021 at 7:01 am UTC
damarrin Jun 7, 2021
What others have said, though do try a different gfx card (nvidia) if you can. I've had so many problems with AMD gfx over the years... /o\
Whitewolfe80 Jun 7, 2021
What temps is your CPU hitting? Have you overclocked? I experienced something similar about a year ago with a 2600, and it turned out the CPU was hitting the high 90s and my motherboard was spiking to over 100. A really stupid mistake on my part regarding voltage and core clock settings.

Last edited by Whitewolfe80 on 7 June 2021 at 1:10 pm UTC
brokeassben Jun 7, 2021
What temps is your CPU hitting? Have you overclocked? I experienced something similar about a year ago with a 2600, and it turned out the CPU was hitting the high 90s and my motherboard was spiking to over 100. A really stupid mistake on my part regarding voltage and core clock settings.
I haven't overclocked since it hasn't run stably at all with the exception of windows, but I REALLY don't want to dual-boot. It generally maxes out around 65 when gaming for long stretches--turns out water cooling is overhyped and good airflow is just as good in many cases.

I would seriously suggest checking the RAM. Most of the time this is a RAM problem.
The RAM (as well as the PSU) is repurposed from my previous build, has been re-seated, is listed as compatible with the motherboard and passes both memory tests I've tried...which I know doesn't always mean it's good RAM. I should probably just invest in some faster RAM and splurge on 32GB.

Another thing you could try is adding these options to grub:

processor.max_cstate=1 rcu_nocbs=0-23 idle=nomwait
I've tried this out and haven't had any crashes after a couple hours of gaming, which makes me cautiously hopeful! I'll try a few more games after work today to see how it goes. My hopes have been dashed a few times with other attempts at fixes.
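
A GRUB edit has no effect until the GRUB config is regenerated and the machine is rebooted, so it can be worth confirming that the options quoted above actually reached the running kernel. Below is a minimal Python sketch of such a check. The helper names are hypothetical, the parser is deliberately naive (it ignores quoted values containing spaces), and `0-23` is assumed to be meant to cover all 24 threads of a 3900X.

```python
import os
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict) -> list:
    """Return the names in `wanted` that are absent or set to another value."""
    active = parse_cmdline(cmdline)
    return [name for name, value in wanted.items() if active.get(name) != value]


def rcu_nocbs_range(cpu_count: int) -> str:
    """Build a CPU list covering all logical CPUs, e.g. 24 threads -> '0-23'."""
    return f"0-{cpu_count - 1}"


def read_running_cmdline() -> str:
    """Read the parameters the running kernel was booted with (Linux only)."""
    return Path("/proc/cmdline").read_text()


if __name__ == "__main__":
    wanted = {
        "processor.max_cstate": "1",
        "rcu_nocbs": rcu_nocbs_range(os.cpu_count() or 1),
        "idle": "nomwait",
    }
    # Illustrative command line; on a real system pass read_running_cmdline().
    sample = "quiet splash processor.max_cstate=5 idle=nomwait"
    print("missing or different:", missing_params(sample, wanted))
```

An empty list from `missing_params(read_running_cmdline(), wanted)` would suggest the options are active for the current boot.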

A weird added detail--my system is 100% stable while running CS:GO 🤷‍♂️

I really appreciate all of you taking the time to respond!

Ben
dvd Jun 7, 2021
For my Ryzen, disabling C6 fixed the problem too (I have an old 1300X). Weirdly, I also read somewhere that the C-state issues were fixed in newer hardware, but I have C6 disabled in both the BIOS and the kernel options.
(Seems like people still had problems with C6 last year: https://bugzilla.kernel.org/show_bug.cgi?id=206487)
tuubi Jun 8, 2021
remember Linux power consumption/usage are not as good as Windows
Just to clarify, some Linux drivers and software aren't as power-efficient as some drivers and software on Windows. If Linux wasted more power than Windows in general, you can bet your butt it wouldn't be so popular in server rooms and on embedded devices.

Sorry for going off topic.
RossBC Jun 8, 2021
Try disabling core boost and precision boost in your bios.
Run your computer and see if it crashes.
I have a similar problem with my system: for one reason or another, the motherboard isn't managing the voltages properly.
The stock overclock (core boost and precision boost) was on by default.
I guess it could also be the CPU, but I found that killing the motherboard's stock OC fixed the problem for me.
Running an R7 5800X.
Pangaea Jun 8, 2021
Your problem sounds a lot like what was reported in this rather big thread: https://www.gamingonlinux.com/forum/topic/4128

It may not be a client-side problem, as @The_Aquabat is also implying.
brokeassben Jun 9, 2021
Well shit. It had a few hours of stability followed by several freezes. Briefly thought that @TheAquabat had the answer with this:
processor.max_cstate=1 rcu_nocbs=0-23 idle=nomwait
or
processor.max_cstate=5 rcu_nocbs=0-23 idle=nomwait
Thanks, though. Here's my dmesg output in hopes someone can see something obvious I'm missing:
https://pastebin.com/2pgiJ52y

Try disabling core boost and precision boost in your bios.
Got a crash within about 5 min :(

I would seriously suggest checking the RAM. Most of the time this is a RAM problem.
Tried dropping the RAM to its base frequency as well as max. Sorta took your advice and am replacing the RAM out of desperation for a stable computer. It's been a frustrating experience.

Last edited by brokeassben on 9 June 2021 at 4:53 am UTC
Xpander Jun 9, 2021
The dmesg log you posted doesn't have anything weird in it, except that your drive is failing: ext4 errors.
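
For anyone wanting to check their own log the same way, here is a rough Python sketch that counts storage-related error lines per device in saved dmesg output. The two patterns are an assumption about what such lines typically look like, not an exhaustive list, and the function name is made up for illustration.

```python
import re
import sys

# Patterns that commonly accompany storage trouble in kernel logs.
# Assumed for illustration; real logs may use other wording.
PATTERNS = [
    re.compile(r"EXT4-fs error \(device (?P<dev>[^)]+)\)"),
    re.compile(r"I/O error, dev (?P<dev>\w+)"),
]


def storage_errors(log_text: str) -> dict:
    """Count matching lines per device name as it appears in the log.

    A partition (e.g. sdb1) and its parent disk (e.g. sdb) are counted
    separately because the kernel names them differently in each message.
    """
    counts = {}
    for line in log_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                dev = match.group("dev")
                counts[dev] = counts.get(dev, 0) + 1
                break  # count each line at most once
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: save the log first, then pass the file path as the argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for device, count in sorted(storage_errors(handle.read()).items()):
            print(f"{device}: {count} error line(s)")
```

Saving the log with `dmesg > dmesg.txt` and passing that file to the script lists which devices the errors cluster on.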
Liam Dawe Jun 9, 2021
the dmesg log you posted doesnt have anything weird except your drive is failing..ext4 errors
Ouch, didn't even think of that. A failing drive could even be an issue.
brokeassben Jun 9, 2021
the dmesg log you posted doesnt have anything weird except your drive is failing..ext4 errors
Ouch, didn't even think of that. A failing drive could even be an issue.
That might explain why Windows is stable since it can't mount that drive. Might also explain why Steam's shader caching will hang for hours at a time. After several years of help desk work, that should have occurred to me 🤦‍♂️ Also just dawned on me that I've had the two platter drives in three different computer builds since 2013 and they've gotten A LOT of use. SMART reporting shows OK health, but several errors including many marked as "pre-fail." Well I feel like an idiot. Going to disconnect the drive and see if that solves all of my woes.
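
A note on reading the SMART table: in `smartctl -A` output, "Pre-fail" in the TYPE column is the category of the attribute, not a verdict on the drive. An attribute counts as failed when its normalized VALUE is at or below THRESH, or when WHEN_FAILED is not "-". The Python sketch below parses a saved attribute table along those lines. The function names are hypothetical, and treating any non-zero raw count on attributes 5, 197 and 198 as a warning sign is a common heuristic assumed here, not a rule from the tool.

```python
from typing import List, NamedTuple


class Attr(NamedTuple):
    id: int
    name: str
    value: int        # normalized value, higher is healthier
    thresh: int       # vendor threshold for the normalized value
    type: str         # "Pre-fail" or "Old_age": a category, not a status
    when_failed: str  # "-" if the attribute has never failed
    raw: str


# Reallocated, pending, and offline-uncorrectable sector counts.
# Flagging any non-zero raw value is a heuristic assumed for this sketch.
WATCHED_RAW_IDS = {5, 197, 198}


def parse_attributes(text: str) -> List[Attr]:
    """Parse the attribute rows of a `smartctl -A` table, skipping other lines."""
    attrs = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not (fields[3].isdigit() and fields[5].isdigit()):
            continue
        attrs.append(Attr(int(fields[0]), fields[1], int(fields[3]),
                          int(fields[5]), fields[6], fields[8],
                          fields[9].strip()))
    return attrs


def flagged(attrs: List[Attr]) -> List[str]:
    """Return names of attributes that have failed or show bad raw counts."""
    names = []
    for a in attrs:
        failed = a.when_failed != "-" or (a.thresh > 0 and a.value <= a.thresh)
        tokens = a.raw.split()
        first = tokens[0] if tokens else ""
        bad_raw = a.id in WATCHED_RAW_IDS and first.isdigit() and int(first) > 0
        if failed or bad_raw:
            names.append(a.name)
    return names
```

With the output of `smartctl -A` for the drive saved to a text file, `flagged(parse_attributes(text))` gives a short list of the attributes worth a closer look. An attribute that is merely of type "Pre-fail" with a healthy value does not appear in it.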