If you are a laptop user, you may want to avoid driver version 550 because there is a megathread on the NVIDIA forum where users are reporting that this version is making their distributions randomly crash.
Photos from my personal laptop last night after doing some reboots just for fun and testing (Arch Linux 6.9.1-zen + nvidia-dkms 550):
Crash right after decrypting the disk
Crash right before trying to load sddm
It does not follow a pattern. Sometimes the laptop boots, sometimes it crashes. I would say it crashed on about 40% of my boot attempts this week.
The way it happens, you can't say for sure where the `nvidia` module is poking around, because sometimes the crash is related to `vfs` and sometimes to `rcu_schedule`. It happened to me right after decrypting my secondary NVMe, but it also happened right before trying to launch `sddm`, so it is pretty random. Some folks reported it happened during upgrades when `udev` was triggered, but I haven't hit that more catastrophic version of the problem (yet), where the crash hits a running system and may break the initrd or the package database.
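If your system keeps a persistent journal, the kernel log of the previous boot may show which of these signatures you hit. This is a minimal sketch: the `filter_crash_lines` helper and its pattern list are my own assumptions based on the symptoms above, not an official list, and a hard freeze early in boot may never reach the journal at all.

```shell
#!/usr/bin/env bash
# Read kernel log text on stdin and keep only the lines that mention the
# nvidia module (NVRM is the prefix the driver uses in kernel messages)
# or one of the crash signatures named in the post.
# The pattern list is an assumption for illustration.
filter_crash_lines() {
    grep -E 'nvidia|NVRM|rcu_sched|__die_body|vfs' || true
}

# Typical use, assuming systemd-journald with persistent storage:
#   journalctl -k -b -1 --no-pager | filter_crash_lines
```

The `|| true` only keeps the helper from returning an error when nothing matches, so it is safe to use in scripts that run with `set -e`.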
The available options to mitigate this situation are:
- Revert to version 545 or 535 with `dkms`.
- Use the beta version 555. It looks like it is stable enough from the standpoint of this bug.
- Use the following workaround, which involves disabling `udev` triggers and setting some kernel command-line parameters related to GSP firmware.
Nvidia Thread: Series 550 freezes laptop
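Before picking one of these options, it helps to confirm which driver series is actually installed. A small sketch, assuming the `nvidia` module is installed for the running kernel; the helper names `driver_series` and `is_550_series` are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the major series of a driver version string, e.g. "550.78" -> "550".
driver_series() {
    printf '%s\n' "${1%%.*}"
}

# Succeed (exit status 0) when the version belongs to the 550 series
# that the forum thread is about.
is_550_series() {
    [ "$(driver_series "$1")" = "550" ]
}

# Usage (these commands need the driver and DKMS to be installed):
#   ver="$(modinfo -F version nvidia)"
#   if is_550_series "$ver"; then echo "driver $ver is in the 550 series"; fi
#   dkms status         # lists the module versions DKMS has built
#   cat /proc/cmdline   # shows the kernel command line currently in effect
```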
Quoting: Luke_Nukem
Zero problems on 3 of my laptops with fedora 40 and the rpmfusion version or manual install. No issues with 555 beta either (which runs damned smooth).
Yup. Fedora seems to have patches to its kernel that mitigate the situation.
There is one Debian report where the guy stated that using the open kernel drivers fixed the issue, and multiple reports for openSUSE where it crashes during shutdowns only, and where `systemctl reboot` does not reproduce the issue.
Did you have the chance to review `audit.log` to check whether any protection (SELinux, fapolicyd) was triggered and could be limiting the `nvidia` module for good?
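A rough way to do that check, as a sketch: SELinux denials are logged as `type=AVC` records and fanotify-based decisions (which fapolicyd uses) as `type=FANOTIFY` records. The `audit_candidates` helper and the default path are assumptions for illustration, and the output is only a list of candidate records to read through, not proof that the `nvidia` module was blocked.

```shell
#!/usr/bin/env bash
# Print audit records that could indicate a blocked operation.
# $1: path to an audit log; defaults to the usual auditd location.
audit_candidates() {
    _log="${1:-/var/log/audit/audit.log}"
    # AVC = SELinux access vector decisions, FANOTIFY = fanotify responses.
    grep -E 'type=(AVC|FANOTIFY) ' "$_log" || true
}

# Usage (reading the audit log normally requires root):
#   sudo bash -c "$(declare -f audit_candidates); audit_candidates" | grep -i nvidia
# Alternatively, with the audit userspace tools installed:
#   sudo ausearch -m AVC,FANOTIFY -ts boot -i
```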
Quoting: nwildner
Quoting: Luke_Nukem
Zero problems on 3 of my laptops with fedora 40 and the rpmfusion version or manual install. No issues with 555 beta either (which runs damned smooth).
reports for openSUSE where it crashes during shutdowns only, and that systemctl reboot does not reproduces the issue
I find the openSUSE Tumbleweed report somewhat strange, as the user runs kernel 6.8 while an up-to-date TW should run 6.9.3.
That leads me to suspect that the user might be missing a few patches/fixes, in which case issues are to be expected.
Edit: Oh, the report is from April 22nd. Never mind.
But why should this be limited to Laptops only?
Last edited by Vortex_Acherontic on 10 June 2024 at 12:32 pm UTC
Quoting: Vortex_Acherontic
But why should this be limited to Laptops only?
If you look at that thread, the vast majority of the reports come from folks that are running Nvidia on laptops.
Maybe it is just my impression, but it looks like laptops are more prone to that bug, or crashes on PC towers are simply not being reported.
If you look at the reports you will find ASUS TUF, Legion, HP ZBook, Lenovo X1 Extreme Gen4, and there is one guy saying that he didn't face the same problem on his PC tower.
Good to see that there is some discussion going on in the NVIDIA forums.
Anyway, for me it was fixed by enabling the following services:
sudo systemctl enable nvidia-hibernate.service
sudo systemctl enable nvidia-resume.service
sudo systemctl enable nvidia-suspend.service
sudo systemctl enable nvidia-persistenced.service
sudo systemctl enable nvidia-powerd.service
Maybe it helps someone and, like me, you can avoid switching to the beta drivers.
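To see which of those units are already enabled before changing anything, something like the following should work. It is only a sketch; `report_units` is a made-up helper name, and which of these units exist depends on how your distribution packages the driver.

```shell
#!/usr/bin/env bash
# For each unit name given as an argument, print what
# `systemctl is-enabled` reports for it.
report_units() {
    for unit in "$@"; do
        # is-enabled exits non-zero for disabled or missing units,
        # so the exit status is ignored and only the text is kept.
        state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
        printf '%s: %s\n' "$unit" "${state:-unknown}"
    done
}

# Usage:
#   report_units nvidia-hibernate.service nvidia-resume.service \
#       nvidia-suspend.service nvidia-persistenced.service nvidia-powerd.service
```

Units that print `unknown` produced no output at all, which usually means they are not installed on that system.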
Last edited by Trias on 11 June 2024 at 7:32 am UTC
Quoting: Trias
It's sometimes very interesting (and frightening) to read through some logs. Like, look at Liam's first picture. If my notebook reported something like "[ 3.318673] ? __die_body.cold+..." I would sure be frightened... :).
That was MY laptop, not Liam's. :)
And yeah, it is frightening to see a `__die_body.cold+0x8` all the time and to have to long-press the power button on at least 40% of boot attempts until you get a sane boot.