I'm running Debian 12 with MATE on a Ryzen 9 7900, a PRIME B650M-A WIFI II motherboard and 32 GB of RAM. When I built the system 1.5 years ago, I had stability issues that went away when I lowered the RAM clock below its rated speed. Recently, though, I've started getting spontaneous reboots: at first about once a week, and now every few hours.
I have checked the voltages and the AC component (ripple), and the power behaves just fine. I ran memtest86 for a few hours and it reported no problems. I've also tested with Prime95: tests 1 and 2 (Smallest and Small FFT) run fine, but the other tests get killed by the kernel for allocating too much memory if I run them on more than one core. On a single core they run fine. "memtester" (a CLI tool for testing part of the memory from within the OS) found no problems either, and neither did stress testing with S-TUI.
The reboots are always random and never happen under load. Most guides on checking logs were written before journalctl existed, and I'm having a hard time finding anything in any log (suggestions would be helpful!). "last reboot" doesn't say much beyond "crash". I've tried "journalctl | grep [something]" for terms like kernel panic, temperature, crash, etc., and found nothing related. I also installed "kdump-tools" and "kexec-tools", but there is simply no dump in /var/crash after a reboot. So I can't tell whether it's a hardware issue or a software one.

High load and memory problems don't seem to cause it, and the power supply delivers clean DC at the specified voltages. I've reseated the RAM just in case, along with every other connector I could find (SATA, etc.). I have no cards installed and I'm using the integrated graphics. Of course I haven't tested all the buck regulators on the motherboard, so I only know that the ATX voltages are good. But again, the reboots happen during normal use, and stress testing does not trigger them.
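For anyone trying the same kind of search, here is a minimal Python sketch that limits it to the boot that ended in the reset instead of grepping the whole journal. It assumes a persistent journal (otherwise "journalctl -b -1" has no previous boot to show) and read access to the system journal. The keyword list is just the terms mentioned above, and the function names are illustrative, not from any existing tool.

```python
import subprocess
import sys

# Illustrative keyword list: only the terms already searched for above.
KEYWORDS = ("kernel panic", "temperature", "crash")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, line) pairs containing any keyword, case-insensitively."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            hits.append((number, line.rstrip("\n")))
    return hits


def previous_boot_log():
    """Read the journal of the previous boot.

    Assumes a persistent journal; reading it usually needs root or
    membership in the adm or systemd-journal group.
    """
    result = subprocess.run(
        ["journalctl", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def main():
    try:
        log = previous_boot_log()
    except (OSError, subprocess.CalledProcessError) as err:
        # journalctl missing, no previous boot recorded, or no permission.
        print(f"could not read previous boot: {err}", file=sys.stderr)
        return
    for number, line in find_suspect_lines(log):
        print(f"{number}: {line}")
    # The last entries before the reset can be more telling than keyword hits.
    print("--- last 20 lines of previous boot ---")
    for line in log[-20:]:
        print(line)


if __name__ == "__main__":
    main()
```

If the previous boot simply stops mid-stream with nothing unusual in its last lines, that at least shows the kernel never got a chance to log anything before the reset.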
What should I do next? This is infuriating, since stability is all I want. Nothing is overclocked (and never has been), and right now my RAM is underclocked as a test. At random intervals (today's record: five times) the screen goes blank and the system restarts. I had hoped to get a clue from the logs, such as a sensor reading or something, but I don't really know what to look for.
Edit: I'm someone who typically buys a new computer once a decade. This UEFI BS is new to me, very confusing, and seems like a grift to push insecure proprietary BS into the system. But if anyone knows of anything to check there, let me know.
Statistics: Posted by joga — 2024-11-13 00:55 — Replies 4 — Views 95