12

I have a custom-built PC (assembled from individual components, though not by me). The hardware specifications are as follows:

  • CPU: AMD Ryzen 5 1600
  • RAM: 16GB DDR4 (2x8GB) / 2400 MHz
  • GPU: NVIDIA GeForce RTX 2060
  • SSD: Toshiba TR200 240GB
  • HDD: Seagate 1TB

The system crashes frequently and randomly—sometimes while gaming, sometimes while simply browsing Google, and even when copying files from one location to another.

I have analyzed the crash dumps from Windows using WinDbg, and most of them point to issues related to the NVIDIA GPU or its drivers. However, some also indicate potential problems with the SSD. To rule out disk-related issues, I ran CrystalDiskInfo, and both the SSD and HDD appear to be in good condition.

To further investigate, I attempted to boot several Linux live distributions from a USB drive, but all of them crashed as well, resulting in a kernel panic. The affected modules listed in the crashes vary and include NVIDIA drivers, sound drivers, Wi-Fi drivers, etc. The crashes in Linux have occurred even when performing simple tasks, such as copying files or merely opening the terminal without executing any commands.

Linux often reports errors such as "Linux watchdog bug: soft lockup CPU# stuck for X seconds," and after each crash, it references the motherboard and its version: Micro-Star International MS-7A38 B450M PRO-VDH MAX, dated 07/11/2019.

The PC does not overheat, and the crashes persist even when booting into Linux live environments with failsafe modes. Notably, the system does not shut down or restart abruptly—it simply crashes, displaying either a kernel panic (in Linux) or a blue screen (in Windows).

As a next step, I plan to update the BIOS, disconnect both the SSD and HDD, and install a new NVMe SSD with a fresh installation of Windows 11. Unfortunately, I cannot remove the NVIDIA GPU, as the system lacks integrated graphics. Beyond this, I am unsure what else to test, as the crashes appear to be unrelated, with no clear pattern.

Do you have any insights into what might be causing the issue? What additional troubleshooting steps would you recommend?

Update: Many people suggest that the issue could be related to the RAM. I ran MemTest86+, but no errors were detected. I'm running it more times to be sure. Additionally, I booted a newer Linux live environment (Kubuntu 24.04) than the two I had tested previously, and it did not crash. However, since the crashes are completely random, this could just be a coincidence. To be sure, I used the system normally, performing various tasks, and it still did not crash.

I plan to test a different graphics card and update the BIOS. Unfortunately, I am unable to test another CPU or PSU. In the BIOS, all overclocking settings, including timing adjustments, are set to default. There does not appear to be any active overclocking.

FINAL UPDATE: The problem was the RAM, as many people here suggested. I ran MemTest86+ again and it reported more than 25k errors. After some tests I was able to identify which slot was causing the problems and replace it. After that I had another, unrelated problem with the BIOS, so I couldn't finish all the tests and tell you whether it was the correct answer, but it is finally working. Thank you so much.

4 Answers

19

Random crashes occurring in apparently unrelated places are almost always memory. And surprisingly, memory does deteriorate over time. The first thing that I'd do in this case is replace your memory with fresh-bought stuff; 16GB of DDR4 is a relatively inexpensive test.

But as DrMoishe Pippik pointed out in his comment, if your machine was built by a gamer, it could easily be overclocked. A very good thing to do would be to check the BIOS / UEFI pages for CPU speed and memory timings and see whether anything has been overclocked. There is an NVIDIA Control Panel applet, which you can usually reach with a right-click on the desktop, that will tell you if the GPU or video memory is being overclocked as well.

In both places, if you have an option for "optimum speed", that would be the one to select.

Giacomo1968
tsc_chazz
9

This isn't meant to be a panacea, but your problem resonates a lot with one I had previously: Proxmox doesn't boot after CPU change

In that question, I had changed the CPU because I was experiencing system crashes similar to yours, and the new CPU didn't boot at all. In the end the motherboard was bad.

You should ensure you've tried all the troubleshooting steps:

  1. Do a 24-hour memtest run.
  2. Confirm your graphics drivers are up to date.
  3. Upgrade your UEFI firmware to the latest version.
  4. The previous step will wipe your overclocking settings, if any, which you need to do anyway.
  5. Test the CPU in another PC (most AM4 motherboards accept the Ryzen 5 1600 without special considerations, although many 500-series boards do not officially support first-generation Ryzen).
  6. Try another video card. It can be the absolute bottomest of the bin; you just need something to get video out. Borrow a friend's, or acquire some flavor of GT 730 from somewhere.
  7. Alternatively, try your own video card in another PC to confirm it isn't to blame.

These steps should enable you to find exactly which component is to blame by finding the common element between the configurations that crash and the ones that don't. In my case, it turned out that the motherboard was defective, which was not on my bingo card because I thought of motherboards as just "dumb" components.
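The "common element" idea can be written down as simple bookkeeping. The following is only a rough sketch: the part labels and the four test runs are made up for illustration, and the function name `suspects` is my own. It keeps the parts that were installed in every crashing run and removes any part that was also present in a stable run.

```python
from typing import FrozenSet, List, Set, Tuple

# One record per test: (parts installed during the test, did it crash?).
# The labels and results below are hypothetical, not real measurements.
TestRun = Tuple[FrozenSet[str], bool]


def suspects(runs: List[TestRun]) -> Set[str]:
    """Parts present in every crashing run and absent from every stable run."""
    crashing = [set(parts) for parts, crashed in runs if crashed]
    stable = [set(parts) for parts, crashed in runs if not crashed]
    if not crashing:
        return set()
    # Start from what all crashing runs have in common...
    common = set.intersection(*crashing)
    # ...then clear anything that also took part in a stable run.
    for parts in stable:
        common -= parts
    return common


runs: List[TestRun] = [
    (frozenset({"cpu", "board", "ram_a", "ram_b", "gpu"}), True),
    (frozenset({"cpu", "board", "ram_a", "gpu"}), False),
    (frozenset({"cpu", "board", "ram_b", "gpu"}), True),
    (frozenset({"cpu", "board", "ram_b", "spare_gpu"}), True),
]
print(sorted(suspects(runs)))  # ['ram_b']
```

Because the crashes are random, a run should only be recorded as stable after a long test; a short crash-free session can wrongly clear a faulty part.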

9

You are trying to rely exclusively on software to diagnose a hardware issue.

Physical diagnosis is needed.

Perform these one bullet point at a time. For example, don't swap your RAM and reseat your GPU at the same time because if the issue goes away then which action gets the credit? After performing the RAM steps, then return them to their original configuration before proceeding to the GPU steps.

Note that after performing each bullet point, you should try to reproduce the issue by using the computer.

RAM:

  • Try removing one stick
  • Try removing the other stick
  • Swap the sticks
  • Try using one stick in each RAM slot
  • Try using the other stick in each RAM slot
  • Exhaust your possible RAM configuration options
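To make "exhaust your options" concrete, here is a small sketch that prints a checklist of every stick/slot arrangement. The stick labels and slot names are placeholders I picked for illustration; use whatever is printed on your modules and on the board.

```python
from itertools import combinations, permutations


def ram_configurations(sticks, slots):
    """List every placement of one or more sticks into distinct slots.

    Each configuration is a dict mapping slot name -> stick label.
    """
    configs = []
    for count in range(1, min(len(sticks), len(slots)) + 1):
        for chosen in combinations(sticks, count):
            for placement in permutations(slots, count):
                configs.append(dict(zip(placement, chosen)))
    return configs


# Hypothetical labels: two sticks and two slots.
checklist = ram_configurations(["stick_1", "stick_2"], ["slot_A", "slot_B"])
for number, config in enumerate(checklist, start=1):
    layout = ", ".join(f"{slot}={stick}" for slot, stick in sorted(config.items()))
    print(f"[ ] {number}. {layout}")
```

With two sticks and two slots this gives six configurations: four single-stick tests and two dual-stick arrangements. With four slots it grows to twenty, so it pays to test the single-stick cases first.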

GPU:

  • Try reseating the GPU
  • Borrow a GPU or buy a cheap one to swap in for a few days
  • Re-apply quality thermal paste and pads

GPUs don't always have temperature sensors on all of their components, so if one of those components is actually overheating, monitoring software may never show it.

CPU:

  • Try reseating the CPU
  • Inspect the bottom of the CPU for scratches
  • Inspect the motherboard pins
    • You may find that whoever built your PC bent some motherboard pins
    • Bent pins could cause erratic crashing due to thermal expansion/contraction
    • Bent pins could also cause crashing since your CPU isn't making contact when a specific CPU feature/instruction-set is called upon
  • Re-apply quality thermal paste

SSD:

  • Install it in a secondary slot if available
  • Buy a new SSD

PSU:

  • Properly testing a PSU requires tools that most people don't possess
  • It is cheapest to just try a new PSU
  • Maybe the old one delivers faulty voltage under specific loads
  • Maybe the old one can't cope with certain amperage spikes

Motherboard:

  • Properly testing a motherboard isn't for the faint of heart
  • Check reviews for your motherboard model and see if your symptoms are a common issue
  • Visually inspect all the slots and ports for cracking, debris, bent pins, burn marks, etc.
  • Visually inspect the capacitors and VRMs for signs of deterioration
  • Googling "MotherboardModel common problems" is a good start
  • If the motherboard seems bad then replacement is the common option
MonkeyZeus
3

Perhaps it's just me, but I've found in the past that random-seeming crashes like you've described are almost always due to my video card overheating. The fact that the logs are pointing you to your video hardware makes me doubly suspicious. So I'm curious how you know "The PC isn't overheating"?

One thing I'd suggest is opening up the case and running the PC for a while. Periodically do a visual inspection of all the fans. In one instance I discovered that one of the fans on my video card was occasionally getting stuck and not spinning. Replacing the fan largely fixed the issue.

Some indications that the video card might be having heat trouble are:

  • You have better luck booting after a crash if you let the PC sit (powered down) for a few minutes.
  • It happens more often, or more reliably, when the GPU is under continuous load (3D gaming, rendering, bitcoin farming, SETI@Home, etc.)
  • Changing video drivers seems to change how often it happens.
  • Occasional graphical artifacts when under load.
  • 3D games develop performance issues they didn't use to have.
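If you want numbers to go with the visual inspection, one option is to log what the driver reports while the card is under load. This is a minimal sketch that assumes `nvidia-smi` (shipped with the NVIDIA driver) is on the PATH; the function names and the default sample count and interval are my own choices.

```python
import subprocess
import time

# temperature.gpu, fan.speed and utilization.gpu are standard nvidia-smi
# query fields; everything else in this sketch is an illustrative assumption.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,fan.speed,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def to_int(text):
    """Return an int, or None when the driver reports something like [N/A]."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def parse_sample(line):
    """Turn one CSV line such as '61, 43, 97' into a dict of readings."""
    temp, fan, util = (to_int(field) for field in line.split(","))
    return {"temp_c": temp, "fan_pct": fan, "util_pct": util}


def log_gpu(samples=60, interval_s=5.0):
    """Print one timestamped reading per interval for the first GPU."""
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        reading = parse_sample(result.stdout.splitlines()[0])
        print(time.strftime("%H:%M:%S"), reading)
        time.sleep(interval_s)
```

Calling `log_gpu()` in a terminal while a game runs gives a rough temperature trace up to the moment of a crash. The reported fan value is the speed the driver is asking for, so a fan that is physically stuck can still only be caught by looking at it.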

Of course, one easy check would be to just pop in a different video card and see if the issue goes away. Those things are damn expensive though, so if you're not putting together your own PCs, you likely don't have spares sitting around.

T.E.D.