Having a weird problem with consistently losing card(s)

I have a strange problem and have exhausted all my options (that I can think of), so I am reaching out to the community for help.

My rig powers up and boots into HiveOS, and all 7 of my GPUs are typically detected and recognized. After some time (it may be minutes, an hour, or a couple of hours) I notice that my hash rate has dropped, and when I look at my miner I see that it has lost one of the cards. Sometimes it is more than one card, and it is not always the same ones.
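To pin down which card drops and when, the GPU list can be polled and compared against what was seen right after boot. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH in the HiveOS shell. The function names and the expected count of 7 are illustrative and not part of HiveOS.

```python
import subprocess

EXPECTED_GPUS = 7  # assumption: the 7-card rig described above


def parse_gpu_list(csv_text):
    """Parse 'index, name, pci.bus_id' CSV rows into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip anything that is not a regular 3-column row (e.g. error text).
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        gpus.append({"index": int(parts[0]), "name": parts[1], "bus_id": parts[2]})
    return gpus


def missing_bus_ids(baseline, current):
    """Return PCI bus IDs present in the baseline but absent from the current list."""
    seen = {g["bus_id"] for g in current}
    return [g["bus_id"] for g in baseline if g["bus_id"] not in seen]


def query_gpus():
    """Ask nvidia-smi for the visible GPUs; returns [] if the tool is missing."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    print(f"{len(gpus)} of {EXPECTED_GPUS} expected GPUs visible")
    for g in gpus:
        print(g["index"], g["name"], g["bus_id"])
```

Saving the output of `query_gpus()` right after a clean boot and passing it as the baseline to `missing_bus_ids()` later shows the PCI bus ID of the card that disappeared, which maps back to a physical slot or riser more reliably than the GPU index does.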

Here is my setup and what I have tried so far:

  • MSI Z390-A-PRO motherboard (the MB has a PCIe power connector that I currently have powered even though the risers have their own power, but I also tried without the PCIe power cable plugged into the board)
  • i5-9600K CPU
  • 128 GB SSD for HiveOS (plugged into the SATA 1 port)
  • 2 x 8GB RAM
  • 1 x M.2 to PCIe adapter
  • 2 x 1000-watt power supplies
  • 7 x PCIe risers (tested and confirmed that they work). All risers are powered from the VGA 6-pin cables coming out of the PSU (no Molex or SATA cables here)
  • All risers are powered from the same power supply that is powering the associated GPUs
  • 3 x RTX 3060s
  • 4 x RTX 3070s
  • HiveOS 0.6-203@210410 with NVIDIA driver 460.67. The miner is set to start with a 15-second delay, and the OC settings are set to kick in after 120 seconds.

My BIOS is set up with Gen 1/2 and 96 for PCIe. I disabled everything that I don't need (audio controller, on-board video, virtualization, etc.), enabled 4G, set Windows 10 WHQL Support to UEFI, changed power settings to turn on, disabled serial ports…

I have tried the most up-to-date BIOS for the MSI Z390-A-PRO, and also tried a few of the older versions. I think the latest BIOS firmware was the least stable and most inconsistent. I am currently running a version from 2019.

Here is a sample of what it looks like when one of the cards is dropped. This time it took only about 20 minutes after the restart. Before that it took 2 hours…

I have also played with the overclocking settings and it does not seem to matter. In fact, I think that the longest it ran (36+ hours) was with higher OC settings.
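Since the OC settings do not seem to be the deciding factor, the kernel log may say more about what happens at the moment a card drops. The NVIDIA driver reports faults as "Xid" lines, and Xid 79 is the documented code for a GPU that has fallen off the bus. Here is a rough sketch that pulls those lines out of `dmesg` output. The sample line in the comment is illustrative and not taken from this rig, and the exact message text can differ between driver versions.

```python
import re
import sys

# Matches kernel lines such as (illustrative example only):
#   NVRM: Xid (PCI:0000:03:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+),?\s*(.*)")


def find_xid_events(log_text):
    """Return one dict per NVIDIA Xid line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "device": match.group(1),       # PCI address of the affected GPU
                "code": int(match.group(2)),    # numeric Xid code
                "detail": match.group(3).strip(),
            })
    return events


if __name__ == "__main__":
    # Only read stdin when something is piped in, so the script never blocks.
    if sys.stdin is not None and not sys.stdin.isatty():
        for event in find_xid_events(sys.stdin.read()):
            print(event["device"], event["code"], event["detail"])
```

It could be run as `dmesg | python3 xid_scan.py` (the file name is arbitrary) right after a card disappears. If the same PCI address shows up every time, that points at one slot, riser, or card. If the address moves around, power delivery or the board itself becomes more likely.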

Two observations:

  1. I noticed that the motherboard clock is off. I changed the CMOS battery and reset the board, but when I go in and adjust to the correct date and time, it does not retain it. It does not even register it as a change in the BIOS, so it is not prompting me to save. I have tried adjusting the date and time along with some other changes that I need to make anyway, so I can save and exit, but no luck: the next time I go into the BIOS the date and time have changed again. It is typically a day ahead and several hours off.
  2. When HiveOS boots up I see a warning message in the console that the time is messed up. Not sure if this could be causing all my troubles.
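To put a number on the clock problem from inside the OS, the hardware clock can be compared against the system clock. This is a small sketch, assuming a Linux kernel that exposes the first RTC at `/sys/class/rtc/rtc0/since_epoch`, that the RTC is kept in UTC, and that the system clock has already been corrected over the network (none of which I have verified on HiveOS).

```python
import time
from pathlib import Path

# Assumption: the board's clock is the first RTC device on this system.
RTC_PATH = Path("/sys/class/rtc/rtc0/since_epoch")


def read_rtc_epoch(path=RTC_PATH):
    """Return the RTC time in seconds since the Unix epoch, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_skew(rtc_epoch, system_epoch):
    """Describe how far the RTC is from the system clock in hours and minutes."""
    skew = int(rtc_epoch - system_epoch)
    if skew == 0:
        return "RTC and system clock agree"
    direction = "ahead of" if skew > 0 else "behind"
    hours, remainder = divmod(abs(skew), 3600)
    return f"RTC is {hours}h {remainder // 60}m {direction} the system clock"


if __name__ == "__main__":
    rtc = read_rtc_epoch()
    if rtc is None:
        print("Could not read the RTC")
    else:
        print(describe_skew(rtc, time.time()))
```

If the skew keeps growing or jumps between boots, that would suggest the board is not holding its clock at all, which matches what the BIOS screen shows.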

I am at a loss - any help and suggestions are appreciated!

I can now confirm that Windows sees all 7 cards with no problem and can mine, but it is weird that HiveOS drops one card. Is this a bug in the OS?

I’ve been dealing with the same EXACT symptoms on my MSI Z390 Gaming Plus board with HiveOS. Everything, down to taking hours for a clean reboot. I found that pushing the config file finds the rig quickly when I reboot, or when I hard power off and then turn it on remotely with my wifi switch. I hate to say this, but I’m glad I’m not alone. The dropped lane is GPU 2 about 75% of the time and GPU 5 about 25%. I only have 6 on this rig. Switched everything out. Still to no avail. Like you, I haven’t run longer than 36 hrs. Based off your Windows success, I’m going to try another miner OS, prob this weekend. I’ll let you know how I make out.

I did get a different motherboard that supports 12 GPUs and the issue seems to have gone away, or mostly away. I am now thinking this is all caused by something else in my setup. Btw, I replaced all risers and cables and was still having the original issue with the Z390-A-PRO…
At least now I get a more consistent boot-up where all cards show up in HiveOS, and it goes a couple of days before it craps out… good luck

Are you still using the MSI Z390-A Pro?

Nope, why?

Not sure if this was fixed yet or not, but I had a similar issue where one GPU would drop. I found out which one it was, took it out, put it back in, and reconnected the power, riser, etc. I think the issue was that the card was not fully pushed down at the bottom, so I made sure it was pushed all the way in and tight. Restarted, and so far it’s good. So maybe just take it out and put it back in.

Hi all,
I’m a newbie in mining. I have the same problem: one GPU gets detected with an error after a few hours.
I already changed the riser etc., but it is still the same.

I’m confused about the error. Logically, if it has an error (not mining), why are all the GPUs still hot? I also used a power meter, and the power consumed is the same as for 6 GPUs. The reported hashrate and realtime hashrate are also the same amount as for 6 GPUs.

Has anyone already resolved this problem?
