Having a weird problem with consistently losing card(s)

ratimux · April 12, 2021, 7:45pm

I have a strange problem and have exhausted all my options (that I can think of), so I am reaching out to the community for help.

My rig powers up and boots into HiveOS and all 7 of my GPUs are typically detected and recognized. After some time (it may be minutes, an hour or a couple of hours) I will notice that my hash rate has dropped, and when I look at my miner I see that it lost one of the cards. Sometimes it is more than one card and they are not always the same.
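One way to pin down exactly which card disappears is to record the PCI bus IDs right after a clean boot and compare them later. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH and that its CSV query output has one GPU per line; the function names and the 7-card count are illustrative, not part of HiveOS. When a card drops, `nvidia-smi` may print an error line instead of a normal row, so lines that do not parse are simply skipped.

```python
import subprocess

EXPECTED_GPUS = 7  # assumption: the 7-card rig described in this post


def parse_gpu_list(csv_text):
    """Parse output of:
    nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader
    Lines that are not 'index, name, bus_id' (e.g. error text) are skipped.
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        gpus.append({"index": int(parts[0]), "name": parts[1], "bus_id": parts[2]})
    return gpus


def missing_bus_ids(baseline, current):
    """Return bus IDs present in the baseline snapshot but absent now."""
    seen = {g["bus_id"] for g in current}
    return [g["bus_id"] for g in baseline if g["bus_id"] not in seen]


def query_gpus():
    """Take a live snapshot; returns an empty list if nvidia-smi prints nothing usable."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_list(result.stdout)
```

Calling `query_gpus()` once after boot and again when the hash rate drops, then passing both snapshots to `missing_bus_ids`, gives the bus ID of the lost card, which can be matched to a physical slot or riser.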

Here is my setup and what I have tried so far:

MSI Z390-A-PRO motherboard (the MB has a PCIe power connector that I currently have powered even though the risers have their own power, but I also tried without the PCIe power cable plugged into the board)
i5-9600K CPU
128 GB SSD for HiveOS (plugged to SATA 1 port)
2 x 8GB RAM
1 x M.2 to PCIE adapter
2 x 1000 watts power supplies
7 x PCIe risers (tested and confirmed that they work). All risers are powered from the VGA 6 pin cables coming out of the PSU (no molex or sata cables here)
All risers are powered from the same power supply that is powering the associated GPUs
3 x RTX 3060s
4 x RTX 3070s
HiveOS 0.6-203@210410 with NVIDIA driver 460.67 >> miner is set to start with 15 seconds delay, and the OC settings are set to kick in after 120 seconds.

My BIOS is set up with Gen 1/2 and 96 for the PCIe, disabled everything that I don’t need (audio controller, on-board video, virtualization, etc.), enabled 4G, Windows 10 WHQL Support to UEFI, changed power settings to turn on, disabled serial ports…

I have tried the most up-to-date BIOS for the MSI Z390-A PRO, and also tried a few of the older versions. I think the latest BIOS firmware was the least stable and most inconsistent. Currently running a version from 2019.

Here is a sample of what it looks like when one of the cards is dropped. This time it took only about 20 minutes after the restart. Before that it took 2 hours…

I have also played with the overclocking settings and it does not seem to matter. In fact - I think that the longest it ran for (36+ hours) was with higher OC settings.

Two observations:

I noticed that the motherboard clock is off. I changed the CMOS battery and reset the board, but when I go in and set the correct date and time, it does not retain it. The BIOS does not even register it as a change, so it does not prompt me to save. I have tried adjusting the date and time along with some other changes that I needed to make anyway, so that I could save and exit, but no luck: the next time I go into the BIOS the date and time have changed. It is typically a day ahead and several hours off.
When HiveOS boots up I see in the console a warning message that the time is messed up. Not sure if this could be causing all my troubles.

I am at a loss - any help and suggestions are appreciated!

ratimux · April 13, 2021, 7:25pm

I can now confirm that Windows sees all 7 cards with no problem and can mine, but it is weird that HiveOS drops one card. Is this a bug in the OS?
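To help tell a software bug from a hardware or link problem, it can be useful to look for NVIDIA Xid messages in the kernel log around the time a card drops. The sketch below assumes the usual `NVRM: Xid (PCI:...): <code>` line format in `dmesg` output; the function name is illustrative, and the sample line in the usage note is made up for demonstration, not taken from this rig.

```python
import re

# Assumed kernel log format: "NVRM: Xid (PCI:0000:05:00): 79, pid=..., <message>"
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code) tuples for every Xid line in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events
```

Feeding it the text from `dmesg` lists which PCI address reported which Xid code. For example, a line such as `NVRM: Xid (PCI:0000:05:00): 79, pid=1234, GPU has fallen off the bus.` would yield `("0000:05:00", 79)`; Xid 79 generally points at the card losing its PCIe link rather than at the miner itself.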

andybb311 · April 28, 2021, 9:00pm

I’ve been dealing with the same EXACT symptoms on my MSI Z390 Gaming Plus board with HiveOS. Everything, down to taking hours for a clean reboot. I found that pushing the config file finds the rig quickly when I reboot, or when I hard power off and then turn on remotely with my wifi switch. I hate to say this but I’m glad I’m not alone. The dropped lane is GPU 2 about 75% of the time and GPU 5 about 25%. I only have 6 on this rig. Switched everything out. Still to no avail. Like you, I haven’t run longer than 36 hrs. Based on your Windows success, I’m going to try another miner OS, probably this weekend. I’ll let you know how I make out.

ratimux · April 29, 2021, 3:41pm

I did get a different motherboard that supports 12 GPUs and the issue seems to have gone away, mostly away. I am now thinking this is all caused by something else in my setup. Btw, I replaced all risers and cables and was still having the original issue with the Z390 a pro…
At least now I get a more consistent boot up where all cards show up in HiveOS and it goes a couple of days before it craps out…good luck

Antdabest11 · May 13, 2021, 1:13am

Are you still using the MSI Z390-A Pro?

ratimux · May 14, 2021, 5:48pm

Nope, why?

Johnnmarnell · June 29, 2021, 2:32pm

Not sure if this was fixed yet or not, but I had a similar issue where one GPU would drop. I found out which one it was, took it out, and put it back in, reconnecting the power, riser, etc. I think the issue was that the card was not fully pushed down at the bottom, so I made sure it was pushed all the way in and tight. Restarted and so far it’s good. So maybe just take it out and put it back in.

cuancuan · February 12, 2022, 7:50am

Hi all,
I’m a newbie in mining. I have the same problem: one GPU is reported with an error after a few hours.
I already changed the riser etc., but it is still the same.

I’m confused about the error. Logically, if it has errored (not mining), why are all the GPUs still hot? I also used a power meter, and the power consumed is the same as for 6 GPUs. The reported hashrate and realtime hashrate are also the same as for 6 GPUs.

Has anyone already resolved this problem?

system · April 4, 2023, 10:51pm

This topic was automatically closed 416 days after the last reply. New replies are no longer allowed.