Hey guys. I think I have some solid OC settings and really good mem temps (76). After 10 days of mining without problems, GPU3 died and HiveOS rebooted automatically. Everything works fine again after the reboot…probably for another 10 days, who knows. I only noticed some hours after it happened.
Is there a reason to be worried or to change settings? 10 days is a long time. Or can a card always crash after 10 days…regardless of settings?
A GPU being detected as “dead” or “stuck” is generally due to overclocks being too extreme or not enough voltage. I recommend increasing voltage in steps of 6 until it is stable.
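For anyone who wants to keep track of that stepping, here is a minimal sketch of the idea in Python. It is only an illustration: `find_stable_voltage`, `ceiling`, and the `is_stable` callback are hypothetical names, not anything HiveOS provides, and in practice "stable" means you ran the rig long enough at that voltage to trust it. The unit of the step is whatever your overclock field uses.

```python
from typing import Callable, Optional


def find_stable_voltage(
    start: int,
    ceiling: int,
    is_stable: Callable[[int], bool],
    step: int = 6,
) -> Optional[int]:
    """Raise the voltage in fixed steps until the stability check passes.

    `is_stable` is a hypothetical callback: it stands in for "I ran the rig
    at this voltage for long enough and no GPU was reported dead".
    Returns the first voltage that passed, or None if `ceiling` was reached.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    voltage = start
    while voltage <= ceiling:
        if is_stable(voltage):
            return voltage
        voltage += step  # same step size as suggested above
    return None
```

Setting a ceiling up front is a deliberate choice here, so the stepping stops at a value you already decided was safe for the card instead of climbing indefinitely.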
Without a full list of your overclocks & cards, I can’t say for certain. It could be miner related as well, but most likely it is related to overclocks.
Also, I highly recommend joining the HiveOS Discord channel, mainly because I am far more active on the channel than the forum and answer questions pretty quickly.
So I have three 5700s (all 3 MSI, 2 flashed to XT BIOS) and 1 5700XT. I have had this issue for a while; it started off happening every hour. Through troubleshooting, I’ve gotten it to go days without reporting a GPU dead. I don’t think it’s a single cause. I have found that certain risers are better for certain cards. For instance, my 580s are better with SATA to 6-pin connectors. Some of my cards perform better with a Molex PCIe riser connector. Sometimes switching the power of the riser can fix it. 6-pin to 6-pin will always be the preferred way to power a riser.
It’s possible your OCs are too aggressive, though I couldn’t verify this one. I would move from 900 to 895 and it would seem fixed, but then I’d increase it back to 900 and it would still behave as if it were fixed, so I can’t confirm that the OC was the cause.
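One way to make that kind of back-and-forth less ambiguous is to write down how long the rig ran before each dead-GPU event and compare the averages per setting. A rough sketch in Python, assuming you keep the log yourself; `mean_uptime_by_setting` and the `(setting, hours)` format are made up for illustration and are not a HiveOS feature.

```python
from collections import defaultdict
from statistics import mean


def mean_uptime_by_setting(runs):
    """Average the hours of uptime before a dead GPU, grouped by OC setting.

    `runs` is an iterable of (setting, hours_until_dead_gpu) pairs that you
    record by hand; the format is an assumption for this sketch.
    """
    grouped = defaultdict(list)
    for setting, hours in runs:
        grouped[setting].append(hours)
    return {setting: mean(hours) for setting, hours in grouped.items()}
```

With only one or two runs per setting the averages mean very little, which matches the experience above: a single "seems fixed" run at 895 does not say much either way.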
THE BIGGEST success I had was actually with heat: replacing the thermal pads, adding plastic washers to the back of the heatsink to apply more pressure, and then putting the rig in front of a big box fan in front of a window (it’s pretty cold where I live). Now it runs for days on end before a dead GPU.
ELI5: It’s a combo of exterior/interior temperature of the memory modules, PCIe riser power module, and/or OC values.
I found it…replaced 2 riser cards and now it’s rock stable again…phew…those risers were only half a year old…the other 2 are OK…so 50% of the risers dead after 6 months…argh…