Rig loses data communication with all GPUs at once after a few days (1-6 days).
Miner: TeamRedMiner, latest version
Board: ASRock H110 Pro BTC+
Power supply: Corsair HX1000 (new)
Risers: 4 connected with separate PCIe cables, 1 connected with Molex
HiveOS version: latest, 0.6-210@210921 (before that I had 0.6-209 from September, same errors)
GPUs: 4x RX 580, 1x RX 6600
Here is the log:
[2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 2 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 4 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] Watchdog API alert: API thread seems stuck in states 2 / 3 since 25 secs.
[2021-09-24 16:17:02] GPU monitor alert: gpu driver stats refresh thread is stuck, triggering watchdog
[2021-09-24 16:17:02] GPU 1 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] 1 08:00.0 36 0 0 0 0C 0C 0C 0.00% 0 0 mV 0 W B288
[2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 4 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 2 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] 2 0b:00.0 32 0 0 0 0C 0C 0C 0.00% 0 0 mV 0 W A384
[2021-09-24 16:17:02] GPU 0 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] 3 0c:00.0 36 0 0 0 0C 0C 0C 0.00% 0 0 mV 0 W B288
[2021-09-24 16:17:02] GPU 1 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 2 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] GPU 4 Gpu monitor stats refresh timed out, no data available.
[2021-09-24 16:17:02] 4 0d:00.0 36 0 0 0 0C 0C 0C 0.00% 0 0 mV 0 W B288
I have this error popping up about every 12-24 hours as well. I started getting it when I added a 7th card to my rig, and I am using a 1-to-4 PCIe splitter. I have 6x 6600 XT and 1x 580 on it… not sure if it's related to the 6600s? My other theory is that the motherboard doesn't like the 7th card and/or the splitter, but it's tough to tell from the error messages whether it's a software or a hardware issue.
Anyway, let me know if you found something… I haven't solved this one yet.
Yes, I solved it.
This error usually appears when a riser has failed, when a power cable (to the riser or to the GPU) has a bad connection, or when something inside a GPU is short-circuited. But in those cases HiveOS usually shows the 511 error for a single GPU, not for all GPUs at the same time.
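One way to check whether a failure hit a single GPU or all of them at once is to group the "timed out" lines by timestamp. This is a minimal Python sketch, assuming the log line format pasted earlier in this thread; the function name `gpus_timed_out_by_timestamp` is made up for illustration and is not part of TeamRedMiner or HiveOS.

```python
import re
from collections import defaultdict

# Matches lines like:
# [2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.
TIMEOUT_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] GPU (?P<gpu>\d+) "
    r"Gpu monitor stats refresh timed out"
)

def gpus_timed_out_by_timestamp(lines):
    """Map each timestamp to the sorted list of GPU indices that timed out then."""
    hits = defaultdict(set)
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            hits[m.group("ts")].add(int(m.group("gpu")))
    return {ts: sorted(gpus) for ts, gpus in hits.items()}

sample = [
    "[2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.",
    "[2021-09-24 16:17:02] GPU 2 Gpu monitor stats refresh timed out, no data available.",
    "[2021-09-24 16:17:02] Watchdog API alert: API thread seems stuck in states 2 / 3 since 25 secs.",
    "[2021-09-24 16:17:02] GPU 3 Gpu monitor stats refresh timed out, no data available.",
]
print(gpus_timed_out_by_timestamp(sample))
```

If one timestamp lists every GPU in the rig, that points more toward something shared (driver, board, power) than toward one bad riser or card.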
I started with the PCIe slots on the board, changing positions, then the risers, swapping them one by one, then the cables. Then I lowered the OC settings on all GPUs and started raising them again one GPU at a time, waiting a day or two before raising the next one. The ridiculous part is that I left the RX 6600 for last, because I have 6 RX 6600 cards in other rigs, all with the same OC give or take 10, and I never saw this error on them…
I can tell you that before I put this 6600 in the rig, I tested it in a test computer, and it ran with the same OC settings for 2 days without any error. Really weird.
I test every card in the test computer to find the right OC settings before I put it in a rig. That is why I didn't think the OC settings were the cause.
In the meantime I added a few more cards to the rig; now I have 7. Everything ran well for 6 days without a “dead” GPU… From the moment I raised the memory clock on that 6600, one GPU goes dead every few hours… But a different GPU every time! I don't know… I will find an OC that is stable, but it will be lower than on my other 6600s. I have many cards, but I have never seen raising the OC on one card make another card go dead. This is really weird. Maybe something is wrong with this 6600. We are talking about a difference of 50 in the memory setting: 1100 on memory is OK, 1130 or higher is not. It seems 1120 will be OK; I will see. In the test computer it runs at 1160 for 2-3 days without going dead.
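The narrowing-down described here (1100 stable, 1130 not) is essentially a bisection between a known-good and a known-bad memory setting. Below is a minimal Python sketch of picking the next value to try; the function name and the step of 10 are assumptions for illustration, and every candidate still has to run on the rig itself for a day or more before it counts as stable.

```python
def next_mem_clock(known_good, known_bad, step=10):
    """Return the next memory setting to test, or None when the gap is exhausted.

    known_good: highest setting that ran without a dead GPU
    known_bad:  lowest setting that produced a dead GPU
    step:       granularity of the search (assumed value)
    """
    if known_bad - known_good <= step:
        return None  # nothing left to try between the two bounds
    midpoint = (known_good + known_bad) // 2
    # Snap down to the step grid, but always move at least one step past known_good.
    offset = max(step, ((midpoint - known_good) // step) * step)
    return known_good + offset

print(next_mem_clock(1100, 1130))
print(next_mem_clock(1120, 1130))
```

After each test run, replace `known_good` or `known_bad` with the value just tested and call the function again until it returns `None`.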
I get the error on all GPUs:
[2021-10-15 22:57:07] GPU 1 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 2 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 3 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 4 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 5 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 6 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 7 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 8 Gpu monitor stats refresh timed out, no data available.
[2021-10-15 22:57:07] GPU 6 Gpu monitor stats refresh timed out, no data available.
So it's on all of them, but I found a line that might explain it:
"GPU monitor alert: gpu driver stats refresh thread is stuck, triggering watchdog"
The good thing is that it restarted automatically and kept on mining.
It's been up for 10 hours now with no errors. It's curious how it works. Today at around 10 am I added another worker on HiveOS with 5 non-LHR 3070s. For this worker I used a Samsung USB stick that I flashed when I woke up today. Both my workers started on the HiveOS version that comes with the image you flash, but once they're up and running you can still update HiveOS, which I did. In the 10 hours since, I've had no errors, so I hope that fixed it.
I have 0.6-210@210921 and AMD driver 20.40 (5.11.0701)
For me it now works almost OK. No more 511 temp errors. Just every 1-2 days one of the cards goes dead (I'm not sure whether this is related to the RX 6600; I'm still fine-tuning the OC for the other cards). The current OC for the RX 6600 is CORE 1000, MEM 1030, VDD 630, giving 32.09 MH/s at 49 W. When I finish fine-tuning the other cards and the rig has been stable for 7 days, I will start pushing the RX 6600 further.
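For comparing OC settings between cards, hashrate per watt is a handy figure. A small Python sketch using the hashrate and power quoted above; the function name is just for illustration.

```python
def mh_per_watt(hashrate_mh, power_w):
    """Efficiency in MH/s per watt."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return hashrate_mh / power_w

# RX 6600 at CORE 1000, MEM 1030, VDD 630
print(round(mh_per_watt(32.09, 49), 3))  # 0.655
```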
I have 2 rigs, all with 6600s. I'm having this same issue with TeamRedMiner on one of them; on the other rig I'm using lolMiner, and it has been fine for 3 days now. I will test for a few more days, and if it stays stable I will switch everything to lolMiner.
I've used the settings VDD = 640, VDDI = 640, and the results are not very good: errors often occur (GPU 5 detected dead) when I use them with TeamRedMiner. I don't know how they behave with lolMiner.
I use settings like this, and it’s been stable for 3 days,
Had the same issue as you on my 9x 6600 XT farm. I tried everything and even thought of changing the motherboard. Before I got to that, I thought a fresh, clean OS might help, and that's what I did. I run HiveOS from a 32 GB Samsung pendrive. There are 2 ways of installing HiveOS on your stick and getting it running. The first is to install it from your main HiveOS account by clicking Add Worker, copy-pasting your farm hash, and so on. This is what I did the first time I used HiveOS. For the second (the one that “cured” my problem), you log out of your HiveOS account and go to the main page. At the very top there is an Install button that takes you to installing HiveOS on your stick. So far so good, but remember that when you start your miner from the pendrive, it will now ask for a rig ID and password. Go back to your HiveOS account, click on the worker that is offline, and open the Settings tab. There you will find the worker ID (rig ID) and the password. Start the miner with the new OS and enter the ID and password.
I thought that updating the system every single time it shows yellow would help. On the contrary, it doesn't. I managed to get rid of this issue, but from time to time one of my workers randomly goes offline. This used to happen every few hours, driving me mad. The solution for that is to edit your flight sheet and reselect your server. I mean, I selected the same server but also added the next one in line with a good response rate, and I also ticked the SSL server option, whatever that does. But it works; now I get this error maybe once a week… Hope this helps.
PS: If you updated to the latest version and then go back to an older version using my method, there is a chance that the HiveOS version shown for the rig on the HiveOS site won't update (or revert). But hopefully this won't matter, the problem will be gone, and you won't need to update anymore.