GPU Driver Errors and 6600XT?

MalsBrownCoat · October 8, 2021, 10:18pm

Mining since 2018 (Windows/Nvidia), and I like to think that I’m pretty well versed, but this is my first time using HiveOS/AMD and I’m at my wits’ end trying to figure this out. Fair warning - I tend to be thorough with my troubleshooting and documentation, in hopes that such information will avoid suggestions of things that have already been attempted.

6x6600 XT (4x Powercolor Red Devil/4x Sapphire Pulse - all brand new)

OC = 980 core, 1140 memory, 650 VDD, each card giving 32.64 MH @ 50 watts (avg)

HiveOS version 06-210@201107 with AMD 20.40 (5.11.0701)

Ran fine for ~ 14 hours. The next day, I noticed that at some point overnight, the rig had crashed, showing “GPU driver error - no temps”.

Most research that I’ve found on this error seems to be with Nvidia cards, but I started troubleshooting 101 and removed all risers from the MB (Asus Z590-P with Intel i5-10600K) and began testing cards one at a time to see which card(s) had failed. Started with PCIE slot #1.

Testing -
GPU 0 (Powercolor) - pcie slot 1 - GPU driver error - no temps
GPU 1 (Powercolor) - pcie slot 1 - working
GPU 2 (Powercolor) - pcie slot 1 - GPU driver error - no temps
GPU 3 (Powercolor) - pcie slot 1 - working
GPU 4 (Sapphire) - pcie slot 1 - working
GPU 5 (Sapphire) - pcie slot 1 - working

Then tried other slots for GPU 0 -
GPU 0 - pcie slot 2 - GPU driver error - no temps
Swapped riser on GPU 0 with known working riser
GPU 0 - pcie slot 1 - GPU driver error - no temps
GPU 0 - pcie slot 3 - GPU driver error - no temps

Verified other GPUs were working in other slots -
GPU 1 - pcie slot 1 - working
GPU 3 - pcie slot 2 - working
GPU 4 - pcie slot 3 - working
GPU 5 - pcie slot 4 - working
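
One way to read the results so far is to tabulate every (GPU, slot) test and flag any card that never worked in any slot it was tried in. This is a minimal Python sketch of that bookkeeping; the observations are transcribed from the notes above, and the `summarize` helper and its names are hypothetical, not part of HiveOS.

```python
from collections import defaultdict

# (gpu, slot, worked) observations transcribed from the test notes above.
OBSERVATIONS = [
    (0, 1, False), (1, 1, True), (2, 1, False),
    (3, 1, True), (4, 1, True), (5, 1, True),
    (0, 2, False), (0, 3, False),
    (3, 2, True), (4, 3, True), (5, 4, True),
]


def summarize(observations):
    """Return (suspect_gpus, suspect_slots) from (gpu, slot, worked) tuples."""
    by_gpu = defaultdict(list)
    by_slot = defaultdict(list)
    for gpu, slot, worked in observations:
        by_gpu[gpu].append(worked)
        by_slot[slot].append(worked)
    # A GPU is suspect if it never worked in any slot it was tried in.
    suspect_gpus = sorted(g for g, results in by_gpu.items() if not any(results))
    # A slot is suspect only if no card ever worked in it.
    suspect_slots = sorted(s for s, results in by_slot.items() if not any(results))
    return suspect_gpus, suspect_slots


suspect_gpus, suspect_slots = summarize(OBSERVATIONS)
print("Suspect GPUs:", suspect_gpus)
print("Suspect slots:", suspect_slots)
```

With these notes it flags GPUs 0 and 2 and no slots, which points at the cards rather than the board. GPU 2 was only tried in slot 1, so its entry is weaker evidence than GPU 0's.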

Swapped known working power cable and riser (taken from GPU 1)
GPU 0 - pcie slot 1 - GPU driver error - no temps
Rebooted.
Changed OC settings to 900 core, 1339 memory - working @28.66 MH
Changed OC settings to 900 core, 1339 memory, VDD 700 - rebooted - working @28.66 MH
Changed OC settings to 980 core, 1339 memory, VDD 700 - working @28.66 MH
Changed OC settings back to original 980 core, 1140 memory, VDD 650 = hash dropped to 0. Rebooted.
Changed OC settings back to original 980 core, 1140 memory, VDD 700 = hash still 0. Restarted miner. NBMiner would not start. Rebooted.
Changed OC settings to 900 core, 1139 memory (previously working) - NBMiner would not start. GPU driver error - no temps
Reset all OC to 0. Rebooted.
GPU 0 - pcie slot 1 - working @28.66MH (stock clocks are reporting 1350 core, 1000 memory, using 60 watts)
Changed OC settings to 960 core, 1155 memory, VDD 660, VDDCI 640, MVDD 1300 = hash dropped to 0. Rebooted.
Changed OC settings back to 0 (stock). Deleted flight sheet. Created new flight sheet. Rebooted.
Changed OC settings to 965 core, 1134 memory, VDD 675, VDDCI 637, MVDD 1275 = hash dropped to 0. Rebooted.
Changed OC settings back to 0 (stock) - working @28.66 MH.
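
For a sense of what is at stake between stock and the original OC, the per-card figures quoted in this post can be turned into efficiency (MH/s per watt). This is a minimal Python sketch; the `efficiency` helper is made up for illustration, and the inputs are the 32.64 MH @ 50 W and 28.66 MH @ 60 W numbers from above.

```python
def efficiency(hashrate_mh, power_w):
    """Return hashrate per watt (MH/s per W)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return hashrate_mh / power_w


# Per-card figures quoted in the post.
tuned = efficiency(32.64, 50)  # original OC: 980 core, 1140 memory, 650 VDD
stock = efficiency(28.66, 60)  # all OC values reset to 0

# Relative efficiency gain of the tuned settings over stock, in percent.
gain_pct = (tuned / stock - 1) * 100
print(f"tuned: {tuned:.3f} MH/W, stock: {stock:.3f} MH/W, gain: {gain_pct:.1f}%")
```

By these numbers the tuned settings are roughly a third more efficient than stock.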

At this point, I can’t seem to make any changes to the OC without a failure to start mining and throwing a GPU Driver Error.

I’m confused as to why the OC would work just fine for ~14 hours, then suddenly no longer be able to be applied? I get when OCs are too aggressive and aren’t sustainable. But that shouldn’t affect the ability to apply any OC after that.

And why only on 2 cards? Now, I understand the silicon lottery. Hell, I can even concede to one card shXtting the bed, but two?.. Both of which are brand new? 0_o

MalsBrownCoat · October 8, 2021, 10:19pm

Apparently as a new user, I can only embed one image in the initial post. Here’s a screenshot of what I had woken up to with the “GPU Driver Error” and subsequent output.

edwinst31 · October 8, 2021, 11:15pm

Hello bro, your problem is the same as mine.

Previously I used 6x RX 6600 XT (3x Asus Dual and 3x MSI Gaming X).

The first thing you need to try is changing to the stable version 6-208. It will work fine and be very stable, but there is one problem: your hashrate will be locked at 28 MH/s.

No matter what you change the OC to, the result will always be 28 MH/s for the RX 6600 XT.

But if I change the HiveOS version to 6-209 (beta) or 6-210 (the latest beta), my rig won’t be able to boot into TeamRedMiner after a restart/reboot (I’m a TRM user).

The status that comes up is “GPU Error no temps”, because I use VDD 640 and VDDCI 620. If I change VDD to 800 and VDDCI to 750, I can log in and start mining, but every 1-3 hours my rig will restart (and then the status is not “GPU error no temps” but “GPU detected dead”).

In conclusion, you have to use the stable version 6-208 if you want to mine stably without any problems. Set VDD to 600 and VDDCI to 600 and there will be no problem; everything is stable (but the hashrate is locked at 28 MH/s).
If you are using a HiveOS beta version, or the latest version, you will experience “GPU error no temps” / “GPU detected dead” problems, and this is very bad.

I’ve been looking for 4-5 days for a solution to this problem but to no avail.

As I see it, first, the driver is not stable for the 6600, and second, it is a motherboard problem. You could replace the motherboard with an S37 type; I have read a lot about this on this forum, and many problems were solved by changing the motherboard to the S37 type.

My error: “GPU error no temps”, and it always fails when booting into TRM.

poleclimber111 · October 14, 2021, 3:31am

Try setting your VDD at 750. For me the error went away and it ran fine. It may just be something the software doesn’t like.

mogliettazza · October 16, 2021, 3:10pm

This happened to me. I tried everything, but in the end it was an easy fix: I removed the OC and the flight sheet, deleted all the OC configuration, rebooted, made a new flight sheet, redid the OC the same as before, and boom, all good.

My OC is 1400 / 640 / 620 / 2270; temp is around 52-60, at 57 W with KawPow.
Hope this helps.

Anyway, now I’m worried about adding more cards, because the problem happened to me after I turned off the rig to do some cable management and change the PSU. It is a brand new rig and this is the first card I installed. I already regret buying this card, even though I found it at a good price.

MalsBrownCoat · October 17, 2021, 7:55pm

It seems that no matter what values I try in the OC, the GPU will start mining at stock rates (about 28.6 MH), then within a few minutes, the hashrate drops to 0.

I’ve even tried changing to several different versions of HiveOS, both beta and stable variants, including the latest beta, 06-210@210913. All of them have yielded the same result.

I’ve scoured forum thread after forum thread and at this point, it really seems like it’s the blind leading the blind here. Just a lot of speculation and luck on whether or not something sticks.

mogliettazza · October 19, 2021, 1:28pm

Did you try to completely remove the OC?
Go to the OC section, “clean” the OC, and reboot.
Then apply your settings. The ones I posted above have been working solidly for 3 days now. I just added another card (1660 Super) with no issue; we’ll see what happens when I add more. I like the 6600 XT only for the ultra-low wattage, and I don’t think I’m going to buy more of this card.

MalsBrownCoat · October 19, 2021, 11:09pm

Before making further changes, I ran the stock settings (everything at 0) for 2 days and everything seems to work just fine.
I just “cleaned” the OC, and removed/recreated the flight sheet, then rebooted.
It’s currently running on LOLMiner and began hashing at 32.27 MH.

…for about 8 minutes, then threw a hardware error.

MalsBrownCoat · October 19, 2021, 11:10pm

A minute later, and I was back to zero with a full crash.

I never had problems like this running nvidia/windows. Very disappointing.

mogliettazza · October 20, 2021, 12:51am

This is a benchmark rig where I test components, software, etc.

I put in a 1660 Super and a 6600 XT and tried both the ETH and RVN algorithms.

I deleted all the OC and started fresh with the HiveOS beta version.
All good so far; at the time of the pictures it had been running fine for about 10 min.
Now it is about 20 min with no issue.

I can also jump from ETH to RVN with no problem; I just need to reboot and change the flight sheet and OC.
Hope this helps.

mogliettazza · October 20, 2021, 12:58am

Here are the flight sheet and OC.

26 min, no problem so far.

On KawPow it worked for 3 days with no issue; now I have put it on Ethash.

MalsBrownCoat · October 20, 2021, 1:15am

These are great details, and I do appreciate them, but I don’t consider 26 minutes to be “stable”.
In my experience, establishing true “stability” can be fairly random.
An OC that worked for months could suddenly decide that it doesn’t want to work anymore.

For example, I switched back to NBMiner (because LOL simply wouldn’t start, even after following the steps you suggested). I went ahead and removed the card that I had previously been testing and instead, connected 4 that seemed to be fine when I first started the initial troubleshooting.
This time, with NBMiner, things were running ok.
And 35 minutes later, it crashed.

mogliettazza · October 20, 2021, 2:44am

Touché, you’re absolutely right; even 1 day can’t be considered stable.

I bet you already tried another riser, USB cable, etc.

What I would do if nothing works is disassemble the rig, test it part by part, and find the faulty one.

All your cards are 6600 XTs. Could it be that one of the cards is not working properly? Defective?

I don’t like this card. At that hashrate I’m in love with the 1660 Super or 3060 (even LHR). I know those are a bit high on watts, but in my personal experience the 6600 XT gives me trouble and those don’t (at least so far).
Plus, their prices are a bit cheaper and they are easy to find new and used.

Have you already tried changing the PCIe slot or reinstalling HiveOS?

I’m not trying to waste your time, but I’m out of suggestions.

mini_miner · October 20, 2021, 3:28pm

Hello,

My system was working fine for a couple weeks then the SSD died. The SSD is something like 10 years old… It was only a matter of time.

Once it went out, I grabbed a super cheap SSD and the rig wouldn’t even boot. So, I grabbed a 1 TB NVMe drive, as this is all I had laying around (I’ll need to buy another SSD, but in the meantime, I’d like to get my rig in operation).

My issue is that with my original SSD I was able to get 32 MH/s on 6 6600 XTs consuming 45-ish watts of power. Now, I can’t get above what the screenshots below are showing. I have highlighted a few things that show that I have adjusted the overclock settings manually; however, the memory settings refuse to change…

Does anyone have any advice?

Oh, and I can only get 4 GPUs to boot with the OS, not the original 6. When I plug in the 5th one, the system refuses to finish POST. I have enabled 4G Mining in the BIOS; however, that made things worse. The rig won’t boot with even just one GPU with 4G and BAR size enabled.

I have even edited the miner config file… No Dice Tango.

I am lost.

Any ideas?

MalsBrownCoat · October 21, 2021, 12:31am

Thanks for the suggestions, Luigi. I had already tried swapping out components that were known to be working (risers, power cables, pcie slots on the mb, a complete reinstall of Hive on a good SSD, etc). I’m beginning to think that 2 of these gpus are just defective. I can completely understand if a few cards don’t like certain thresholds of over (or under) clocking. But the two in question don’t seem to accept any OC whatsoever, and they take the whole rig down with them when they crash.

Like my earlier comment, I don’t really consider even a full 24 hours of continuous mining to be classified (with certainty) as “stable”. However, after removing those two Powercolor GPUs, the rig has been running since my last post. I noticed that it did stop mining for about an hour early this morning, but it appears that it recovered on its own without any intervention on my part.

I just took these screenshots -

Notice the dip to “0”, but then it recovered, followed by another couple of brief areas where it didn’t report at all.

A screenshot of NBMiner indicates that at some point (likely right when those dips occur), at the very least, NBMiner restarted itself (otherwise it would show ~23 hours of activity).
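
The reasoning here (miner uptime shorter than the rig's run time implies the miner restarted) can be sketched in a few lines of Python. This is only an illustration: the `miner_restarted` helper and the 10-minute tolerance are assumptions, and the 5-hour uptime in the example is a made-up value, since the actual figure is only visible in the screenshot.

```python
from datetime import timedelta


def miner_restarted(rig_running_for, miner_uptime, tolerance=timedelta(minutes=10)):
    """Return True if the miner's uptime lags the rig's run time by more than tolerance."""
    return rig_running_for - miner_uptime > tolerance


# Rig has been up ~23 hours; the miner uptime below is a hypothetical example value.
print(miner_restarted(timedelta(hours=23), timedelta(hours=5)))
```

The tolerance allows for the normal gap between the OS booting and the miner starting.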

Note that in these shots, GPUs 0 and 3 are also Powercolors, whereas 1 and 2 are Sapphire Pulses. So far, I have not had any problems with the Sapphires; it has always been a Powercolor that crapped the bed. The two Powercolors that are showing here are different ones than the two that were having the earlier problems though.

So, maybe it’s as simple as “bad cards”. Not sure I’m really buying that though. One? Ok. Two?
…mmm…I dunno about that…

I do have a few more Sapphire Pulses that I think I’ll add to the rig. I suppose if those run for a few days, maybe it really was those two Powercolors. Suffice it to say, I have a few more days left to return those, which I think I’ll just go ahead and do.

MalsBrownCoat · October 21, 2021, 12:46am

mini_miner - when you loaded the (new) NVMe, did you load the exact same HiveOS version, or was it by chance an updated (or earlier) version than your rig.conf file was associated with?

One of the reasons that I question this is because (if I recall from another thread) your OC seems to be locked at 1000 MHz, and I believe that was an issue on prior/stable (rather than beta) versions.

I’m just going to spitball here, but to rule out other variables, have you also done the steps that mogliettazza had shared: remove the flight sheet / “clean” the OC / reboot / apply new flight sheet / apply new OC?

As for 4G decoding, that’s been a bit of a standard over the last several years with mining motherboards. BAR settings should not be required and may in fact hinder performance (though I have no concrete evidence of this). The point being: motherboards have supported more than 4 GPUs just fine without the need for anything BAR related for years.

And I don’t mean this to sound remedial, but have you been able to rule out a problem with that specific 5th GPU? What happens if you take one of the four that are currently working, and use it in that card’s place? I would also repeat that test for the specific 6th GPU in question.

mogliettazza · October 21, 2021, 2:06pm

Very interesting. I was almost sure that you had already tried rebuilding the rig one component at a time, but you know, I had to ask. It could be a bad card; maybe those have a BIOS upgrade?
Maybe it’s just better to return them and buy something else. To me personally, the better card is not the one with more hashrate but the one that is more stable and uses fewer watts, because the last thing I want is a headache. I want to focus more on research and on buying more GPUs (for how long we can mine ETH is another thing…).

Yes, I loaded the same version/worker/conf file. The only thing I did was change the ethernet cable for a commercial one, because I run a WiFi ethernet splitter and it sometimes gave me problems, but not anymore.

I’m not a fan of the beta version, but I wasn’t able to make the 6600 XT work on the stable version; no way, it just won’t work. So I built a new rig with a B550 Plus and a syoncon USB 4 PCIe splitter with the beta version, and after setting the OC and a new flight sheet, everything seems to have worked fine for 3 days now, with a firm 32 hashrate, 49 W, and 20% fan. Not bad at all, but I don’t trust this card, period. Temps are 48/58.
Kernel version: 5.10.0-hiveos #60
AMD driver: 20.40 (5.11.0825)
Nvidia driver: 470.63.01
HiveOS: 0.6-210@211010

My favorite setup is a 110 BTC ASRock 13-GPU motherboard with the stable HiveOS version (but there is no way to make the 6600 work on it). To tell you the truth, I only put 12 GPUs on it. Why? No reason; I never like to overload anything, I just want to keep it a bit light. I wish I could buy better cards, yes I do, but I don’t want to pay scalper prices for high-end cards; it’s just not fair.

Today or tomorrow I will install another card on the new rig and see what happens. I’m pretty sure the 6600 XT is going to lose its configuration and I will have to redo the OC and flight sheet, but this time, if it happens, no more wasted time: I will return the card and just go with another 1660 Super. I’m not lazy, but I can’t be, and don’t want to be, concerned every day.
As soon as I do that, I’ll let you know what happens. I’m waiting for a 1660 Super and a 3060 LHR;
they’re going to be here to papa anytime now.

As of right now, how is your rig going? Which cards did you leave out of the rig (if any)?
Looking at your pictures, did you notice that each 6600 runs at a different temp?

edwinst31 · October 21, 2021, 10:48pm

Your problem lies in the HiveOS version you are using.

You are using HiveOS stable version 6-208, and it does not work well for the RX 6600 XT. The maximum hashrate you can get is 28.5 MH/s; no matter how you set your overclock, it will still get 28.5 MH/s.

Download the latest beta version from the HiveOS website. As far as I know, the latest beta is version 6-210.

Use that version and you will be able to adjust your OC, but remember that the RX 6600 XT is a new card, and there is no version that is completely stable.

If you use the beta version you will run into problems, as the settings are very sensitive, but the hashrate you get can reach 32-33 MH/s.

mogliettazza · October 22, 2021, 12:11pm

perhaps i get 32,and sure the card is about feb 2021 but the problem here is not the hashrate, only

MalsBrownCoat · October 23, 2021, 8:12pm

Just an update - after removing those two problematic Powercolor Red Devils, I replaced them with a pair of Sapphire Pulse cards and things seemed to be going well for about 3 days. I woke up this morning and saw that at some point last night, the rig had crashed. I hadn’t enabled the watchdog on it yet, so that explains why it never restarted. But, since it did crash at some point, the clocks may need a slight adjustment somewhere.
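
For readers unfamiliar with what a hashrate watchdog does, the general idea is: if the rig's hashrate stays under a threshold for some minutes, restart the miner, and if it stays there longer, reboot. The Python below is a minimal sketch of that decision logic only. It is not HiveOS's implementation; the `watchdog_action` function, its parameters, and the sample numbers are all hypothetical.

```python
def watchdog_action(samples_mh, min_hashrate_mh, restart_after, reboot_after):
    """Decide an action from per-minute hashrate samples (most recent last).

    restart_after / reboot_after are counts of consecutive low samples.
    """
    # Count how many of the most recent samples in a row are below the threshold.
    low_streak = 0
    for value in reversed(samples_mh):
        if value >= min_hashrate_mh:
            break
        low_streak += 1

    if low_streak >= reboot_after:
        return "reboot"
    if low_streak >= restart_after:
        return "restart_miner"
    return "ok"


# Hypothetical example: a rig that normally does ~190 MH/s drops to 0 for 3 minutes.
print(watchdog_action([190, 190, 0, 0, 0], min_hashrate_mh=150,
                      restart_after=3, reboot_after=5))
```

Counting only the trailing streak means a brief dip that recovers on its own, like the one described earlier in the thread, does not trigger anything.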

I don’t really have the time to futz with it right now and I’ll be on a work trip for a week (I made sure to set up the watchdog in the meantime). I’ll provide an update when I return. Appreciate everyone’s insight on things so far. Thank you.