I run an SSD. My 3080 fan speed only shows up for a few minutes after boot, then it reads 0. This has happened since day one.
Is there a fix for this yet? I’m having the same issue on a 2080 Ti.
I couldn’t find any fix other than rebooting the rig when it happens.
Sometimes I have to reboot several times.
Looks like I found a solution for this issue. I’m still testing it for several days, but it looks better so far. New beta NVIDIA driver: https://www.nvidia.com/download/driverResults.aspx/181167/en
Same issue here, with a 3080. I noticed it doesn’t affect anything other than the 0% fan reading, so I just leave it as is.
It depends on the GPU. I have an Aorus that automatically turns its fan off when it’s too cold. The solution is either to set the autofan target below the temperature the GPU normally runs at, so the fan never turns off, or to not use autofan at all: let the GPU control itself or set a fixed fan speed.
Autofan needs to read the fan speed to keep making automatic adjustments, so if the fan is turned off it gives this error.
I now have a Palit with this problem. I changed the riser, the machine and everything else. It will almost certainly be the same thing, so I need to run it without autofan.
The problem occurs on any Linux; I use 3 systems and it occurs on all 3.
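To illustrate the "never let the fan turn off" idea above, here is a minimal sketch of one control step with a speed floor. It is not how HiveOS autofan is implemented; the function name, the step size and the 30% floor are all assumptions made up for the example.

```python
def autofan_step(current_speed, temp_c, target_temp_c,
                 min_speed=30, max_speed=100, step=5):
    """Return the next fan speed in percent (hypothetical controller).

    min_speed is the floor that keeps the fan spinning even on a cold
    card, so the fan reading never drops to 0.
    """
    if temp_c > target_temp_c:
        nxt = current_speed + step   # too hot: speed up
    elif temp_c < target_temp_c:
        nxt = current_speed - step   # too cold: slow down
    else:
        nxt = current_speed          # on target: hold
    # Clamp to [min_speed, max_speed]; the floor is the important part.
    return max(min_speed, min(max_speed, nxt))
```

With a floor of 30, a card sitting well below its target temperature stays at 30% instead of stepping down to 0%.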
Has anyone else tested this beta driver? Same issue here: a 3090 whose fan reading goes to 0%, and only a reboot fixes it. But it’s not just the fan for me. If you go into the miner screen you can see that it can’t read the Core and Mem speeds either:
+---+-------+----+-----+-----------+--------+----+------+
| ID|  GPU  |Temp| Fan |   Speed   | Shares |Core|  Mem |
+---+-------+----+-----+-----------+--------+----+------+
|  3| 3090  | 44 | 0 % |123.18 MH/s| 391/0/0|   0|     0|
Then if I run nvidia-smi I can see:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   3  GeForce RTX 3090    On   | 00000000:07:00.0 Off |                  N/A |
|ERR!   44C    P2   289W / 300W |   4933MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
So this clearly suggests it is not a HiveOS bug but a problem with the card’s controller or the Nvidia driver.
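If you want to spot this state without reading the full table, a small sketch like the one below could flag any GPU whose fan value is not a number. The `--query-gpu` and `--format` flags are real nvidia-smi options; the function names are made up, and the exact text the driver prints for an unreadable fan in CSV mode (the sample uses `[Unknown Error]`) is an assumption, so the code simply treats anything non-numeric as unreadable.

```python
import subprocess

# Real nvidia-smi query flags; the choice of columns is ours.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,fan.speed,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output as text."""
    return subprocess.run(QUERY, capture_output=True, text=True,
                          check=False).stdout

def gpus_with_unreadable_fan(csv_text):
    """Return indices of GPUs whose fan.speed field is not a plain number."""
    bad = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip lines that do not look like "index, fan, temp"
        index, fan, _temp = parts
        if not fan.isdigit():  # assumption: any non-numeric value means error
            bad.append(int(index))
    return bad
```

On a rig you would call `gpus_with_unreadable_fan(query_nvidia_smi())` from cron or a script and alert when the list is not empty.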
So, a few more research points. I found this thread on the Nvidia Developer forums which seems to describe the issue we have. I tried to perform a GPU reset using “nvidia-smi --gpu-reset --id=3”, but it kept giving me the “GPU 00000000:07:00.0 is currently in use by another process” error even after I stopped the miner. After restarting the miner, the GPU overclocking for the card with 0% fan failed with these errors:
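On the "in use by another process" error: something other than the miner may still have the device open. A rough way to check on Linux is to scan `/proc` for open file descriptors that point at `/dev/nvidia*`. This is only a sketch; the function name is made up, it needs root to see other users' processes, and it does not tell you whether stopping those processes would actually let the reset succeed.

```python
import os

def pids_holding_device(prefix="/dev/nvidia", proc_root="/proc"):
    """Return sorted PIDs that have an open file whose path starts with prefix."""
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(prefix):
                holders.add(int(entry))
                break
    return sorted(holders)
```

Calling `pids_holding_device()` with the miner stopped shows which PIDs are left, and you can look them up with `ps` before trying the reset again.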
=== GPU 3, 07:00.0 GeForce RTX 3090 24268 MB, PL: 100 W, 420 W, 450 W === 23:26:33
SET POWER LIMIT: 300.0 W [Unknown Error]
(exitcode=123)
SET CLOCKS: 1150 MHz
Max Perf mode: 4 (auto)
Attribute 'GPUGraphicsClockOffset' was already set to 0
ERROR: Error assigning value 80 to attribute 'GPUTargetFanSpeed'
(Mining_Rig1:0[fan:6]) as specified in assignment
'[fan:6]/GPUTargetFanSpeed=80' (Unknown Error).
ERROR: Error assigning value 80 to attribute 'GPUTargetFanSpeed'
(Mining_Rig1:0[fan:7]) as specified in assignment
'[fan:7]/GPUTargetFanSpeed=80' (Unknown Error).
ERROR: Error assigning value 0 to attribute 'GPUMemoryTransferRateOffset'
(Mining_Rig1:0[gpu:3]) as specified in assignment
'[gpu:3]/GPUMemoryTransferRateOffset[4]=0' (Unknown Error).
Attribute 'GPUFanControlState' (Mining_Rig1:0[gpu:3]) assigned value 1.
So I think the HiveOS devs are right: the card’s controller seems to end up in an invalid state that only a reboot can clear. I doubt this is hardware related (aside from the card itself), given how many things other people on this thread have changed without success. I doubt it’s temperature related either; a few people have already pointed out that it’s not happening on the hottest cards in their rigs. Also look at my temperature: it’s 44C, so the card is hardly running hot! I think I am going to let this one go. It doesn’t seem to affect mining, so I’m not sure what else is worth doing…
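For anyone comparing logs, here is a small sketch that pulls the rejected settings out of an overclocking log like the one above. It only matches the "Error assigning value ... to attribute '...'" wording seen in that log; the function name is made up, and other driver or HiveOS versions may word the message differently.

```python
import re

# Matches the wording seen in the log above; other versions may differ.
PATTERN = re.compile(r"Error assigning value (\S+) to attribute '([^']+)'")

def failed_assignments(log_text):
    """Return (attribute, value) pairs for every rejected assignment."""
    return [(attr, value) for value, attr in PATTERN.findall(log_text)]
```

For the log above this gives `GPUTargetFanSpeed` twice (once per fan) and `GPUMemoryTransferRateOffset` once, which fits the idea that the card stops accepting any control commands, not just fan ones.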
Hey man:
I’m having the same issue with one of my 3090s. I changed the riser and cables, updated the driver, and nothing. It came out of the factory with 470.72, and now I’m running 470.86, but nothing helped. After it goes offline, I notice that my fans start spinning faster, so I have to reboot it manually. It sucks because sometimes I’m not home, and I check the app, and it’s been offline for hours without me noticing. Hopefully someone figures it out, because it’s super annoying, wasting electricity for no reason at all.
Good luck to you.
Hi Ernesto. I think you have a totally different problem. What people on this thread are describing is an issue where the fan speed stops being reported by the GPU, but the card is still present and mining without any issues. If your card is going offline then you are most likely pushing it too hard; reduce your OC settings until the card is stable. You should also configure your Watchdogs in your HiveOS console so that HiveOS can either restart the miner or reboot the system to bring the card back online. Then monitor your HiveOS console to see when reboots happen, check the miner logs to identify which card caused the Watchdog to take action, and keep lowering the OC settings on the problematic card until the rig is stable. Have a look at this video: https://hiveos.farm/guides-watchdogs/
Thanks,
Christian
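To make the watchdog idea above concrete, here is a rough sketch of the decision it makes: restart the miner first, and reboot if that does not help. This is not HiveOS code; the function name and the 5 and 10 minute thresholds are assumptions chosen for the example.

```python
def watchdog_action(low_minutes, restart_after=5, reboot_after=10):
    """Decide what to do after hashrate has been low for low_minutes.

    Thresholds are hypothetical: restart the miner first, and only
    reboot the rig if the problem persists past reboot_after.
    """
    if low_minutes >= reboot_after:
        return "reboot"
    if low_minutes >= restart_after:
        return "restart_miner"
    return "none"
```

Logging each non-"none" action with a timestamp is what lets you match reboots against the miner logs afterwards.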
Hello!
I have the same issue with an RTX 3080 AMP Holo… I changed the riser, its place/position in the rig, etc. It recovers after a reboot, BUT in 1-3 h the error appears again (no fan percent)… It’s like a scar on my brain :))
Any solution? Any ideas or news?
P.S. Please don’t start with “change the riser” (I have done it) or “change the cables” (I have done it).
What I have observed: I can’t get more than 95-96 MH/s at 207 W (not limited by me).
I look forward to an answer. Bye! Keep mining
Hi, flash it with an MSI BIOS.
I had the same problem with a 3090 AMP Extreme Holo: low power, 107 MH/s max.
Now it’s listed as an MSI, and it works great.
Thank you! I will give it a try… I’ll get back with an answer for our community on whether it worked for me or not!
Peace!
By the way, is it only about the power, or will it also solve the fan error?
It’s for power; I didn’t have the 0% error with this card.
Usually this error comes with a bad OC / too hot memory.
You can check your mem temp on Windows.
Every time I get a 3080 / 3090 I put it in a PC with Windows to test it and check whether a repad is needed.
With the worst 3080 I got (MSI Suprim X), it’s 94.5 MH/s with core 1080 / mem 1300 / fan 90%; no thermal issue (82°) but bad silicon, it crashes past 1400.
With the best (MSI Gaming Z Trio flashed as a Suprim X), it’s 104.1 MH/s with core 1200 / mem 3200 / fan 90%, memory @ 92° max.
You can start with the first setting and go up little by little.
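The "start low and go up little by little" approach can be sketched as below. The function names and the 100-unit step are made up for illustration, and `is_stable` stands in for whatever stability check you actually run (for example, mining for a while without a crash).

```python
def tuning_steps(start, ceiling, step):
    """Yield memory clock settings from start up to ceiling in equal steps."""
    value = start
    while value <= ceiling:
        yield value
        value += step

def highest_stable(start, ceiling, step, is_stable):
    """Return the last setting that passed is_stable, or None if none did.

    Stops at the first unstable setting instead of trying higher ones.
    """
    best = None
    for value in tuning_steps(start, ceiling, step):
        if not is_stable(value):
            break
        best = value
    return best
```

For a card like the weaker 3080 above that crashes past 1400, `highest_stable(1300, 3200, 100, check)` would stop at 1400 rather than walking all the way up to 3200.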
Ah, I remember something:
before flashing, you can try setting a high PL.
My Gaming Z was locked to 94 MH/s / 207 W even after I flashed it with the Suprim X BIOS; with PL 420 it’s fine.
I don’t have a problem with the power limit. I do have a problem with the fan percent: it always shows me 0%… It can be solved temporarily with a restart, but after 10- maybe 1 hour it will show 0% fan speed again…
But the GPU is mining, that’s the good part.
Maybe there will be a solution!
Peace!
It’s a power problem, caused by the original BIOS and/or thermal throttling.
I’m having exactly the same issue on an ASUS RTX 2070 Super. It works fine for, say, 24 hours, then I log into Hive and it shows err on the interface and 0% fan. The temperature on the card is fine and the fans are spinning. The card is plugged directly into the motherboard for testing, not on a riser.
This issue only started happening in the last few months, after upgrading HiveOS.
Hi, I have the “same” problem: fan speed 0% and normal mining. But if I change the OC, it has no effect on this card; it keeps working at the same speed while the other cards change according to the new OC values.
It seems to be a communication problem, but I have no idea what I can do.