I frequently get errors from two of the 12 cards in my rig: either GPU0, which is a 3060 Ti, or GPU11, which is a 3070. T-Rex Miner often logs lines like these:
WARN: NVML: can't get fan speed for GPU #0, error code 999
[FAIL] 39/43 - Low difficulty or invalid share, 34ms ... GPU #0
When I try to modify overclock settings in the web GUI I get this type of error:
=== GPU 11, 10:00.0 GeForce RTX 3070 7982 MB, PL: 100 W, 270 W, 300 W === 16:49:31
SET POWER LIMIT: 125.0 W [Unknown Error] (exicode=123)
Max Perf mode: 4 (auto)
ERROR: Error assigning value 95 to attribute 'GPUTargetFanSpeed' (NB1:0[fan:16]) as specified in assignment '[fan:16]/GPUTargetFanSpeed=95' (Unknown Error).
ERROR: Error assigning value 95 to attribute 'GPUTargetFanSpeed' (NB1:0[fan:17]) as specified in assignment '[fan:17]/GPUTargetFanSpeed=95' (Unknown Error).
ERROR: Error assigning value 0 to attribute 'GPUGraphicsClockOffset' (NB1:0[gpu:11]) as specified in assignment '[gpu:11]/GPUGraphicsClockOffset[4]=0' (Unknown Error).
ERROR: Error assigning value 0 to attribute 'GPUMemoryTransferRateOffset' (NB1:0[gpu:11]) as specified in assignment '[gpu:11]/GPUMemoryTransferRateOffset[4]=0' (Unknown Error).
Attribute 'GPUFanControlState' (NB1:0[gpu:11]) assigned value 1.
I run the latest stable release, but have also tried the latest beta. A reboot seems to fix it, but the problem comes back quickly on either GPU #0 or GPU #11.
These kinds of problems are often caused by bad risers, wiring, or connections. Try swapping the cabling with a neighboring card and see whether the problem stays with the same card or moves to the neighboring one.
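To compare before and after a cable swap, it can help to tally which GPU index the miner complains about. Below is a minimal Python sketch, assuming the T-Rex log lines look like the two quoted in the question; the function name and the "[ OK ]" sample line are made up for illustration and are not part of T-Rex or HiveOS.

```python
import re
from collections import Counter

# Matches the GPU index in lines such as "... GPU #0, error code 999".
GPU_RE = re.compile(r"GPU #(\d+)")

def count_gpu_errors(lines):
    """Count WARN and [FAIL] lines per GPU index (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Only warnings and rejected shares are treated as errors here.
        if "WARN" not in line and "[FAIL]" not in line:
            continue
        match = GPU_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = [
    "WARN: NVML: can't get fan speed for GPU #0, error code 999",
    "[FAIL] 39/43 - Low difficulty or invalid share, 34ms ... GPU #0",
    "[ OK ] 40/44 - 34ms ... GPU #3",  # hypothetical non-error line
]

for gpu, n in sorted(count_gpu_errors(sample).items()):
    print(f"GPU #{gpu}: {n} error line(s)")
```

If the counts move to a different GPU index after the swap, the riser or cable is the likely culprit; if they stay with the same card, look at the card or its overclock.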
Thank you! It seems to be caused by the riser not recovering after an overclock goes outside what the GPU can handle. After a lot of experimenting, it seems that once I tweak the overclock out of the working range and start getting errors, the riser has to be power-cycled before the card works again. A simple reboot is not sufficient; the riser needs to be powered off. And, of course, before powering it off I need to restore a working configuration in HiveOS, or the problem comes right back when the miner starts, even after a full power-off.
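For spotting this stuck state from the overclock output, here is a rough Python sketch that extracts the failed attribute assignments from text shaped like the log in the question. The regular expression is fitted to those lines only, and both function names are hypothetical. Treating "every assignment failed with Unknown Error" as a sign that a power cycle is needed is a heuristic drawn from my experience above, not a documented rule.

```python
import re

# Fitted to lines like:
# ERROR: Error assigning value 95 to attribute 'GPUTargetFanSpeed' (NB1:0[fan:16]) ...
ASSIGN_ERR_RE = re.compile(
    r"Error assigning value (-?\d+) to attribute '(\w+)' \([^\[]*\[(\w+):(\d+)\]\)"
)

def failed_assignments(output):
    """Return (attribute, target, value) for each failed assignment."""
    return [
        (attr, f"{kind}:{idx}", int(value))
        for value, attr, kind, idx in ASSIGN_ERR_RE.findall(output)
    ]

def looks_stuck(output):
    """Heuristic (assumption): failed assignments plus 'Unknown Error'."""
    return bool(failed_assignments(output)) and "Unknown Error" in output

sample = (
    "SET POWER LIMIT: 125.0 W [Unknown Error] (exicode=123)\n"
    "ERROR: Error assigning value 95 to attribute 'GPUTargetFanSpeed' "
    "(NB1:0[fan:16]) as specified in assignment "
    "'[fan:16]/GPUTargetFanSpeed=95' (Unknown Error).\n"
    "ERROR: Error assigning value 0 to attribute 'GPUGraphicsClockOffset' "
    "(NB1:0[gpu:11]) as specified in assignment "
    "'[gpu:11]/GPUGraphicsClockOffset[4]=0' (Unknown Error).\n"
    "Attribute 'GPUFanControlState' (NB1:0[gpu:11]) assigned value 1.\n"
)

for attr, target, value in failed_assignments(sample):
    print(f"{target}: could not set {attr}={value}")
print("power cycle likely needed:", looks_stuck(sample))
```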
The remedy for me seems to be:
Restore a known working configuration for the card
Power off completely
Boot up
Replacing the last two steps with “Shutdown & reboot in 30 s” from the Hive web GUI will probably work too. I also tried (as a proof of concept, not recommended!) powering off the riser while the rig was running by pulling its plug and then soft-rebooting, which also worked (but may be harmful to the hardware).