Today I started shifting my main Nvidia rig over from ETH to NEOXA (kawpow). I set my GPUs to the standard presets shown on HiveOS, but I have been running into an issue with a few of my cards.
I have a 12 GPU rig which has been running fine for many months now. However, when I started up NEOX, three of my cards began crashing instantly: one 3060Ti FE and two EVGA 3070s. They all show the same error (with slightly different core and memory values depending on whether the card is a 3060Ti or a 3070):
(exitcode=3)
SET GPU CLOCKS: 1380 MHz [Not Supported]
(exitcode=3)
Max Perf mode: 4
ERROR: Error assigning value 0 to attribute ‘GPUGraphicsClockOffset’
(Nvidia_Main:0[gpu:2]) as specified in assignment
‘[gpu:2]/GPUGraphicsClockOffset[4]=0’ (Unknown Error).
ERROR: Error assigning value 1800 to attribute
‘GPUMemoryTransferRateOffset’ (Nvidia_Main:0[gpu:2]) as specified in
assignment ‘[gpu:2]/GPUMemoryTransferRateOffset[4]=1800’ (Unknown
Error).
ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
(Nvidia_Main:0[fan:3]) as specified in assignment
‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
(Nvidia_Main:0[fan:4]) as specified in assignment
‘[fan:4]/GPUTargetFanSpeed=100’ (Unknown Error).
Attribute ‘GPUFanControlState’ (Nvidia_Main:0[gpu:2]) assigned value 1.
(exitcode=100)
The logs claim that the given core and memory values are not supported, despite these values being well under standard, working on all the other cards, and, in the case of the 3060Ti, working for ETH.
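For anyone facing a similar wall of errors, here is a minimal Python sketch that pulls out which GPU or fan index and which attribute failed. It assumes the log has the same wording as the excerpt above; the function name `failed_assignments` and the regex are my own illustration, not part of HiveOS or nvidia-settings.

```python
import re

# Excerpt copied from the log above. The lines are wrapped, so the text is
# flattened to a single line before matching.
LOG = """\
ERROR: Error assigning value 0 to attribute ‘GPUGraphicsClockOffset’
(Nvidia_Main:0[gpu:2]) as specified in assignment
‘[gpu:2]/GPUGraphicsClockOffset[4]=0’ (Unknown Error).
ERROR: Error assigning value 1800 to attribute
‘GPUMemoryTransferRateOffset’ (Nvidia_Main:0[gpu:2]) as specified in
assignment ‘[gpu:2]/GPUMemoryTransferRateOffset[4]=1800’ (Unknown
Error).
ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
(Nvidia_Main:0[fan:3]) as specified in assignment
‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
"""

# Hypothetical pattern: it only covers the "Error assigning value" wording seen above.
PATTERN = re.compile(
    r"Error assigning value (\S+) to attribute [‘'](\w+)[’'] "
    r"\([^\[\)]*\[(gpu|fan):(\d+)\]\)"
)


def failed_assignments(log_text):
    """Return (target kind, index, attribute, value) for each failed assignment."""
    flat = " ".join(log_text.split())  # undo the line wrapping
    return [
        (kind, int(index), attribute, value)
        for value, attribute, kind, index in PATTERN.findall(flat)
    ]


for kind, index, attribute, value in failed_assignments(LOG):
    print(f"{kind}:{index} rejected {attribute}={value}")
```

Grouping the failures by index makes it easier to see whether the same card is rejecting every setting, which points at the card not responding rather than at one bad overclock value.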
I will post the rig specs with it working on ETH and my overclock profile for NEOX below:
Here are the current settings when running ETH.
All cards work and none of them crash. When running NEOX / RVN the ones that crash are:
The second 3060Ti (first FE)
The 3070 with the 300W power limit
The 3070 with the 264W power limit.
I am working to update the Nvidia driver now to the latest 510 build.
Update: Driver is now on 510.85.02. Will test on ETH before trying NEOX / RVN again with the latest driver.
I've checked exitcode 3; this is typically caused by a lack of system memory for starting a given GPU. However, whether I run ETH or NEOX, HiveOS shows 5GB of free memory.
I checked Max Perf Mode: 4 and this is the same setting I have on ETH.
I checked exitcode 100, and while it is not a specific error, it typically appears when the GPU cannot be found.
I did find other people with similar issues, although never with exactly the same setup.
(Some have 12 GPUs and the same or a similar board but are not using HiveOS, while others using HiveOS are having this issue with 6 or 8 GPU systems.)
That error can come from a GPU crashing and not applying the requested clocks. Lower the OC on the highest card in the list with any errors and reboot. Repeat until stable.
Updating the driver seems to have fixed the one 3060Ti, as it was able to start this time.
Edit: The same 3060Ti is not consistent; sometimes it will start up and sometimes it will not. I have dropped its settings too. Unsure what is causing this.
I have lowered the two non-functioning 3070s significantly, setting core to -200 and memory to 1900 compared to the default (0) and 2100 they were set to before, but there was no change. I'm not sure what would be causing this on these two and not on any of the other 3070s. They performed exactly the same, with the same settings as all the others, on ETH.
I'm going to try changing mining software and see if that improves anything. Currently using t-rex but will try nbminer and lolminer.
I think it might be a power issue, as NEOX drastically increases power usage. I will try changing the power supply and unplugging my A2000s to free up some power. If this does not fix the issue then I will go back to trying to determine what it could be.
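One way to check the power theory is to compare total reported draw on ETH against kawpow. Below is a rough Python sketch that reads per-card draw through `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The helper names are made up for illustration, and it assumes `nvidia-smi` is on the PATH and that the cards report `power.draw`; rows that report `[N/A]` are skipped.

```python
import subprocess

# Standard nvidia-smi query; the field list can be adjusted as needed.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_power_rows(csv_text):
    """Parse 'index, name, draw, limit' lines into tuples, skipping unreadable rows."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            index, name, draw, limit = [field.strip() for field in line.split(",")]
            rows.append((int(index), name, float(draw), float(limit)))
        except ValueError:
            # Wrong field count, or a value such as "[N/A]" on cards without a sensor.
            continue
    return rows


def total_draw_watts(rows):
    """Sum the reported draw across all parsed cards."""
    return sum(draw for _, _, draw, _ in rows)


def read_power_rows():
    """Run nvidia-smi on the rig and parse its output (needs the Nvidia driver)."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_power_rows(output)

# On the rig itself: print(total_draw_watts(read_power_rows()))
```

Running it once per algorithm gives a number to set against the PSU rating. Note that software readings cover only the cards, not risers, fans, or the motherboard.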
I believe I have found the issue, it comes down to two likely possibilities:
First, the GPUs were not starting because there was insufficient power for them to start. Kawpow increased the power draw of my 3060Tis and 3070s by about 30-40W each. This means I may have been power capped.
Second, the software for running kawpow was not able to initialize more than 10 GPUs, as it was only ever two that would be down at any given time. I am now running the system without the two A2000s and all GPUs have started successfully. If anyone has a similar issue, make sure that:
You have enough power, taking the higher power requirements into account.
You don't go above 8-10 GPUs on a given system.
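The power check above can be sketched as a quick calculation. Everything numeric in the example is a placeholder, not a measurement from this rig: the 2000W PSU, the 130W per-card ETH draw, the 150W for the rest of the system, and the 80% usable fraction (a common rule of thumb) are all assumptions. Only the 30-40W per-card increase comes from the posts above.

```python
def power_headroom(psu_watts, eth_draw_watts, extra_per_card_watts,
                   other_load_watts=0, usable_fraction=0.8):
    """Return the watts left over (negative means over budget).

    eth_draw_watts: per-card draw measured on ETH.
    extra_per_card_watts: expected increase per card on kawpow.
    usable_fraction: share of the PSU rating treated as safe continuous load
    (assumed 0.8 here, adjust to taste).
    """
    usable = psu_watts * usable_fraction
    projected = (
        sum(eth_draw_watts)
        + extra_per_card_watts * len(eth_draw_watts)
        + other_load_watts
    )
    return usable - projected


# Hypothetical rig: ten cards at 130W each on ETH, 150W for everything else.
cards = [130] * 10
print("ETH headroom:", power_headroom(2000, cards, 0, other_load_watts=150))
print("kawpow headroom:", power_headroom(2000, cards, 40, other_load_watts=150))
```

With these placeholder numbers the rig fits on ETH but goes over budget once each card draws 40W more, which would match the symptom of a couple of cards refusing to start.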