MSI RTX 3090 x Trio: Reboot of Trex miner after every hour or so

zeek · November 19, 2021, 6:32am

I am having an issue with Trex miner restarting every hour or so, even with the lowest OC settings.
The logs do not show any errors. Sometimes one of the GPU fans shows "err", and for that case I have set Autofan to reboot on any error, but nothing shows up in the logs either way. The miner uptime clock simply restarts.

I was wondering what the problem could be. I am about to expand my mining farm by 14 more RTX 3090s, so I would really appreciate it if someone could help me with this issue before the expansion. Should I be worried about it at all?

By the way, this started after upgrading HiveOS from version 0.6-211@211112 to 0.6-211@211117.

Here are the settings and equipment that I am using:

Current setting:

Fan: 80% - 85%
Average Temp: 46 degree C
Core: -300
Mem: 2600
PL: 300
Average MH per GPU: 123 mh/s

Equipment used:

2 MSI RTX 3090 X Trio cards are 24268 MB Micron GDDR6X
1 MSI RTX 3090 X Trio card is 24267 MB Micron GDDR6X
Motherboard: X470 GAMING PLUS MAX (MS-7B79)
PSU: HX1200 corsair
8 GB Ram
Ryzen 5 cpu
128 gb mini SATA or ssd
3 gpu riser

keaton_hiveon · November 19, 2021, 4:30pm

use a locked core clock around 1130mhz and higher fan speed, no pl needed.

if you get an error after doing the above mentioned things, lower memory clock and reboot each time.

zeek · November 19, 2021, 7:37pm

Let me try that and see for a day.

zeek · November 19, 2021, 8:24pm

After applying the suggested settings, these changes occurred:

Hash Rate per GPU has increased from 123 to 124.1
I have Autofan setup so the temp is still 43 degree C
Power consumption decreased from 900w to 863w

So far, looks good, will update regarding reboot issue after I have more data.

Thank you,
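For anyone comparing the before/after numbers above, here is a rough sketch of the efficiency arithmetic. It assumes the 900 W and 863 W figures are whole-rig draw and that all three cards hash at the quoted per-GPU rate; neither is stated explicitly, so treat the result as approximate.

```python
# Rig-level efficiency before and after switching to a locked core clock.
# The hashrate and wattage figures are the ones reported in this thread.
# GPU_COUNT = 3 and "wattage is whole-rig draw" are assumptions.
GPU_COUNT = 3


def efficiency_mh_per_watt(mh_per_gpu: float, rig_watts: float, gpus: int = GPU_COUNT) -> float:
    """Return total hashrate divided by total power draw (MH/s per watt)."""
    return mh_per_gpu * gpus / rig_watts


before = efficiency_mh_per_watt(123.0, 900.0)  # core offset -300, PL 300
after = efficiency_mh_per_watt(124.1, 863.0)   # locked core clock, no PL
gain_pct = (after / before - 1.0) * 100.0

print(f"before: {before:.3f} MH/s per W")
print(f"after:  {after:.3f} MH/s per W")
print(f"gain:   {gain_pct:.1f}%")
```

Under those assumptions the locked core clock works out to roughly 5% more hashrate per watt.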

keaton_hiveon · November 19, 2021, 8:38pm

if it drops below 124mh set the fan speed to 100, the memory on those runs really hot and the autofan goes by core temp, not memory

zeek · November 19, 2021, 8:42pm

Thank you for the info regarding Autofan. The temps are around 38-43 degree C, so far, and hash rate is on point 124.1. I have increased the min fan speed to 90%.

Is it OK to go up to 100% fan speed, given that my rig runs 24/7?

keaton_hiveon · November 19, 2021, 8:55pm

Fans are cheap. I run all of my gddr6x cards at 100%. Some for over a year now.

zeek · November 20, 2021, 2:01pm

Updates:

The miner now runs fine for 6 to 7 hours, and then the miner uptime resets.
Is that normal?
The hash rate stays steady at 124.1 per GPU
I have kept the fan speed at 90%
The temps are still at 39-43 degree C

Possible issue:
The hashrate stats graph doesn’t show any downtime. Could this be a problem with how the HiveOS application exchanges data between the Trex mining pool server, the HiveOS server, and the client side?

keaton_hiveon · November 20, 2021, 2:06pm

miner could restart for various reasons, but only takes a minute or so to do so, not enough time to impact the stats graphs
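Since short restarts do not show up on the graphs, one way to pin down when they happen is to sample the miner uptime periodically and flag any point where it goes backwards. A minimal sketch follows; how the samples are collected is left open, and the sample data here is made up for illustration.

```python
from typing import List, Tuple


def find_restarts(samples: List[Tuple[float, float]]) -> List[float]:
    """Return the timestamps at which the miner uptime went backwards.

    samples: (timestamp_seconds, miner_uptime_seconds) pairs in time order.
    A drop in uptime between two consecutive samples means the miner
    restarted somewhere in that interval.
    """
    restarts = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:
            restarts.append(ts)
    return restarts


# Hypothetical samples taken every 10 minutes; the miner restarts once.
samples = [(0, 3000), (600, 3600), (1200, 4200), (1800, 120), (2400, 720)]
print(find_restarts(samples))
```

Lining the flagged timestamps up against fan, temperature, and memory clock changes may help narrow down which of the possible causes is responsible.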

zeek · November 20, 2021, 2:24pm

Could you help me understand the possible reasons for the miner to restart? So that I could look into it.

keaton_hiveon · November 20, 2021, 2:27pm

could be due to hot memory modules, or just too aggressive memory clocks, or a faulty riser/cable, cpu issue etc etc. turn the fans to 100 and see if it helps.

zeek · November 20, 2021, 2:30pm

Ok I will try that, and if not as you suggested will try lowering the mem clock too. I will update after 24 hour of testing. Thank you.

zeek · November 21, 2021, 7:44am

24-hour observation after changing the fan speed and making a few other adjustments to the miner:
Changes:

Fan speed changed to 100%, but shows 99% on hiveos application
GPU temps are down to 32- 40 degree C (Night); 42- 46 degree C (Day)
Updated to HiveOS version 0.6-211@211118, which included a TBM update with a 20% boost on LHR GPUs
Tested by switching the miner to TeamBlackMiner (TBM)
After switching to TBM, the hash rate increased from 124 to 129.9 mh/s
The miner has not restarted for 10 hours now
With TBM I have 99.81% accepted shares, whereas with Trex miner I had 99.9% accepted shares.
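To put the slightly lower acceptance rate in context, a rough comparison is to scale each miner's reported hashrate by its accepted-share percentage. This is only an approximation: it ignores anything else that differs between the two miners, such as dev fees, and the function name is just for illustration.

```python
def effective_mh(reported_mh: float, accepted_pct: float) -> float:
    """Scale the reported hashrate by the share acceptance rate (rough estimate)."""
    return reported_mh * accepted_pct / 100.0


trex = effective_mh(124.0, 99.9)   # Trex figures from this thread
tbm = effective_mh(129.9, 99.81)   # TBM figures from this thread

print(f"Trex: {trex:.2f} MH/s effective")
print(f"TBM:  {tbm:.2f} MH/s effective")
print(f"diff: {tbm - trex:.2f} MH/s")
```

By this estimate the drop in accepted shares is much smaller than the gain in reported hashrate.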

So far, things look better. Hopefully it stays this way. Thank you for your suggestions; the performance of the rig seems to have improved.

I will post updates on any future changes. I will be adding 14 more RTX 3090 GDDR6X GPUs gradually, depending on the availability of cards. Once I have more GPUs I will add air conditioning to the room. Hopefully people find the info provided by keaton_hiveon helpful.

keaton_hiveon · November 21, 2021, 8:19pm

happy to help!

smarkhive · November 22, 2021, 8:58am

I am having the same issue (great thread!!)

for the 3090 (Founders), what does 1130MHz correspond to as a negative offset? i.e. I am using -300 for the core and 2000 for the mem (I think 1130 is -310 but am unsure)

many thanks

keaton_hiveon · November 22, 2021, 2:16pm

Just use 1130mhz instead. Core offsets just waste power and make it less stable.

smarkhive · November 23, 2021, 11:03am

thanks @Keaton
yes I was told that offsets are less efficient than the absolute value.

I was asking and hoping to find a look up table to know which corresponds to which

e.g. for the 3090 FE, does my -300 core correspond to what you say to use as 1130? I don’t think so, as 1130 doesn’t seem to work for me whereas -300 does (I use absolute values on my 3080s)

or, I guess, put another way: what is the default core speed for the 3080 FE?

thank you !

keaton_hiveon · November 23, 2021, 1:30pm

Cards don’t behave the same with offsets and locked core clocks. With an offset, the core clock will generally fluctuate somewhat depending on temp, load, etc. You can use nvtool to determine the current clock speed, but again, you don’t need more than 1130mhz on a 3090 to get 125mh, so there’s no reason to work it harder and generate more heat than needed by running anything higher. (Unless you’re trying to set some record on LN2 or similar.)

Are you saying you have a faulty card that doesn’t take a locked core clock?
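For checking what clock a card is actually running at, here is a small sketch that reads the values through `nvidia-smi` rather than nvtool (a substitution on my part, since the `nvidia-smi` query fields are the ones I can vouch for). The sample text is made up so the parser can be tried without a GPU, and note that `temperature.gpu` is the core temperature, not the memory temperature.

```python
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi, in output column order.
QUERY_FIELDS = ["index", "clocks.sm", "clocks.mem", "temperature.gpu", "fan.speed", "power.draw"]


def parse_smi_csv(text: str) -> List[Dict[str, str]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):  # skip malformed lines
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpus() -> List[Dict[str, str]]:
    """Query the local GPUs; requires nvidia-smi to be on the PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)


# Made-up sample output, not real readings from any rig in this thread.
sample = "0, 1130, 10251, 43, 90, 287.50\n1, 1130, 10251, 41, 90, 285.10\n"
for gpu in parse_smi_csv(sample):
    print(gpu["index"], gpu["clocks.sm"], "MHz core")
```

On a real rig, calling `read_gpus()` a few times under load should show whether the core clock holds steady (locked) or drifts (offset).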

smarkhive · November 23, 2021, 3:16pm

ah
1130 core and 2400 for the mem does increase it to 121 and a bit. it just took quite a while to get there!
edit: after a further while it went down again.

2400 makes sense as it’s the same memory setting on the 3080 right? and my 3080’s generally run best at 1050core/2400mem

3090 is 1865 MHz by default
3080 is 1920 MHz by default

keaton_hiveon · November 23, 2021, 5:24pm

is your card throttling? are you running 100% fan?

i aim for 100mh/s on my 3080s, which requires more than 1050 core, i run mine at 1080.
3090s i aim for 125, i run 1130/2700+, which nets me 124.6+