Rig Stable for 2 weeks, then today it has rebooted 3 times

niner · March 9, 2022, 11:17pm

Hi All,

Wondering if others have seen this, or perhaps solved it.

I’m running a few 3060 LHR v2 cards on an MSI motherboard. For about 2 weeks they have been mining ETH+ALPH 24/7 using T-Rex with an LHR tune of 71. Each card is getting 35-35 MH/s on ETH and 500 MH/s on ALPH. The rig has been really stable, but today, without any changes to the OS version, OC settings, or anything else, it started to reboot every few hours. It loses hash rate, and the hash rate watchdog reboots it after 1 minute. I can also tell something is up because I hear the fans on the cards spike. They are locked at 65% with 3 external fans blowing on them, and their temps are consistent at 60-63 degrees.

Any thoughts about what might be causing the reboots?

itsZeroday · March 10, 2022, 12:19am

You would have to check the logs, but most likely it’s an overclocking-related issue.

keaton_hiveon · March 10, 2022, 3:08pm

When it loses hashrate, is it all cards or a specific one? What cards, clocks, driver, versions, etc.?

niner · March 10, 2022, 3:41pm

It looks like it bounces back and forth between them. The log below is for the most recent reboot an hour ago. Temps are still stable at 60-63 degrees. In my miner config I am running:

"lhr-tune": "72.0"
"lhr-autotune-step-size": "0.2"
"lhr-autotune-interval": "5"

If I am reading the log below correctly, the LHR tune has crept up to 75.6. If so, I bet it is banging on the limiter, which forces the hash rate to drop and causes the reboot. I’ve changed lhr-tune to 69 to see if it still reboots, and if it does I will test simply locking the LHR tune instead of letting it creep up.

-------------------20220310 09:24:45 --------------------
Mining at us2.alephium.herominers.com:1199 [51.81.66.172]
GPU #0: RTX 3060 - 459.37 MH/s, [LHR 75.6<>] [T:66C, P:136W, F:70%, E:3.43MH/W], 79/79 R:0% I:0%
GPU #1: RTX 3060 - 493.52 MH/s, [LHR 75.6<>] [T:64C, P:142W, F:60%, E:3.58MH/W], 88/88 R:0% I:0%
Hashrate: 952.89 MH/s, Shares/min: 7.226 (Avg. 1.159), Avg.P: 272W, Avg.E: 3.50MH/W
Uptime: 2 hours 24 mins 32 secs | Algo: blake3 | T-Rex v0.25.8

20220310 09:24:47 ethash [ OK ] 849/852 - 73.05 MH/s, 31ms ... GPU #1 | 4.36 G
20220310 09:24:47 ethash [ OK ] 850/853 - 73.05 MH/s, 35ms ... GPU #1 | 17.90 G
20220310 09:25:02 ethash [ OK ] 851/854 - 72.59 MH/s, 41ms ... GPU #0 | 23.99 G
20220310 09:25:13 TREX: Can't find nonce with device [ID=1, GPU #1], cuda exception: CUDA_ERROR_ILLEGAL_ADDRESS, try to reduce overclock to stabilize GPU state
20220310 09:25:13 WARN: Miner is going to shutdown...
20220310 09:25:13 Main loop finished. Cleaning up resources...
20220310 09:25:13 ApiServer: stopped listening on 127.0.0.1:4059
20220310 09:25:15 T-Rex finished.

t-rex exited (exitcode=0), waiting to cooldown a bit
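
For anyone wanting to check whether the crashes come from one card or from all of them, here is a minimal Python sketch that tallies the CUDA errors per GPU and picks up the last LHR value reported for each card. The regular expressions are only matched against the T-Rex v0.25.8 lines quoted in this thread, and the function name `summarize` is made up for illustration; other miner versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread (T-Rex v0.25.8).
# "Can.t" tolerates either a straight or a curly apostrophe in pasted logs.
CUDA_ERROR = re.compile(
    r"Can.t find nonce with device \[ID=(\d+), GPU #(\d+)\], cuda exception: (\w+)"
)
STATUS = re.compile(r"GPU #(\d+): .*?\[LHR ([\d.]+)")


def summarize(log_text):
    """Return (error counts keyed by (gpu, error name), last LHR value per gpu)."""
    errors = Counter()
    lhr = {}
    for line in log_text.splitlines():
        match = CUDA_ERROR.search(line)
        if match:
            errors[(int(match.group(2)), match.group(3))] += 1
            continue
        match = STATUS.search(line)
        if match:
            lhr[int(match.group(1))] = float(match.group(2))
    return errors, lhr


sample = """\
GPU #0: RTX 3060 - 459.37 MH/s, [LHR 75.6<>] [T:66C, P:136W, F:70%, E:3.43MH/W], 79/79 R:0% I:0%
GPU #1: RTX 3060 - 493.52 MH/s, [LHR 75.6<>] [T:64C, P:142W, F:60%, E:3.58MH/W], 88/88 R:0% I:0%
20220310 09:25:13 TREX: Can't find nonce with device [ID=1, GPU #1], cuda exception: CUDA_ERROR_ILLEGAL_ADDRESS, try to reduce overclock to stabilize GPU state
"""

errors, lhr = summarize(sample)
print(dict(errors))  # error counts per (GPU index, CUDA error name)
print(lhr)           # last LHR value seen per GPU index
```

Running this over several saved crash logs would show whether the same GPU index keeps appearing in the error lines.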

keaton_hiveon · March 10, 2022, 6:02pm

What clocks on the crashing gpu?

niner · March 10, 2022, 8:57pm

Core is 1500, Memory is 2750, fans locked at 65%, no lock on the power.

Even after pushing the lhr-tune down to 69.0, it crashed again. That log is below.

Mining at us2.alephium.herominers.com:1199 [51.81.66.172]
GPU #0: RTX 3060 - 478.24 MH/s, [LHR 70.4<>] [T:64C, P:133W, F:70%, E:3.57MH/W], 22/22 R:0% I:0%
GPU #1: RTX 3060 - 509.16 MH/s, [LHR 70.4<>] [T:62C, P:134W, F:60%, E:3.72MH/W], 15/15 R:0% I:0%
Hashrate: 987.40 MH/s, Shares/min: 7.295 (Avg. 1.093), Avg.P: 271W, Avg.E: 3.64MH/W
Uptime: 36 mins | Algo: blake3 | T-Rex v0.25.8

20220310 11:09:11 conn1: ethash epoch: 478, diff: 281.99 M
20220310 11:09:12 ethash [ OK ] 199/199 - 66.65 MH/s, 36ms ... GPU #1 | 2.46 G
20220310 11:09:22 ethash [ OK ] 200/200 - 68.19 MH/s, 40ms ... GPU #1 | 720.57 M
20220310 11:09:28 ethash [ OK ] 201/201 - 67.11 MH/s, 28ms ... GPU #1 | 512.83 M
20220310 11:09:30 ethash [ OK ] 202/202 - 68.40 MH/s, 26ms ... GPU #0 | 312.85 M
20220310 11:09:30 ethash [ OK ] 203/203 - 68.40 MH/s, 28ms ... GPU #1 | 392.57 M
20220310 11:09:31 TREX: Can't find nonce with device [ID=1, GPU #1], cuda exception: CUDA_ERROR_ILLEGAL_ADDRESS, try to reduce overclock to stabilize GPU state
20220310 11:09:31 WARN: Miner is going to shutdown...
20220310 11:09:31 Main loop finished. Cleaning up resources...
20220310 11:09:31 ApiServer: stopped listening on 127.0.0.1:4059
20220310 11:09:32 T-Rex finished.

t-rex exited (exitcode=0), waiting to cooldown a bit

and

20220310 11:09:37 conn1: Authorizing...
20220310 11:09:37 conn1: Authorized successfully.
20220310 11:09:37 conn1: ethash epoch: 478, diff: 1.13 G
20220310 11:09:37 conn2: Authorized successfully.
20220310 11:09:37 GPU #1: [LHR 69.0<>] intensity 21
20220310 11:09:37 conn2: Extranonce is set to: 21a5
20220310 11:09:42 GPU #1: generating DAG 4.73 GB for epoch 478 ...
20220310 11:09:42 GPU #0: generating DAG 4.73 GB for epoch 478 ...
20220310 11:09:55 GPU #1: DAG generated [crc: 0f064dcb, time: 13588 ms], memory left: 6.86 GB
20220310 11:09:55 GPU #0: DAG generated [crc: 0f064dcb, time: 13663 ms], memory left: 6.86 GB
20220310 11:10:00 conn1: Extranonce is set to: f6a89d
20220310 11:10:00 conn1: ethash epoch: 478, diff: 1.13 G
20220310 11:10:17 GPU #1: using dual ratio 9
20220310 11:10:18 GPU #0: using dual ratio 8
20220310 11:10:29 GPU #1: using LHR dual ratio 15
20220310 11:10:29 GPU #0: using LHR dual ratio 15
20220310 11:10:32 GPU #0: target hashrate for unlocker - 33.94 MH/s
20220310 11:10:32 GPU #1: target hashrate for unlocker - 33.67 MH/s
^C20220310 11:10:34 WARN: shutdown t-rex, signal [2] received
20220310 11:10:34 Main loop finished. Cleaning up resources...
20220310 11:10:34 ApiServer: stopped listening on 127.0.0.1:4059
20220310 11:10:35 ethash [ OK ] 1/1 - 0.00 H/s, 166ms ... GPU #0 | 1.14 G
20220310 11:10:36 T-Rex finished.

Miner:   t-rex
Version: 0.25.8

Trying to release TIME_WAIT sockets:
tcp        0      0 127.0.0.1:52506         127.0.0.1:4059          TIME_WAIT  
tcp        0      0 127.0.0.1:52500         127.0.0.1:4059          TIME_WAIT  
tcp        0      0 127.0.0.1:52494         127.0.0.1:4059          TIME_WAIT  
tcp        0      0 127.0.0.1:52488         127.0.0.1:4059          TIME_WAIT  
tcp        0      0 127.0.0.1:52512         127.0.0.1:4059          TIME_WAIT  

20220310 11:10:40 T-Rex NVIDIA GPU miner v0.25.8  -  [Linux]
20220310 11:10:40 r.bd8c4366e105
20220310 11:10:40 
20220310 11:10:40 
20220310 11:10:40 NVIDIA Driver v470.86
20220310 11:10:40 
20220310 11:10:40 + GPU #0: [00:01.0|2504] GeForce RTX 3060, 12053 MB
20220310 11:10:40 + GPU #1: [00:03.0|2504] GeForce RTX 3060, 12053 MB
20220310 11:10:40 
20220310 11:10:40 WARN: DevFee 1% (ethash/blake3)
20220310 11:10:40 
20220310 11:10:40 Pools for ethash:
20220310 11:10:40 URL : stratum+tcp://daggerhashimoto.usa-east.nicehash.com:3353
20220310 11:10:40 USER: 3B5g5K4b5yeLbfJ7G6zMWL4vLvwHbzNUNx
20220310 11:10:40 PASS: x
20220310 11:10:40 
20220310 11:10:40 Pools for blake3:
20220310 11:10:40 URL : stratum+tcp://us2.alephium.herominers.com:1199
20220310 11:10:40 USER: 1AESHrJiKpRmgpQQC84w82DptNZPfhpk5Ezyt6MEbkNb8.EVGA_Miner
20220310 11:10:40 PASS: 
20220310 11:10:40 
20220310 11:10:40 WARN: Built-in watchdog has been disabled!
20220310 11:10:40 Starting on: daggerhashimoto.usa-east.nicehash.com:3353
20220310 11:10:40 Starting on: us2.alephium.herominers.com:1199
20220310 11:10:40 ApiServer: HTTP server started on 127.0.0.1:4059
20220310 11:10:40 ---------------------------------------------------
20220310 11:10:40 For control navigate to: http://127.0.0.1:4059/trex
20220310 11:10:40 ---------------------------------------------------
20220310 11:10:40 conn1: Using protocol: stratum2.
20220310 11:10:40 conn2: Authorizing...
20220310 11:10:40 GPU #0: [LHR 69.0<>] intensity 21
20220310 11:10:40 conn1: Extranonce is set to: 7c9deb
20220310 11:10:40 conn1: Authorizing...
20220310 11:10:40 conn1: Authorized successfully.
20220310 11:10:40 conn1: ethash epoch: 478, diff: 1.13 G
20220310 11:10:40 GPU #1: [LHR 69.0<>] intensity 21
20220310 11:10:40 conn2: Authorized successfully.
20220310 11:10:40 conn2: Extranonce is set to: 3636
20220310 11:10:44 GPU #1: generating DAG 4.73 GB for epoch 478 ...
20220310 11:10:44 GPU #0: generating DAG 4.73 GB for epoch 478 ...
20220310 11:10:57 GPU #1: DAG generated [crc: 0f064dcb, time: 13582 ms], memory left: 6.86 GB
20220310 11:10:58 GPU #0: DAG generated [crc: 0f064dcb, time: 13634 ms], memory left: 6.86 GB
20220310 11:11:20 GPU #1: using dual ratio 8
20220310 11:11:20 GPU #0: using dual ratio 8

> Miner screen is running

keaton_hiveon · March 10, 2022, 9:08pm

Reduce memory after each crash and reboot until it’s stable.

niner · March 11, 2022, 9:48pm

That’s what I’m working on. Should I focus on only memory, or should I be looking at core clock as well?

As always, thank you for all your help.

keaton_hiveon · March 12, 2022, 3:01am

Just memory. As long as your core is 1500, leave it there.
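
The step-down approach described above can be sketched in a few lines of Python. The starting value of 2750 comes from this thread; the step of 50 and the floor of 2500 are assumptions chosen only for illustration, and `memory_steps` is a hypothetical helper, not part of any miner or OS tooling.

```python
def memory_steps(start, step=50, floor=2500):
    """Yield successively lower memory values to try, one per crash.

    start: current memory setting (2750 in this thread).
    step:  how much to lower it after each crash (assumed value).
    floor: lowest value worth trying before looking elsewhere (assumed value).
    """
    clock = start
    while clock - step >= floor:
        clock -= step
        yield clock


# Core stays at 1500 throughout; only the memory value changes.
candidates = list(memory_steps(2750, step=50, floor=2500))
print(candidates)  # [2700, 2650, 2600, 2550, 2500]
```

Apply the next value only after a crash, and let the rig run for a while at each setting before deciding it is stable.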
