HiveOS crashing intermittently - how to tell which GPU failed?

Hi everyone
I have 10 Nvidia cards in my rig, and the stupid thing crashes every hour or so, sometimes less, sometimes more.

Sometimes a GPU will crash (T-Rex miner will say ‘get get temperatures’ or something), and sometimes the OS will just hang and crash completely (can't ping the machine, yet it stays powered on, doesn't reboot, and shows up as ‘offline’ in the web interface).

Is there any way to see which GPU is causing the problem? I assume it's probably one finicky card that crashes and takes down the whole machine.

Here's an example: the rig crashes at ~11:23pm. This is what I can find in the various logs:

Oct 14 23:07:56 Main kernel: [ 56.264671][ T3684] nvidia-uvm: Loaded the UVM driver, major device number 240.
Oct 14 23:23:32 Main kernel: [ 992.409853][ T3736] NVRM: GPU at PCI:0000:2c:00: GPU-adf64899-954b-7e00-95f9-d8a98b2a3d92
Oct 14 23:23:32 Main kernel: [ 992.409856][ T3736] NVRM: GPU Board Serial Number: 15xxxxxxxxxx097
Oct 14 23:23:32 Main kernel: [ 992.409859][ T3736] NVRM: Xid (PCI:0000:2c:00): 62, pid=3634, 0000(0000) 00000000 00000000
Oct 14 23:23:32 Main kernel: [ 992.451903][ T3736] NVRM: Xid (PCI:0000:2c:00): 45, pid=3634, Ch 00000010
Oct 14 23:24:55 Main kernel: [ 1075.156073][T26033] DTS: killing sk:0000000033b0a5ed (127.0.0.1:57594 -> 127.0.0.1:4059) state 6
Oct 14 23:24:55 Main kernel: [ 1075.156078][T26033] DTS: killing sk:0000000047f19984 (127.0.0.1:4059 -> 127.0.0.1:57594) state 8
Oct 14 23:25:37 Main kernel: [ 1116.870823][T27028] sysrq: Emergency Sync
Oct 14 23:25:37 Main kernel: [ 1116.870851][T27028] sysrq: Emergency Remount R/O

Oct 14 23:22:47 Main hive-watchdog[1466]: OK t-rex 707828 kHs >= 400000 kHs
Oct 14 23:22:54 Main avg_khs[2576]: Preparing algorithm statistics for upload
Oct 14 23:22:54 Main avg_khs[2576]: Uploading ethash statistic saved 2021-10-14 23:22:54.145578873
Oct 14 23:22:54 Main avg_khs[2576]: Uploading algorithm statistics completed
Oct 14 23:22:54 Main avg_khs[2576]: {"params":{"avg_khs":{"ethash":[681253,170313]}}}
Oct 14 23:23:10 Main xinit[2690]: 14/10/2021 23:23:10 idle keyboard: turning X autorepeat back on.
Oct 14 23:23:32 Main kernel: [ 992.409853][ T3736] NVRM: GPU at PCI:0000:2c:00: GPU-adf64899-954b-7e00-95f9-d8a98b2a3d92
Oct 14 23:23:32 Main kernel: [ 992.409856][ T3736] NVRM: GPU Board Serial Number: 156xxxxxxxxxx97
Oct 14 23:23:32 Main kernel: [ 992.409859][ T3736] NVRM: Xid (PCI:0000:2c:00): 62, pid=3634, 0000(0000) 00000000 00000000
Oct 14 23:23:32 Main kernel: [ 992.451903][ T3736] NVRM: Xid (PCI:0000:2c:00): 45, pid=3634, Ch 00000010
Oct 14 23:23:47 Main hive-watchdog[1466]: OK LA(5m): 0.72 < 44.0, LA(1m): 1.56 < 88.0
Oct 14 23:23:47 Main hive-watchdog[1466]: OK t-rex 709912 kHs >= 400000 kHs
Oct 14 23:23:54 Main avg_khs[2576]: {"params":{"avg_khs":{"ethash":[708358,182144]}}}
Oct 14 23:24:07 Main hive-watchdog[1466]: BARK t-rex 0 kHs < 400000 kHs for 39 seconds
Oct 14 23:24:17 Main hive-watchdog[1466]: BARK t-rex 0 kHs < 400000 kHs for 49 seconds
Oct 14 23:24:27 Main hive-watchdog[1466]: BARK t-rex 0 kHs < 400000 kHs for 59 seconds
Oct 14 23:24:37 Main hive-watchdog[1466]: BARK t-rex 0 kHs < 400000 kHs for 69 seconds
Oct 14 23:24:37 Main hive-watchdog[1466]: > Sending warning with payload to http://api.hiveos.farm
Oct 14 23:24:37 Main hive-watchdog[1466]: OK
Oct 14 23:24:37 Main hive-watchdog[1466]: —
Oct 14 23:24:37 Main hive-watchdog[1466]: Restarting t-rex after 1 minutes

TREX MINER LOG:
--------------20211014 23:23:37 ---------------
Mining at us-eth.2miners.com:2020, diff: 8.73 G
GPU #0: RTX 3080 Ti - 61.02 MH/s, [LHR 71<>] [T:56C, P:229W, F:99%, E:266kH/W], 3/3 R:0% I:0%
GPU #1: RTX 3080 - 64.22 MH/s, [LHR 68<>] [T:57C, P:209W, F:57%, E:307kH/W], 7/7 R:0% I:0%
GPU #2: RTX 3070 Ti - 42.42 MH/s, [LHR 71<>] [T:57C, P:149W, F:88%, E:285kH/W], 6/6 R:0% I:0%
GPU #3: RTX 3090 - 112.88 MH/s, [T:58C, P:299W, F:96%, E:378kH/W], 14/14 R:0% I:0%
GPU #4: RTX 3070 - 55.18 MH/s, [T:57C, P:109W, F:59%, E:506kH/W], 8/8 R:0% I:0%
GPU #5: RTX 3080 - 93.87 MH/s, [T:55C, P:209W, F:63%, E:449kH/W], 13/13 R:0% I:0%
GPU #6: RTX 3080 - 72.55 MH/s, [T:56C, P:191W, F:42%, E:382kH/W], 11/11 R:0% I:0%
GPU #7: RTX 3080 - 83.37 MH/s, [T:56C, P:199W, F:82%, E:419kH/W], 16/16 R:0% I:0%
GPU #8: RTX 3090 - 111.66 MH/s, [T:64C, P:299W, F:96%, E:373kH/W], 12/12 R:0% I:0%
Hashrate: 697.16 MH/s, Shares/min: 6.017 (Avg. 5.825), Avg.P: 1892W, Avg.E: 368kH/W
Max diff share was found by GPU #2, diff: 414.70 G
Uptime: 15 mins 40 secs | Algo: ethash | T-Rex v0.24.2

20211014 23:23:38 ethash epoch: 447, block: 13420246, diff: 8.73 G
20211014 23:24:38 WARN: shutdown t-rex, signal [2] received
20211014 23:24:38 Main loop finished. Cleaning up resources…

Is this the line I am looking for?
Oct 14 23:23:32 Main kernel: [ 992.409856][ T3736] NVRM: GPU Board Serial Number: 156xxxxxxxx097

Does that mean the GPU with that serial number crashed?

PCI:0000:2c:00 is what you are looking for. In Hive, under every GPU number there is a series of digits (the PCI bus ID).
Check which one is 2c:00.0; I think that should be the one. Do you have that number under one of the GPUs? You can also try disconnecting the cards one by one and see whether it stops crashing.
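
If you would rather not match the bus ID by eye, here is a minimal Python sketch of the same idea: pull the PCI address out of every `NVRM: Xid` line and look it up in a GPU list. It assumes the list comes from `nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader`, that all cards sit in a single PCI domain (so the domain part is ignored), and that the log lines look like the ones posted above. The function names are made up for illustration.

```python
import re

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:2c:00): 62, pid=3634, ...
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\): (\d+)"
)

def xid_events(log_text):
    """Return a list of (bus_key, xid_code) tuples found in the log text."""
    events = []
    for m in XID_RE.finditer(log_text):
        _domain, bus, dev, code = m.groups()
        # Assumption: one PCI domain, so "bus:device" is enough to identify a card.
        events.append((f"{bus}:{dev}".lower(), int(code)))
    return events

def bus_key(smi_bus_id):
    """Reduce an nvidia-smi bus id like '00000000:2C:00.0' to '2c:00'."""
    parts = smi_bus_id.strip().lower().split(":")
    bus, dev_fn = parts[-2], parts[-1]
    return f"{bus}:{dev_fn.split('.')[0]}"

def match_gpus(log_text, smi_csv):
    """Map each faulting bus to its (index, name) row and the Xid codes seen."""
    by_bus = {}
    for line in smi_csv.strip().splitlines():
        index, bus_id, name = [field.strip() for field in line.split(",", 2)]
        by_bus[bus_key(bus_id)] = (int(index), name)
    report = {}
    for key, code in xid_events(log_text):
        entry = report.setdefault(key, {"gpu": by_bus.get(key), "xids": []})
        entry["xids"].append(code)
    return report

if __name__ == "__main__":
    # Two log lines copied from the post above.
    sample_log = (
        "Oct 14 23:23:32 Main kernel: [ 992.409859][ T3736] NVRM: Xid "
        "(PCI:0000:2c:00): 62, pid=3634, 0000(0000) 00000000 00000000\n"
        "Oct 14 23:23:32 Main kernel: [ 992.451903][ T3736] NVRM: Xid "
        "(PCI:0000:2c:00): 45, pid=3634, Ch 00000010\n"
    )
    # Hypothetical nvidia-smi output, NOT taken from the OP's rig.
    sample_smi = "0, 00000000:01:00.0, GPU A\n3, 00000000:2C:00.0, GPU B"
    print(match_gpus(sample_log, sample_smi))
```

A bus that shows up in the report with `"gpu": None` means the card was not in the list you passed in, which can itself be a hint that it has dropped off the bus. What each Xid code means is listed in NVIDIA's Xid documentation.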

Hi All,

Sorry for bumping an old thread, but I have a similar issue: sometimes the rig just freezes.
I notice that OP got some logs there. Could any kind soul guide me on how to reach the logs?

Should I activate the watchdog?

thanks!
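
On reaching the logs: the lines OP pasted are in standard syslog format. A rough Python sketch for pulling the interesting lines out of the system logs is below. The paths `/var/log/syslog` and `/var/log/kern.log` are an assumption (the usual locations on Debian/Ubuntu-style systems) and may differ on your rig, and the keyword list is just taken from the messages seen earlier in this thread.

```python
import re
from pathlib import Path

# Assumed log locations on a Debian/Ubuntu-style system; adjust for your rig.
DEFAULT_LOGS = ("/var/log/syslog", "/var/log/kern.log")

# Keywords taken from the log lines posted earlier in this thread.
INTERESTING = re.compile(r"NVRM|Xid|hive-watchdog.*BARK|sysrq", re.IGNORECASE)

def interesting_lines(paths=DEFAULT_LOGS, pattern=INTERESTING):
    """Yield (path, line) for every log line that matches the pattern."""
    for p in map(Path, paths):
        if not p.is_file():
            continue  # skip logs that do not exist on this system
        # Reading system logs may require root; errors="replace" avoids
        # failing on stray non-UTF-8 bytes.
        with p.open(errors="replace") as fh:
            for line in fh:
                if pattern.search(line):
                    yield str(p), line.rstrip("\n")

if __name__ == "__main__":
    for path, line in interesting_lines():
        print(f"{path}: {line}")
```

Run it from a shell on the rig after a crash and look at the last few matches before the freeze. Keep in mind that if the machine hangs hard, the final seconds may never have been written to disk.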
