After 2 weeks of smooth operation, "GPU driver error,no temps" appeared and caused a rig reset

Hello,

After 2 or 3 weeks of operation, the rig reset last night. According to the logs, the problem was caused by the sudden appearance of “GPU driver error,no temps”. However, checking the logs (the “payload” field), the temperatures seem to appear correctly in the payload. So the question is: is this a problem with payload parsing, or something else?

What is also important - it seems the miner stopped working after this error appeared (the last correct miner entry in the log was from 03:02:00, while “GPU driver error, no temps” appeared at 03:02:32). After “GPU driver error, no temps” occurred, the “miner_stats” field in the log became empty (“null”).

After 20 minutes this problem caused a rig reset, as the “cpu load” indicator kept climbing each minute until it reached >28.

My configuration is 4x 3060 Ti (all the same cards, bought at the same time). Very gentle overclock settings, the same on all 4 cards (Mem: 2800, Core: 1440, PL: 140 - they were working well for 2 weeks and the cards are quite cool), miner used: t-rex miner. HiveOS: 0.6-212@211130, NVIDIA driver: 470.86

[Sun Jan 30 03:02:32 CET 2022] > 
{
	"method":"message",
	"jsonrpc":"2.0",
	"id":0,"params":
	{
		"rig_id":"1111111",
		"passwd":"xxx",
		"type":"warning",
		"data":"GPU driver error,no temps",
		"payload":"
		\n04:00.0 Temp: 49C Fan: 65% Power: 128W
		\n05:00.0 Temp: 56C Fan: 65% Power: 127W
		\n0b:00.0 Temp: 52C Fan: 65% Power: 132W
		\n0c:00.0 Temp: 48C Fan: 65% Power: 121W
		"
	}
}

After the reset, the rig continued to work as before (without any problems).
Does anyone have any idea what happened last night?

Kind Regards!

Typically that error comes from overly aggressive overclocks. If it happens again, lower the memory clocks and see if that helps.

Hello Keaton,

Thank you for the reply. I forgot to mention - before asking the question here on the forum, I searched a lot for this problem and possible solutions. I already found that this may be an overclocking problem, but all the threads I found related to this error described continuous problems that appear quite quickly, after a few minutes of mining.

In my case, the rig was working perfectly for more than 2 weeks and the problem has occurred only once so far (last night).
Also, at the time “GPU driver error, no temps” appeared I saw the payload in the log file - it clearly shows correct temperature values from all 4 cards. That’s why I thought about a problem with payload parsing.
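
Just to sanity-check that theory, I parsed the payload string myself. This is only a quick sketch of mine (the regex and the printout are my own, not anything from HiveOS code), but it pulls out all four readings without trouble:

import re

# Quick check (my own sketch, not HiveOS code): pull the per-GPU readings out of the payload string.
payload = ("\n04:00.0 Temp: 49C Fan: 65% Power: 128W"
           "\n05:00.0 Temp: 56C Fan: 65% Power: 127W"
           "\n0b:00.0 Temp: 52C Fan: 65% Power: 132W"
           "\n0c:00.0 Temp: 48C Fan: 65% Power: 121W")

line_re = re.compile(r"([0-9a-f]{2}:[0-9a-f]{2}\.\d) Temp: (\d+)C Fan: (\d+)% Power: (\d+)W")
for bus_id, temp, fan, power in line_re.findall(payload):
    print(f"GPU {bus_id}: {temp}C, fan {fan}%, power {power}W")
# Prints sane values for all 4 cards, so the raw data in the payload looks fine.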

So the question is: why did HiveOS return “GPU driver error, no temps” while the payload contains correct temperature values?

I dug a little more into the system and kernel logs. What I discovered is an error coming from the NVIDIA driver (it first appeared at 03:01:54):

[ T1030] NVRM: GPU at PCI:0000:0c:00: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000010
[ T1030] NVRM: Xid (PCI:0000:0c:00): 62, pid=3889, 0000(0000) 00000000 00000000
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000010
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000011
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000012
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000013
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000014
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000015
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000016
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000017
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000011
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000012
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000013
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000014
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000015
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000016
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000017
[    C3] sched: RT throttling activated
[    C3] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    C3] clocksource:                       'acpi_pm' wd_now: 79289a wd_last: d68cc6 mask: ffffff
[    C3] clocksource:                       'tsc' cs_now: 1cfb670f0cf4 cs_last: 1cf930817b57 mask: ffffffffffffffff
[    C3] tsc: Marking TSC unstable due to clocksource watchdog
[    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/46-nvidia:1030]
[    C3] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper cec drm drm_panel_orientation_quirks cfbfillrect cfbimgblt cfbcopyarea fb_sys_fops syscopyarea sysfillrect sysimgblt fb fbdev intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul joydev input_leds crc32_pclmul ghash_clmulni_intel aesni_intel mei_me mei gpio_ich crypto_simd cryptd lpc_ich glue_helper serio_raw rapl intel_cstate mac_hid sch_fq_codel sunrpc droptcpsock(OE) ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage r8169 ahci libahci realtek
[    C3] CPU: 3 PID: 1030 Comm: irq/46-nvidia Tainted: P           OE     5.10.0-hiveos #83.hiveos.211201

After that, “watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/46-nvidia:1030]” appears cyclically every 22 seconds until the reboot, which took place 25 minutes later.

There are two Xid error codes, 45 and 62. 62 means “Internal micro-controller halt (newer drivers)”, which is understandable after some failure, but 45 appears first - “Preemptive cleanup, due to previous errors - Most likely to see when running multiple cuda applications and hitting a DBE”. This could indeed indicate a problem with the miner application, or, as suggested, something related to overclocking.

At the moment I only wonder whether HiveOS handles this problem correctly. Since the error occurred, it seems the NVIDIA driver stopped responding at all. So in this case there is no need to wait 25 minutes before the reset (especially since we don’t know the state and condition of the failed GPU). I think, for safety, a system reset should take place immediately after such an error appears (or after a minute or two to give the system a chance to handle it, to avoid an endless loop).

Could you please indicate what I should change (the configuration of which file) to get a faster system restart?
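
In the meantime I am considering a simple stopgap of my own - a rough sketch (reading `dmesg`, the NVRM Xid line pattern, and the plain `reboot` call are all my assumptions; this is not a HiveOS mechanism) that watches the kernel log for Xid errors and forces a reboot after a short grace period instead of waiting 25 minutes:

#!/usr/bin/env python3
# Rough sketch of a kernel-log watchdog (my own assumptions: reading `dmesg`,
# the NVRM Xid line format, and calling plain `reboot`). Not a HiveOS mechanism.
import re
import subprocess
import time

XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:[0-9a-f:.]+\): (\d+)")
CHECK_INTERVAL = 30      # seconds between checks
GRACE_SECONDS = 120      # short delay before reboot, as discussed above

def current_xids():
    """Return the set of Xid codes present in the kernel ring buffer since boot."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    return {int(m.group(1)) for m in XID_PATTERN.finditer(out)}

if __name__ == "__main__":
    while True:
        xids = current_xids()
        if xids:
            print(f"Detected Xid errors {sorted(xids)}, rebooting in {GRACE_SECONDS}s")
            time.sleep(GRACE_SECONDS)
            subprocess.run(["reboot"], check=False)
            break
        time.sleep(CHECK_INTERVAL)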

Did you reduce the memory clocks on this card?

Yes, of course. I reduced the memory overclock by 400 for the 0c:00 card. So far I haven’t seen the problem reproduce; however, please keep in mind that previously the crash appeared only after 2 weeks of continuous operation.

And what about a setting that allows a faster system restart than the default 25 minutes?
Greetings!
