Hello Keaton,
Thank you for the reply. I forgot to mention: before asking the question here on the forum, I searched a lot for this problem and possible solutions. I already found that it may be an overclocking problem, but all the threads I found related to this error described continuous problems that appear quite fast, after a few minutes of mining.
In my case the rig had been working perfectly for more than two weeks, and the problem has occurred only once so far (last night).
Also, at the same time the “GPU driver error, no temps” message appeared, I saw the payload in the log file, and it clearly contains correct temperature values from all 4 cards. That’s why I suspected a problem with payload parsing.
So the question is: why does HiveOS return “GPU driver error, no temps” while the payload contains correct temperature values?
I dug a little deeper into the system and kernel logs. What I discovered there is an error coming from the NVIDIA driver (it starts at 03:01:54):
[ T1030] NVRM: GPU at PCI:0000:0c:00: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000010
[ T1030] NVRM: Xid (PCI:0000:0c:00): 62, pid=3889, 0000(0000) 00000000 00000000
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000010
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000011
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000012
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000013
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000014
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000015
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000016
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000017
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000011
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000012
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000013
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000014
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000015
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000016
[ T1030] NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000017
[ C3] sched: RT throttling activated
[ C3] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ C3] clocksource: 'acpi_pm' wd_now: 79289a wd_last: d68cc6 mask: ffffff
[ C3] clocksource: 'tsc' cs_now: 1cfb670f0cf4 cs_last: 1cf930817b57 mask: ffffffffffffffff
[ C3] tsc: Marking TSC unstable due to clocksource watchdog
[ C3] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/46-nvidia:1030]
[ C3] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper cec drm drm_panel_orientation_quirks cfbfillrect cfbimgblt cfbcopyarea fb_sys_fops syscopyarea sysfillrect sysimgblt fb fbdev intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul joydev input_leds crc32_pclmul ghash_clmulni_intel aesni_intel mei_me mei gpio_ich crypto_simd cryptd lpc_ich glue_helper serio_raw rapl intel_cstate mac_hid sch_fq_codel sunrpc droptcpsock(OE) ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage r8169 ahci libahci realtek
[ C3] CPU: 3 PID: 1030 Comm: irq/46-nvidia Tainted: P OE 5.10.0-hiveos #83.hiveos.211201
After that, “watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [irq/46-nvidia:1030]” repeats every 22 seconds until the reboot, which took place 25 minutes later.
There are two Xid error codes, 45 and 62. Xid 62 means “Internal micro-controller halt (newer drivers)”, which is understandable after some failure, but Xid 45 appears first: “Preemptive cleanup, due to previous errors - Most likely to see when running multiple cuda applications and hitting a DBE”. This could indeed indicate a problem with the miner application or, as suggested, something related to overclocking.
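For reference, this is a rough Python sketch of how I pull these Xid events out of the kernel log (the log path is an assumption, on my rig the NVRM lines end up in kern.log; the two descriptions are just the ones quoted above):

```python
#!/usr/bin/env python3
"""Scan a kernel log for NVRM Xid events and summarize them.

Rough sketch only -- the default log path and the description table
are my own assumptions, not anything HiveOS ships.
"""
import re
import sys

# Assumed log location; pass another path as the first argument if needed.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"

# Only the two codes I actually saw; extend as needed.
XID_DESCRIPTIONS = {
    45: "Preemptive cleanup, due to previous errors",
    62: "Internal micro-controller halt (newer drivers)",
}

# Matches lines like: NVRM: Xid (PCI:0000:0c:00): 45, pid=3889, Ch 00000010
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:]+)\): (\d+)")

def main() -> None:
    counts = {}
    with open(LOG_PATH, errors="replace") as log:
        for line in log:
            match = XID_RE.search(line)
            if not match:
                continue
            pci, code = match.group(1), int(match.group(2))
            counts[(pci, code)] = counts.get((pci, code), 0) + 1

    for (pci, code), count in sorted(counts.items()):
        desc = XID_DESCRIPTIONS.get(code, "unknown code")
        print(f"{pci}  Xid {code}  x{count}  ({desc})")

if __name__ == "__main__":
    main()
```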
At the moment I only wonder whether HiveOS handles this problem correctly. Since the error occurred, it seems the NVIDIA driver stopped responding at all. So in this case there is no need to wait 25 minutes before the reset (especially since we don’t know the state and condition of the failed GPU). I think that, for safety, the system should be rebooted immediately after such an error appears (or after a minute or two, to let system services react and to avoid an endless loop).
Could you please tell me what I should change (which configuration file) to get a faster system restart?
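In the meantime, this is a rough sketch of the workaround behaviour I have in mind (not tested; the poll interval, the grace period and calling `reboot` directly are all my own assumptions, not anything HiveOS provides):

```python
#!/usr/bin/env python3
"""Reboot the rig shortly after an NVRM Xid error shows up in dmesg.

Sketch of the idea only: detect the Xid line, give system services a
minute or two, then reboot instead of waiting ~25 minutes.
"""
import re
import subprocess
import time

POLL_SECONDS = 30    # how often to check the kernel ring buffer
GRACE_SECONDS = 120  # give system services a minute or two, as suggested above

XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-f:]+\): \d+")

def xid_seen() -> bool:
    """Return True if any NVRM Xid line is present in the dmesg output."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return bool(XID_RE.search(out))

def main() -> None:
    while True:
        if xid_seen():
            print("Xid error detected, rebooting after grace period...")
            time.sleep(GRACE_SECONDS)
            subprocess.run(["reboot"])
            return
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```

It would have to run as root (both dmesg and reboot need it), but I would much prefer a built-in HiveOS setting over running my own watchdog like this.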