Hello everyone!
I need help identifying which card is having problems in one of my rigs. Summer is coming here and temps are rising. Over the last two days this rig has rebooted twice with these errors: “GPU driver error, no temps” and “GPU are lost, rebooting”
The log for the first error shows this:
Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=3208, 0000(0000) 00000000 00000000
Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=3208, Ch 00000010
00:02.0 Temp: 0C Fan: 0% Power: 0W
01:00.0 Temp: 56C Fan: 76% Power: 213W
02:00.0 Temp: 55C Fan: 66% Power: 113W
03:00.0 Temp: 53C Fan: 66% Power: 111W
04:00.0 Temp: 55C Fan: 66% Power: 111W
06:00.0 Temp: 50C Fan: 66% Power: 115W
07:00.0 Temp: 53C Fan: 66% Power: 111W
08:00.0 Temp: 52C Fan: 66% Power: 111W
09:00.0 Temp: 54C Fan: 66% Power: 123W
The log for the second error shows only the first two lines of the previous one.
I can't work out which card is having problems. The first one is a 3080 and the rest are 3070s. This is an 8-card rig, and I don't know which device this line refers to: “00:02.0 Temp: 0C Fan: 0% Power: 0W”
I hope someone can give me a hand here!
Thanks for your time. Regards!