Help me identify the card that is having trouble

Hello everyone!

I need help identifying which card is having problems in one of my rigs. Summer is coming here and temps are rising. Over the last two days this rig has rebooted twice with these errors: “GPU driver error, no temps” and “GPU are lost, rebooting”

The logs for the first error show this:

Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=3208, 0000(0000) 00000000 00000000
Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=3208, Ch 00000010
00:02.0 Temp: 0C Fan: 0% Power: 0W
01:00.0 Temp: 56C Fan: 76% Power: 213W
02:00.0 Temp: 55C Fan: 66% Power: 113W
03:00.0 Temp: 53C Fan: 66% Power: 111W
04:00.0 Temp: 55C Fan: 66% Power: 111W
06:00.0 Temp: 50C Fan: 66% Power: 115W
07:00.0 Temp: 53C Fan: 66% Power: 111W
08:00.0 Temp: 52C Fan: 66% Power: 111W
09:00.0 Temp: 54C Fan: 66% Power: 123W

The second error shows only the first two lines of the previous one.

I can't work out which card is having problems. The first one is a 3080 and the rest are 3070s. This is an 8-card rig, so I don't know which device this one is: “00:02.0 Temp: 0C Fan: 0% Power: 0W”

I hope someone can give me a hand here!

Thanks for your time, regards!

Look under each one of your cards and you will see the bus address.
The card with “0000:01:00” under the GPU # is the one throwing the error.
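If it helps, the same lookup can be scripted. This is only a rough sketch, assuming the log lines and the stats lines have exactly the format posted above; the function names `xid_events` and `faulting_cards` are made up for illustration and are not part of any mining OS or NVIDIA tool.

```python
import re

# Matches kernel lines like:
#   NVRM: Xid (PCI:0000:01:00): 62, pid=3208, ...
# Groups: PCI domain, bus, device, Xid code.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\): (\d+)"
)

def xid_events(log_text):
    """Return (bus_id, xid_code) pairs, with bus_id shaped like '01:00'."""
    events = []
    for match in XID_RE.finditer(log_text):
        _domain, bus, device, code = match.groups()
        events.append((f"{bus}:{device}", int(code)))
    return events

def faulting_cards(log_text, stats_lines):
    """Return the stats lines whose bus address appears in an Xid error."""
    bad = {bus_id for bus_id, _code in xid_events(log_text)}
    # A stats line starts with e.g. '01:00.0', so drop the '.0' function suffix.
    return [line for line in stats_lines if line.split(".")[0] in bad]

# Sample data copied from the post above.
log_text = (
    "Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 62, "
    "pid=3208, 0000(0000) 00000000 00000000\n"
    "Dec 06 18:46:17 Ultimate_Rig kernel: NVRM: Xid (PCI:0000:01:00): 45, "
    "pid=3208, Ch 00000010\n"
)
stats_lines = [
    "00:02.0 Temp: 0C Fan: 0% Power: 0W",
    "01:00.0 Temp: 56C Fan: 76% Power: 213W",
    "02:00.0 Temp: 55C Fan: 66% Power: 113W",
]

for line in faulting_cards(log_text, stats_lines):
    print("Card reporting Xid errors:", line)
```

With the lines from this thread it picks out the 01:00.0 entry, which is the card drawing 213W.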

Thank you very much for your help. It's time to do some maintenance :slight_smile:

Regards!
