GPU Driver Error, No Temps Error

falsealarm · January 9, 2022, 7:33pm

One of my rigs reboots about once every two days due to “GPU driver error, no temps”. When I click on the error, though, I see only the following, with no additional information. I understand from other posts that this could be the memory overclock on one of the cards, a power delivery issue, or even ghost settings (requiring a re-image of the drive). I’d like to figure out which card is causing it, though. Is there a way to do that?

01:00.0 Temp: 61C Fan: 63% Power: 115W
0c:00.0 Temp: 50C Fan: 69% Power: 262W
0d:00.0 Temp: 44C Fan: 30% Power: 109W
0e:00.0 Temp: 55C Fan: 53% Power: 260W

keaton_hiveon · January 10, 2022, 12:01am

Can you post a screenshot of your worker overview screen?

falsealarm · January 10, 2022, 5:14am

Not much to go on in the screenshot, unfortunately.

keaton_hiveon · January 10, 2022, 5:48am

Try setting the fans on the 3080 Tis to 100% and reducing the memory clocks slightly on those too.

falsealarm · January 11, 2022, 7:17pm

Not having much luck with the memory clock changes. The rig is now rebooting more often. Is there a way to look at the logs to identify at least which card is being reported for this error? It’s not available in the GUI, and I had no luck finding it in the logs under /var/log.
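One place a failing NVIDIA card may show up is the kernel log, where the driver writes "Xid" messages that include the PCI address of the card. The following is a minimal sketch, assuming the kernel log from before the reboot is still readable on the rig (whether it survives a reboot on this image is an assumption, and the log path in the usage note is only the usual location on Ubuntu-based systems). The function names are made up for illustration.

```python
import re
from collections import Counter

# NVIDIA driver Xid lines look like:
#   NVRM: Xid (PCI:0000:0c:00): 79, pid=..., GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(lines):
    """Return a list of (pci_address, xid_code) tuples found in log lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1).lower(), int(match.group(2))))
    return events


def count_by_card(events):
    """Count Xid events per PCI address to see which card is reported most."""
    return Counter(pci for pci, _code in events)
```

Possible usage, with the path as an assumption: `count_by_card(find_xid_events(open("/var/log/kern.log", errors="replace")))`. The PCI address in the result (for example `0000:0c:00`) corresponds to the bus IDs shown in the first post (`0c:00.0`).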

keaton_hiveon · January 11, 2022, 7:36pm

Reduce the memory clocks by a lot and see if it’s stable; that’s almost always the issue.

falsealarm · January 11, 2022, 8:11pm

Thanks for linking to the article. I’ll reduce the memory clock settings even further. If I had to suspect any one card, it would be the MSI 3080 Ti. Another card of the same model gave me issues on a separate rig but now runs fine there, on settings that are possibly not working on this rig.

falsealarm · January 11, 2022, 8:39pm

Interestingly, in all other instances of similar errors posted in this forum, users are able to identify the card by its bus ID, as the error screen clearly shows a “0” temperature for the card in question. I don’t see that in the GUI. Is that because I am running NVIDIA cards, and the NVIDIA drivers on Linux don’t report tjunction temperatures?
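The "look for the 0 temperature" check described above can be sketched in a few lines. This assumes the error text uses the same per-card format as the four lines in the first post, and that a dropped card is printed with a temperature of 0; both are assumptions, and the function names are hypothetical.

```python
import re

# Matches lines such as: "0c:00.0 Temp: 50C Fan: 69% Power: 262W"
STATUS_RE = re.compile(
    r"^\s*(?P<bus>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+"
    r"Temp:\s*(?P<temp>\d+)C\s+"
    r"Fan:\s*(?P<fan>\d+)%\s+"
    r"Power:\s*(?P<power>\d+)W",
    re.IGNORECASE,
)


def parse_status(text):
    """Parse per-card status lines into dicts keyed by bus, temp, fan, power."""
    cards = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            cards.append({
                "bus": match.group("bus").lower(),
                "temp": int(match.group("temp")),
                "fan": int(match.group("fan")),
                "power": int(match.group("power")),
            })
    return cards


def zero_temp_cards(cards):
    """Return the bus IDs of cards reporting a temperature of 0."""
    return [card["bus"] for card in cards if card["temp"] == 0]
```

Run against the four lines from the first post, `zero_temp_cards` returns an empty list, which matches the observation that no card stands out on that error screen.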

keaton_hiveon · January 12, 2022, 1:07am

It just depends on how it crashes.
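Since the error screen may or may not single out a card depending on how the crash happens, one option is to record per-card readings continuously so that the last entries before a reboot show which card stopped reporting. Below is a rough sketch that polls `nvidia-smi` using its standard `--query-gpu` fields. The log path and function names are hypothetical, and how `nvidia-smi` behaves once a card has already dropped (it may hang or print only an error) is not guaranteed.

```python
import csv
import io
import os
import subprocess
from datetime import datetime

QUERY_FIELDS = "pci.bus_id,temperature.gpu,fan.speed,power.draw"


def read_smi():
    """Run nvidia-smi once and return (stdout, stderr) as text."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout, result.stderr


def _to_float(value):
    """Convert a field to float; placeholders such as "[N/A]" become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_smi_csv(text):
    """Turn nvidia-smi CSV rows into one dict per card."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or malformed lines
        bus_id, temp, fan, power = (field.strip() for field in rec)
        rows.append({
            "bus_id": bus_id,
            "temp_c": _to_float(temp),
            "fan_pct": _to_float(fan),
            "power_w": _to_float(power),
        })
    return rows


def suspect_cards(rows):
    """Bus IDs whose temperature is unreadable (None) or zero."""
    return [row["bus_id"] for row in rows if not row["temp_c"]]


def log_once(path):
    """Append one timestamped snapshot to the file at `path`."""
    try:
        stdout, stderr = read_smi()
    except subprocess.TimeoutExpired:
        stdout, stderr = "", "nvidia-smi timed out"
    rows = parse_smi_csv(stdout)
    with open(path, "a") as handle:
        handle.write(f"{datetime.now().isoformat()} rows={rows} "
                     f"suspects={suspect_cards(rows)} err={stderr.strip()!r}\n")
        handle.flush()
        os.fsync(handle.fileno())  # push to disk before a possible reboot
```

Calling `log_once("/home/user/gpu_watch.log")` every few seconds (from a loop or a timer; the path is only an example) leaves a trail to read after the next reboot.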

4em6epc · March 14, 2022, 1:47pm

The rig worked fine all night, but an error began to appear during the day. The video cards were not mining, although they were online. Sometimes the rig just goes offline.

keaton_hiveon · March 16, 2022, 4:28am

Don’t use core offsets on 30 series cards. Locked core clocks will use less power and be more stable.
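For reference, locking the core clock (as opposed to applying an offset) corresponds to the `nvidia-smi` options `-lgc` (lock GPU clocks) and `-rgc` (reset them). The sketch below only builds the command lines and does not run them. The helper names and the 1500 MHz figure in the usage note are illustrative assumptions, not recommended settings, and on a managed rig the overclocking settings in the management UI would normally apply the lock instead.

```python
def lock_core_clock_cmd(gpu_index, mhz):
    """Build the nvidia-smi command that pins the core clock of one GPU."""
    if gpu_index < 0 or mhz <= 0:
        raise ValueError("gpu_index must be >= 0 and mhz must be positive")
    # -lgc takes "min,max"; using the same value for both pins the clock.
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{mhz},{mhz}"]


def reset_core_clock_cmd(gpu_index):
    """Build the command that removes a previously applied core clock lock."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be >= 0")
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]
```

For example, `lock_core_clock_cmd(1, 1500)` yields `["nvidia-smi", "-i", "1", "-lgc", "1500,1500"]`, which could be passed to `subprocess.run` with root privileges.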
