Excessive GPU temperature rise (ETH)

Hello, I ran into a problem today starting at 9 am. I have a 3090 rig and a 3080 rig, and each contains exactly one Palit GameRock card. Since 9 this morning, that one card in each rig has been affected. I checked the MH/s fluctuations of both systems in HiveOS, and both cards are getting ridiculously hot at the same time. The Palit GamingPro cards do not have this problem.

Two different rigs.
I updated HiveOS, upgraded T-Rex to the latest version, and installed NVIDIA driver 510.60.02.
The GPU core is still running hot, but the 3080 dropped to 58
and the 3090 dropped to 69 degrees.
Both were over 80 before.
I think this is a software problem, but I haven't been able to solve it.

VRAM temperatures of both cards are below 100.
Current readings while mining (core / VRAM):
3080: 60 / 92
3090: 70 / 96
The 3090's VRAM was running cooler, but when the GPU core reached 70 it went up a bit.
Since 9 am, though, both cards start rising in temperature at the same time:
the same fluctuation pattern, the same card model, in different rigs.
I had never had problems with the Palit GameRock cards before.
If this were a thermal paste problem on the GPU, it would be absurd for the same model card in two different rigs to start doing this at the same moment. The 3080 seems almost fixed after the HiveOS, miner, and NVIDIA updates, but the 3090 is still running at 70.
It looks like a software problem.
If you have any thoughts or ideas, I'd appreciate your comments.

3080

3090

I think the problem is with HiveOS.
These are two different rigs, and both sit in front of an air conditioner.
The air conditioners did not shut off at any point.
The MSI cards show something similar, but nowhere near this much.
At 9 o'clock, in two independent rigs, only these two cards of the same model did this.

After updating HiveOS, T-Rex, and the NVIDIA driver, the temperature dropped 10 degrees, but it still hasn't returned to the operating temperatures from before 9 am.
The 3080's temperature dropped further.

Are you on the latest stable image/kernel? Using locked core clocks?

I'm using 510.60.02, and yes, the core clock is locked.
1100 core, 2200 mem, 315
I've been using this card for a long time, and it has never given me problems. It happened again at 23:30 for 10 minutes and then recovered on its own, but the temperatures still haven't returned to where they were before 9:10 this morning.
I contacted HiveOS live support. At first they said the problem was on my side and nothing was wrong on theirs. When it happened again at 23:30, they said, "Yes, this is interesting." The chart shows 80+, but the card temperature said 60. Even so, it raises the temperature of the other cards by 3-4 degrees, and the hashrate (MH/s) drops.

What kernel are you on? Latest?

HiveOS was on an old version. When the temperature rose, I updated it to the latest version, switched the NVIDIA driver to 510.60.02, and upgraded T-Rex to the latest version.
As soon as I did this, the temperature dropped, but it still runs 5-10 degrees higher than before 9:10 am.
It happened again at 23:30, lasted 10 minutes, and then came down by itself. HiveOS live support gave the answer above, and the report has been forwarded to the developer team.

What kernel are you on? Can you post a screenshot of your worker overview screen? If it's not #110, flash the latest stable image.

I'm on this version: 5.4.0-hiveos #140

On the 3090 rig, only “5.4.80-hiveos” is written.
On the 3080 rig, it says “5.4.0-hiveos #140”.

So those are both about a year old. The first rule when troubleshooting any kind of issue is to make sure everything is up to date, as there's a good chance the issue you're having was already fixed in the last year.

Which version should I install? I just updated to 5.10.0-hiveos #110.
But it still hasn't come back down to the temperature from before 9:10 yesterday morning. At 9:10 yesterday, in two different rigs, the same brand and model of card (one a 3080, the other a 3090) reacted at the same moment and went to 80+. It did it again at 23:30 last night, lasting 10 minutes. I upgraded to the kernel version above, the stable one in the list.

#110 is what you want. Are you running Autofan? Can you post a screenshot of your current worker overview screen?

The system still runs a little hotter than before 9:10 yesterday. If it doesn't jump to 80+ for no reason during the day, I'll consider the problem somewhat improved.
The card I specifically mentioned is GPU 3.

The rig had no such problem until 9:10 yesterday morning; it was running very comfortably. Even after I updated everything following the unexplained jump to 80+, the temperatures did not return to normal.

I am using manual fan setting.

Looks like you have a high ambient temp, and almost all your cards need their thermal pads upgraded as well.

If you had 2 separate rigs with temp issues at the same time, I'm assuming they're in the same room, and your exhaust/ventilation isn't keeping up, or a fan is shutting off, etc.? This isn't a software issue causing them to heat up; it's a heat issue.

Before I did any of the updates, everything was fine until 9:10 yesterday morning.
The air conditioners and fans are working, and I've been monitoring the ambient temperature yesterday and the day before. The two rigs are in separate rooms, and the same brand of card in each went to 80+ in the same minute.
In each rig, only that one card went to 80+.
I made the updates, except for the kernel, at 16:30, and the temperatures on those two cards dropped by 20-25 degrees, but at 23:30 they went to 80+ again. The ambient temperatures, air conditioners, and fans are working without interruption; there is no problem there at all. Now I've also updated the kernel; hopefully it won't hit 80+ again.

Post the power draw and fan speed graphs from the time of the higher temps. If the power draw doesn't dramatically increase, or the fan speed doesn't decrease, it's a local temp issue.
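The check suggested above can be sketched in Python. This is a minimal sketch, assuming the samples were logged with `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,fan.speed --format=csv,noheader,nounits` (one CSV row per GPU per poll). The function names, the sample values, and the thresholds (10 W, 5 %) are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    index: int      # GPU index as reported by nvidia-smi
    temp_c: float   # temperature.gpu
    power_w: float  # power.draw
    fan_pct: float  # fan.speed


def parse_line(line: str) -> Sample:
    """Parse one row of `--format=csv,noheader,nounits` output."""
    idx, temp, power, fan = (field.strip() for field in line.split(","))
    return Sample(int(idx), float(temp), float(power), float(fan))


def classify_spike(baseline, spike, power_jump_w=10.0, fan_drop_pct=5.0):
    """Compare average readings before and during a temperature spike.

    Mirrors the rule of thumb above: if power did not jump and the fan
    did not slow down, the extra heat is attributed to the surroundings.
    Thresholds are hypothetical; tune them for your own rig.
    """
    d_temp = mean(s.temp_c for s in spike) - mean(s.temp_c for s in baseline)
    d_power = mean(s.power_w for s in spike) - mean(s.power_w for s in baseline)
    d_fan = mean(s.fan_pct for s in spike) - mean(s.fan_pct for s in baseline)
    if d_temp <= 0:
        return "no spike"
    if d_power >= power_jump_w:
        return "power draw increased"
    if d_fan <= -fan_drop_pct:
        return "fan speed decreased"
    return "local temp issue"


# Hypothetical readings for GPU 3 before and during a spike.
baseline = [parse_line(l) for l in ["3, 58, 300.5, 70", "3, 59, 301.0, 70"]]
spike = [parse_line(l) for l in ["3, 82, 290.2, 70", "3, 83, 288.9, 70"]]
print(classify_spike(baseline, spike))  # local temp issue
```

Here the temperature rises while power falls slightly and the fan holds steady, so the sketch lands on the "local temp issue" branch. A power drop alongside a temperature rise is also what thermal throttling looks like, so the result is only a hint about where to look next.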

Would the local temperature affect only 1 card in each of two different rigs? I had no problems with this model card until yesterday; that would be unheard of. At 23:30 both cards were at 80+: did the local temperature rise and then fall back within 10 minutes? Or did it go up at 9:10 just after I did the updates? I guess the update is affecting the local temperature (:

This card was running at 50 degrees before 9:10 on June 16. Then this problem appeared: after 9:10 it ran at 80+, and after I made the updates it started running at 60+.

Try turning off the AC or exhaust vent or whatever you have running in that room, and I bet that graph will look identical again. Whichever card gets the most heat from the others will heat up the most.

Power draw decreased because of thermal throttling. I'm still going to vote that this isn't a software issue but a local temp issue.

What's your ambient temp? 60C on the core for my 3090s means around 90F ambient. Is your room that hot with the AC on?
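The comparison above mixes Fahrenheit and Celsius. A minimal sketch that puts it on one scale: convert the ambient reading to Celsius and compute how far the core sits above the room. The "rise over ambient" framing and the function names are assumptions for illustration, not a metric HiveOS reports.

```python
def f_to_c(temp_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def rise_over_ambient(core_c: float, ambient_c: float) -> float:
    """Degrees the GPU core runs above room temperature."""
    return core_c - ambient_c


ambient_c = f_to_c(90.0)                          # the replier's room, ~90F
print(round(ambient_c, 1))                        # 32.2
print(round(rise_over_ambient(60.0, ambient_c), 1))  # 27.8
```

If the ambient reading really stays constant while a core jumps from around 60 to 80+, its rise over ambient has grown by the same amount, which makes it a useful quantity to graph next to power draw and fan speed.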

There are 3 air conditioners and 4 industrial fans. I continuously check whether the ambient temperature has changed since the update: it is the same, the air conditioners are working, and the fans are working. There is no problem there. What is the logic in it going over 80+ and then dropping after updating?