Hi all,
I can’t get my RTX 3090s to run stably on HiveOS.
GPU: 2 × RTX 3090 (premium brands), no thermal issues
Motherboard: H61M-P20 (G3) (MS-7788) MSI (V1.6 04/17/2012)
CPU: 2 × Intel® Celeron® CPU G1620 @ 2.70GHz
Disk: USB 2.0
Ambient temp: ~20-25C
Card temps: 52C max, usually 40-49C
Software: HiveOS, ethermine, t-rex (latest version)
On Windows, the rig ran stably for 5 days with the memory junction temperature never exceeding 94C, using the following overclocks (all set in MSI Afterburner): core -200, memory +1200, fans 100%. t-rex reported a 273-299W power limit, the cards sat at 48C, and ambient was 28-29C.
When I switched to HiveOS with the equivalent overclocks (C-200, M2400, F100%, PL300W), the rig started crashing: LA (load average) climbs and the rig stops hashing, but it still draws power at the wall. Very frustrating.
Sometimes it recovers after a reboot, sometimes not.
The t-rex log on HiveOS shows the following message:
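For reference, this is roughly how I translated the Afterburner values to HiveOS. It is only a sketch, assuming the usual rule of thumb that the memory offset on HiveOS/Linux is double the Afterburner value (which is how 1200 became 2400). The function name is made up for illustration and is not part of any HiveOS tool.

```python
def afterburner_to_hiveos(core_offset, mem_offset, power_limit_w):
    """Convert Windows (MSI Afterburner) overclocks to HiveOS-style values.

    Assumption: the memory offset is doubled on Linux, while the core
    offset and the power limit carry over unchanged.
    """
    return {
        "core": core_offset,       # C: core clock offset in MHz
        "mem": mem_offset * 2,     # M: memory offset, doubled (assumed rule)
        "pl": power_limit_w,       # PL: power limit in watts
    }


settings = afterburner_to_hiveos(core_offset=-200, mem_offset=1200, power_limit_w=300)
print(settings)
```

If that doubling rule does not apply to these cards, then my M2400 is not really the same overclock I ran on Windows, so please correct me.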
“t-rex exited (exitcode=0), waiting to cooldown a bit”
But the cards are below 50C before and during the crash!
Other possibly relevant log lines:
“TREX: Can’t initialize device [ID=1, GPU #1], cuda exception in [initialize_device, 96], CUDA_ERROR_UNKNOWN”
“WARN: GPU #0(000100): ASUS GeForce RTX 3090, intensity 22”
“WARN: Built-in watchdog has been disabled!”
“WARN: NVML: can’t get fan speed for GPU #1, error code 999”
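To catch what happens right before a crash, I could log the load average next to the GPU readings. Below is a rough sketch, assuming `nvidia-smi` is on the PATH. The helper names (`parse_gpu_line`, `sample`, `watch`) are my own and not part of HiveOS or t-rex.

```python
import os
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw,fan.speed"


def parse_gpu_line(line):
    """Parse one CSV row from nvidia-smi; unreadable fields become None."""
    def num(text):
        try:
            return float(text)
        except ValueError:
            return None  # e.g. "[N/A]" when the driver cannot read a value

    idx, temp, power, fan = [part.strip() for part in line.split(",")]
    return {"gpu": int(idx), "temp_c": num(temp),
            "power_w": num(power), "fan_pct": num(fan)}


def sample():
    """Return the 1-minute load average and one dict per GPU that responded."""
    load1, _, _ = os.getloadavg()
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10,
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        out = ""  # nvidia-smi missing or hung; log the load average anyway
    gpus = [parse_gpu_line(l) for l in out.splitlines() if l.count(",") == 3]
    return load1, gpus


def watch(interval_s=10):
    """Print one sample every interval_s seconds until interrupted."""
    while True:
        load1, gpus = sample()
        print(time.strftime("%H:%M:%S"), f"LA={load1:.2f}", gpus, flush=True)
        time.sleep(interval_s)
```

Running `watch()` in a separate shell and redirecting the output to a file should show whether a GPU stops reporting (as in the fan speed error above) before or after the LA starts climbing.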
I have tried many different overclock settings looking for a fix. Even hashing at 80% of potential, the rig still crashes with the same LA increase.
Any tips from admins?