Nvidia-smi and catching bad risers

I just wanted to give a quick feature suggestion.

HiveOS already monitors a lot of parameters, but coming from other Linux distributions I always go back to nvidia-smi to diagnose certain issues. For example, a bad riser can be found rather quickly using

nvidia-smi dmon -c 1 -s e

which takes a single sample (-c 1) of the error counters (-s e) and prints a short list of the cards detected by nvidia-smi, including a pci errs column.

If that column isn’t all zeros, you either have a faulty riser or, like me once, you forgot to switch PCIe from 3.0 to 2.0 :wink:
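For anyone who wants to automate the check, here is a rough Python sketch of the idea. It assumes the dmon output has a first header row naming the columns (including gpu and pci) followed by a units row, as in the SAMPLE text below, which is illustrative and not taken from a real rig. The function names are made up for this example, and the layout may differ between driver versions.

```python
import subprocess

# Illustrative output only; the exact layout is an assumption and
# may vary between driver versions.
SAMPLE = """\
# gpu  sbecc  dbecc    pci
# Idx   errs   errs   errs
    0      -      -      0
    1      -      -     37
    2      -      -      0
"""


def parse_pci_errors(dmon_text):
    """Return {gpu_index: pci_error_count} parsed from dmon text."""
    columns = None
    errors = {}
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            names = line.lstrip("#").split()
            # Only the header row that names the columns contains "pci";
            # the units row ("Idx errs errs errs") is skipped.
            if columns is None and "pci" in names:
                columns = names
            continue
        if columns is None:
            continue  # data before any recognised header, ignore it
        fields = line.split()
        if len(fields) != len(columns):
            continue  # unexpected row shape, ignore it
        row = dict(zip(columns, fields))
        try:
            errors[int(row["gpu"])] = int(row["pci"])
        except (KeyError, ValueError):
            continue  # e.g. "-" when a counter is not reported
    return errors


def gpus_with_pci_errors(dmon_text):
    """Return the sorted GPU indices whose pci error count is above zero."""
    counts = parse_pci_errors(dmon_text)
    return sorted(gpu for gpu, count in counts.items() if count > 0)


def read_dmon():
    """Run the command from the post once and return its raw output."""
    result = subprocess.run(
        ["nvidia-smi", "dmon", "-c", "1", "-s", "e"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a rig you would call gpus_with_pci_errors(read_dmon()) and look at the risers of whatever cards it lists. With the SAMPLE text above it returns [1].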

It might be worth adding this counter to the card overview, similar to the invalid share counter, and showing it only when it is not zero. It could even have a mouse-over hint pointing at a bad riser or a wrong PCIe speed.


This command was a godsend for me.
Could you please share something similar for AMD cards as well, since I have mixed rigs?
