As previously documented on this site, I use Nagios extensively. I’ve used the check_ipmi_sensor plugin for a while now, but I’ve had problems with it on a Supermicro box running CentOS 6.6.
I’d regularly get hit with failures caused by the IPMI interface failing to return full payloads, with the FreeIPMI tools frequently bailing out halfway through execution and reporting an “internal IPMI error”. Here’s an example of running check_ipmi_sensor from the command line:
# ./check_ipmi_sensor -H localhost -fc 5 -v
ID | Name | Type | State | Reading | Units | Event
4 | CPU1 Temp | Temperature | Nominal | 38.00 | C | 'OK'
71 | CPU2 Temp | Temperature | Nominal | 40.00 | C | 'OK'
138 | PCH Temp | Temperature | Nominal | 31.00 | C | 'OK'
205 | System Temp | Temperature | Nominal | 26.00 | C | 'OK'
272 | Peripheral Temp | Temperature | Nominal | 41.00 | C | 'OK'
339 | Vcpu1VRM Temp | Temperature | Nominal | 33.00 | C | 'OK'
406 | Vcpu2VRM Temp | Temperature | Nominal | 39.00 | C | 'OK'
473 | VmemABVRM Temp | Temperature | Nominal | 29.00 | C | 'OK'
540 | VmemCDVRM Temp | Temperature | Nominal | 26.00 | C | 'OK'
607 | VmemEFVRM Temp | Temperature | Nominal | 38.00 | C | 'OK'
674 | VmemGHVRM Temp | Temperature | Nominal | 32.00 | C | 'OK'
741 | P1-DIMMA1 Temp | Temperature | Nominal | 27.00 | C | 'OK'
808 | P1-DIMMB1 Temp | Temperature | Nominal | 27.00 | C | 'OK'
875 | P1-DIMMC1 Temp | Temperature | Nominal | 27.00 | C | 'OK'
942 | P1-DIMMD1 Temp | Temperature | Nominal | 26.00 | C | 'OK'
1009 | P2-DIMME1 Temp | Temperature | Nominal | 28.00 | C | 'OK'
1076 | P2-DIMMF1 Temp | Temperature | Nominal | 29.00 | C | 'OK'
1143 | P2-DIMMG1 Temp | Temperature | Nominal | 28.00 | C | 'OK'
1210 | P2-DIMMH1 Temp | Temperature | Nominal | 29.00 | C | 'OK'
1411 | FAN3 | Fan | Nominal | 6500.00 | RPM | 'OK'
1478 | FAN4 | Fan | Nominal | 6400.00 | RPM | 'OK'
ipmi_sensor_read: internal IPMI error
-> Execution of /usr/sbin/ipmi-sensors failed with return code 1.
-> /usr/sbin/ipmi-sensors was executed with the following parameters:
sudo /usr/sbin/ipmi-sensors --quiet-cache --sdr-cache-recreate --interpret-oem-data --output-sensor-state --ignore-not-available-sensors
This obviously isn’t ideal, and I was seeing it every 10-30 minutes when Nagios ran its regular checks. At first glance it looked like it could be caused by defective hardware, but fortunately that wasn’t the case: it was just my checks colliding with kipmid.
CentOS 6 builds ipmi_si into the kernel by default, so kipmid starts on boot and begins polling any detected IPMI devices (you’ll probably see [kipmi0] in your process list). Given that I’m configuring Nagios for monitoring, I’ve no interest in the kernel helper polling my IPMI interface and tying it up, and if you’re reading this and you have a kipmid thread running, chances are neither do you. Alternatively, you may be reading this because your kipmi0 process is claiming to use 100% of a CPU core, perhaps because you did a BMC cold reset or similar. In that case this is probably useful for you too, even if you want to keep kipmid running.
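If you want a quick way to see whether a kipmid thread is present and how much CPU it is using, something like the sketch below should do. The function name list_kipmi_threads is my own invention for illustration; it just filters ps output, relying on kernel threads appearing in the comm column without the square brackets.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): filter "pid pcpu comm"
# lines down to kipmid threads. It reads from stdin so it can also be
# tried against saved ps output.
list_kipmi_threads() {
    awk '$3 ~ /^kipmi[0-9]+$/ { printf "%s pid=%s cpu=%s%%\n", $3, $1, $2 }'
}

# Typical use on a live system (no root needed):
#   ps -eo pid=,pcpu=,comm= | list_kipmi_threads
```

No output means no kipmid thread was found in whatever was fed in.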
You can stop kipmid in its tracks by hot-removing the IPMI device from its list of detected devices. First, get the parameters for the detected device like this:
# cat /proc/ipmi/0/params
kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0
Then, take the output from the above, prefix it with “remove,” and use /sys/module/ipmi_si/parameters/hotmod to remove the device:
# echo "remove,kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0" > /sys/module/ipmi_si/parameters/hotmod
The kipmid thread will be cleaned up immediately, without requiring a reboot. FreeIPMI’s various utilities do not use ipmi_si/kipmid and will continue to work just fine.
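If you have several boxes to do this on, the two manual steps can be rolled into one. Here is a minimal sketch, assuming the interface is number 0 as in my case. The function name remove_ipmi_interface and the PARAMS_FILE and HOTMOD_FILE variables are hypothetical; the paths are variables only so that the function can be dry-run against ordinary files first.

```shell
#!/bin/sh
# Assumed defaults: interface 0, as on my box. Override to dry-run.
PARAMS_FILE=${PARAMS_FILE:-/proc/ipmi/0/params}
HOTMOD_FILE=${HOTMOD_FILE:-/sys/module/ipmi_si/parameters/hotmod}

# Hypothetical helper: read the interface parameters and ask ipmi_si
# to remove that interface, which also stops its kipmid thread.
remove_ipmi_interface() {
    params=$(cat "$PARAMS_FILE") || return 1
    if [ -z "$params" ]; then
        echo "no parameters found in $PARAMS_FILE" >&2
        return 1
    fi
    # Same string as the manual step: "remove," followed by the params line.
    printf 'remove,%s\n' "$params" > "$HOTMOD_FILE"
}
```

On a real system, run remove_ipmi_interface as root. Pointing the two variables at scratch files first is a cheap way to check the string it would write.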
If all you wanted to do was restart kipmid for some reason, you could then re-add the device by instead prefixing with “add,”:
# echo "add,kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0" > /sys/module/ipmi_si/parameters/hotmod
Whereupon kipmi0 will resume.
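For the restart case, the remove and add steps can be chained together. This is only a sketch: restart_kipmid, PARAMS_FILE, HOTMOD_FILE and RESTART_DELAY are names I made up, and the one-second pause between the two writes is a guess on my part rather than a documented requirement. It reads the parameters before removing anything, on the assumption that the params entry may no longer be there once the interface has been removed.

```shell
#!/bin/sh
# Assumed defaults: interface 0, as on my box. Override to dry-run.
PARAMS_FILE=${PARAMS_FILE:-/proc/ipmi/0/params}
HOTMOD_FILE=${HOTMOD_FILE:-/sys/module/ipmi_si/parameters/hotmod}

# Hypothetical helper: remove the interface, then add it back with the
# same parameters so that kipmid is started afresh.
restart_kipmid() {
    # Capture the parameters first (assumption: they may not be
    # readable after the interface has been removed).
    params=$(cat "$PARAMS_FILE") || return 1
    [ -n "$params" ] || return 1

    printf 'remove,%s\n' "$params" > "$HOTMOD_FILE" || return 1
    # Arbitrary short pause between the two writes (my guess, tunable).
    sleep "${RESTART_DELAY:-1}"
    printf 'add,%s\n' "$params" > "$HOTMOD_FILE"
}
```

As with the manual commands, this needs to run as root on a real system.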
If stopping kipmid fixes your issue (it provided immediate relief in my case), make sure it stays gone by adding the following to your kernel options in /boot/grub/menu.lst (or wherever your bootloader configuration lives, if you’re not using the default grub environment):
ipmi_si.force_kipmid=0
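After the next reboot it’s worth confirming that the option actually made it onto the kernel command line. A small sketch for that follows; kipmid_disabled_at_boot is a hypothetical name, and the optional argument exists only so the check can be tried against a file other than /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line (default
# /proc/cmdline) contains ipmi_si.force_kipmid=0 as a whole option.
kipmid_disabled_at_boot() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each option on its own line, then match the full line exactly.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'ipmi_si\.force_kipmid=0'
}

# Typical use:
#   kipmid_disabled_at_boot && echo "force_kipmid=0 is set"
```

This only checks that the option was passed at boot; whether a kipmi0 thread is running is a separate question for the process list.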
Weirdly, kipmid doesn’t seem to cause problems with Ubuntu 14.04 (on a Dell R200) or Debian 8 (on a different Supermicro board), so perhaps this is a CentOS-specific issue.