Tag Archives: Nagios

Internal IPMI error / Stopping kipmi0

As previously documented on this site, I use Nagios extensively. I’ve used the check_ipmi_sensor plugin for a while now, but have had problems on a Centos 6.6 Supermicro box that I had installed it on.

I’d regularly get hit with failures caused by the IPMI failing to return full payloads, and the freeipmi tools frequently dumping out halfway through execution stating that there was an “Internal IPMI Error” – here’s an example running check_ipmi_sensor from the command line:

# ./check_ipmi_sensor -H localhost -fc 5 -v
ID   | Name            | Type              | State    | Reading    | Units | Event
4    | CPU1 Temp       | Temperature       | Nominal  | 38.00      | C     | 'OK'
71   | CPU2 Temp       | Temperature       | Nominal  | 40.00      | C     | 'OK'
138  | PCH Temp        | Temperature       | Nominal  | 31.00      | C     | 'OK'
205  | System Temp     | Temperature       | Nominal  | 26.00      | C     | 'OK'
272  | Peripheral Temp | Temperature       | Nominal  | 41.00      | C     | 'OK'
339  | Vcpu1VRM Temp   | Temperature       | Nominal  | 33.00      | C     | 'OK'
406  | Vcpu2VRM Temp   | Temperature       | Nominal  | 39.00      | C     | 'OK'
473  | VmemABVRM Temp  | Temperature       | Nominal  | 29.00      | C     | 'OK'
540  | VmemCDVRM Temp  | Temperature       | Nominal  | 26.00      | C     | 'OK'
607  | VmemEFVRM Temp  | Temperature       | Nominal  | 38.00      | C     | 'OK'
674  | VmemGHVRM Temp  | Temperature       | Nominal  | 32.00      | C     | 'OK'
741  | P1-DIMMA1 Temp  | Temperature       | Nominal  | 27.00      | C     | 'OK'
808  | P1-DIMMB1 Temp  | Temperature       | Nominal  | 27.00      | C     | 'OK'
875  | P1-DIMMC1 Temp  | Temperature       | Nominal  | 27.00      | C     | 'OK'
942  | P1-DIMMD1 Temp  | Temperature       | Nominal  | 26.00      | C     | 'OK'
1009 | P2-DIMME1 Temp  | Temperature       | Nominal  | 28.00      | C     | 'OK'
1076 | P2-DIMMF1 Temp  | Temperature       | Nominal  | 29.00      | C     | 'OK'
1143 | P2-DIMMG1 Temp  | Temperature       | Nominal  | 28.00      | C     | 'OK'
1210 | P2-DIMMH1 Temp  | Temperature       | Nominal  | 29.00      | C     | 'OK'
1411 | FAN3            | Fan               | Nominal  | 6500.00    | RPM   | 'OK'
1478 | FAN4            | Fan               | Nominal  | 6400.00    | RPM   | 'OK'
ipmi_sensor_read: internal IPMI error

-> Execution of /usr/sbin/ipmi-sensors failed with return code 1.
-> /usr/sbin/ipmi-sensors was executed with the following parameters:
   sudo /usr/sbin/ipmi-sensors --quiet-cache --sdr-cache-recreate --interpret-oem-data --output-sensor-state --ignore-not-available-sensors

This obviously isn’t especially ideal – and I was seeing this every 10-30 minutes when Nagios was running its regular checks. First glances, it looked like it could be caused by defective hardware, but fortunately, that wasn’t the case – just my checks colliding with kipmid.

Centos 6 has built ipmi_si into the kernel by default – and kipmid starts on boot and starts polling any detected IPMI devices (you’ll probably see [kipmi0] running). Given I’m configuring Nagios for monitoring, I’ve no interest in the kernel helper polling my IPMI and tying it up, and if you’re reading this and you have a kernel ipmid thread, chances are, neither do you. Alternatively, it’s possible you’re reading this because your kipmi0 process is claiming to use 100% of a CPU core, perhaps because you did a bmc cold reboot or similar – in which case, this is probably useful for you too, even if you want to keep kipmid running.

You can stop kipmid in its tracks by hot-removing the IPMI device from it’s list of detected devices; first, get the parameters for the detected device like this:

# cat /proc/ipmi/0/params
kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0

Then, take the output from the above, prefix it with “remove,” and use /sys/module/ipmi_si/parameters/hotmod to remove the device:

# echo "remove,kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0" > /sys/module/ipmi_si/parameters/hotmod

The kipmid thread will be cleaned up immediately without requiring a reboot. freeipmi’s various utilities do not use ipmi_si/kipmid and will continue to work just fine.

If all you wanted to do was restart kipmid for some reason, you could then re-add the device by instead prefixing with “add,”:

# echo "add,kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=0" > /sys/module/ipmi_si/parameters/hotmod

Whereupon kipmi0 will resume.

If stopping kipmid fixes your issue (it provided immediate relief in my case), make sure it stays gone by adding the following to your kernel options in /boot/grub/menu.lst (or, wherever your bootloader configuration is, if you’re not using the default grub environment)

ipmi_si.force_kipmid=0

Weirdly, kipmid doesn’t seem to cause problems with Ubuntu 14.04 (on a Dell R200) or Debian 8 (on a different Supermicro board), so perhaps this is a centos specific issue.

Tips for Configuring Nagios3 Efficiently – part 1

Back when I started using Nagios (I think ~1.2 or earlier) I don’t remember many options for being all that efficient in terms of “lines of config written” – certainly, any options for being efficient that there may have been ended up being overlooked in the rush to get it up and running, and I’ve been largely been using the same configuration files (and style) ever since – though I did start using host and service templates as soon as I became aware of them some time back in the 2.x branch days.

In the spirit of self-improvement, I’ve been revisiting the Nagios configuration syntax as part of rolling out a fresh monitoring host based on Nagios3, and have significantly reduced the number of lines of config my Nagios installation depends on as a result.

Continue reading Tips for Configuring Nagios3 Efficiently – part 1

Installing Nagios3 on Debian Wheezy

It’s pretty straightforward to install Nagios on a Debian system but if you want to be able to use the web interface to control the nagios process a little more work is required.

Starting with a blank slate (apt/dpkg will ensure any required prerequisites will be installed):

# apt-get install nagios3 apache2-suexec

You’ll be asked to set a password for the nagiosadmin user for the web interface.

Enable check_external_commands in Nagios to enable the ability to mute alarms, make comments, restart the nagios process etc from the web interface (pretty much invaluable, but be aware of the inherent risks in enabling the ability to influence the process from “outside”)

# sed -i -e 's/check_external_commands=0/check_external_commands=1/' /etc/nagios3/nagios.cfg
# /etc/init.d/nagios3 restart

Edit the nagios3 apache2 config include to make the web interface scripts run as the nagios user so that the web interface can write to the nagios command pipe; inserting the following at the top of /etc/nagios3/apache2.conf:

User nagios
Group nagios

Restart apache..

# /etc/init.d/apache2 restart

And you’re pretty much done! You can go to http://YOUR_HOST_NAME/nagios3/ and log in with your nagiosadmin password you set up when prompted at the start of this process.

Now, you can get started with creating host and service configuration files in /etc/nagios3/conf.d/ to monitor your servers/network/etc