Friday, January 20, 2012

Updating to CentOS 5.7 on an HP Proliant DL380 G5 Server–The Saga Continues

Recently we updated our CentOS server from version 5.5 to 5.7. Many updates had accumulated since the original installation, so I decided to bite the bullet and install them all.

Installing the updates was straightforward. I did it from the GUI desktop, and everything installed without any problems.

Restarting the server was not so straightforward. Since we had installed HP PSP and tweaked it to work with CentOS 5.5, I did not expect it to keep working after the update.

Sure enough, the server would reboot itself when it tried to load the IPMI driver.

So at Linux startup, I hit the “I” key to go into interactive startup mode. I skipped all of the HP drivers, and the server booted all the way up.

Then I uninstalled all of the HP PSP packages. I downloaded the latest HP PSP, version 8.7, and installed it. HP has moved the installation instructions, so the link in my previous post no longer works. Now you can find the same instructions here:
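To find what needs uninstalling, it helps to list the installed HP packages first. Here is a minimal sketch; the function name filter_hp_packages is made up for illustration, and the name prefixes in the pattern are my assumption about typical PSP 8.x package names, not an authoritative list.

```shell
# Filter a package list (one name per line on stdin) down to names that
# look like HP PSP components.
# ASSUMPTION: the prefixes below are guesses at typical PSP package names;
# adjust them to match what is actually installed on your server.
filter_hp_packages() {
    grep -E '^(hp-|hpsmh|hpacucli|hponcfg)' || true
}
```

On the server, something like `rpm -qa --qf '%{NAME}\n' | filter_hp_packages` should print the candidates. Review the list by eye before removing anything with `rpm -e`.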

http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/HOWTO-Guide-Instructions-Install-HP-PSP-On-CentOS-5/m-p/4652208/highlight/true#M103665

The new version of PSP actually works better than the older one: it works right after installation. No tweaking was required.

There is one thing, however: SNMP had to be re-configured. Before re-configuring SNMP, the management webpage on port 2381 showed a bunch of blank rectangles.

I checked the messages log in /var/log. It showed a couple of bogus errors:

  1. The “Network Manager” does not recognize the bonded network adapter type, and logs errors in the messages log. So I removed “TYPE=BOND” from /etc/sysconfig/network-scripts/ifcfg-bond0. This line was inserted by the CentOS Network Manager GUI, and its absence has no effect on the adapter.
  2. The slave adapters of the bonded adapter, namely eth0 and eth1, had configurations that caused the “Network Manager” to log errors in the messages log. Although the slave adapters do not have IP addresses, the “Network Manager” looks for them in the configuration files anyway. Somehow it grabbed the MAC address lines and treated them as IP addresses. Since the MAC addresses start with “0”, the “Network Manager” logged errors saying that “IP addresses cannot start with 0”. Since the CentOS Network Manager GUI had put the MAC addresses in the configuration itself, I removed them to see if the errors would go away. They did not: the “Network Manager” then complained that “IP addresses were missing”, even though none are needed for the slave adapters. So I put the MAC addresses back and set DHCP to “on” for these slave adapters. They will never go out and acquire IP addresses, since they are only slaves of the bond, but this made the “Network Manager” stop logging errors in the messages log.
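For reference, a slave adapter file after these changes might look roughly like the sketch below. This is a reconstruction, not a copy of our actual file: the MAC address is a placeholder, and the ONBOOT, MASTER and SLAVE lines are my assumption of the usual bonding settings. Only the MAC address line and the DHCP setting come from the steps above.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
# Placeholder MAC address; keep the one the GUI wrote for your adapter.
HWADDR=00:11:22:33:44:55
# DHCP "on": the setting that stopped the logged errors.
BOOTPROTO=dhcp
# Assumed standard bonding lines, not taken from the original file.
ONBOOT=yes
MASTER=bond0
SLAVE=yes
```

In ifcfg-bond0 the only change was deleting the “TYPE=BOND” line; everything else stayed as the GUI wrote it.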

On the “System Management Homepage” on port 2381, I wanted to test SNMP traps. Unfortunately, the test does not actually simulate any hardware trap that the HP SNMP Agents would notice. Even though we had set up the trapmail line in /opt/hp/hp-snmp-agents/cma.conf to send emails, none were sent when I clicked on the test button. Only when I unplugged one of the two network cables, which generated traps that the HP SNMP Agents do notice, did I get emails: 60 of them in a fraction of a second.

So the test button on the “System Management Homepage” SNMP settings page only tests notification mechanisms that are not really relevant to the HP SNMP Agents. To see the trap messages for this test, run “snmptrapd -f -Le <hostname>” on the command line and watch the output on the screen. Please note that the /etc/snmp directory needs to contain a file named “snmptrapd.conf”, and that file needs to contain the line “disableAuthorization yes”. Otherwise “snmptrapd” will not show anything. The “<hostname>” part must match exactly what is in the “snmpd.conf” file. Using an IP address or another alias will not work.
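The snmptrapd setup can be scripted along these lines. This is a sketch, not something I ran as-is: ensure_line is a made-up helper name, and TRAP_HOST is a hypothetical variable standing in for the hostname from snmpd.conf.

```shell
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper, not part of Net-SNMP.
ensure_line() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage on the server, as root (TRAP_HOST is a placeholder you must set
# to the exact hostname used in snmpd.conf):
#   ensure_line 'disableAuthorization yes' /etc/snmp/snmptrapd.conf
#   snmptrapd -f -Le "$TRAP_HOST"
```

With snmptrapd running in the foreground, clicking the test button should print the trap in that terminal. Running the helper twice is harmless, since it will not add the line a second time.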

The HP SNMP monitoring is not very reliable. From time to time the “System Management Homepage” on port 2381 shows a bunch of empty boxes while still indicating that the system status is OK. After a reboot, all the boxes are populated again and everything goes back to normal.

It looks like we just need to watch this server very closely.
