Problems with 100% CPU usage after 4.0f to 5.1a upgrade

From: McCracken, Denise (Denise.McCracken@misyshealthcare.com)
Date: Mon Apr 26 2004 - 16:42:34 EDT


        We have a 4100 running a MUMPS database that was upgraded from 4.0f
to 5.1a last week. Firmware is 6.0, which looks like the latest for the
4100, and we applied patch kit 4. We have upgraded other 4100s and never had
a problem. This system also ran for years on 4.0f with no problems.

        Now, every morning, MUMPS hangs and has to be forced down. Before
and after the hang, there is 100% CPU usage. The jobs that are tying it up
are cumulative reports, which are sent to printers on the network. The OS
does not actually hang, nor does the machine crash, although we did try
rebooting it once, which didn't help. After the hang, when we have forced
MUMPS down, the jobs kick off again, CPU usage goes to 100%, then all jobs
complete at once and CPU goes down to 90% idle.

        We have checked the error logs and can find only two things that
might be related to this problem. One is a string of network messages about
cards using address 255.255.255.255. The other is a set of machine check
errors, but the problem has gone on since last week, and the machine check
errors happened today only.

        The network errors, from /var/adm/messages:

Apr 26 08:31:00 osflab vmunix: arp: illegal IP address 255.255.255.255 is used by hardware address 00-80-64-24-7C-0C!
Apr 26 09:51:59 osflab vmunix: arp: illegal IP address 255.255.255.255 is used by hardware address 00-80-64-24-95-61!
Apr 26 10:13:15 osflab vmunix: arp: illegal IP address 255.255.255.255 is used by hardware address 00-80-64-24-7D-D3!
Apr 26 12:54:25 osflab vmunix: arp: illegal IP address 255.255.255.255 is used by hardware address 00-80-64-24-A8-7B!
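
        To see how many distinct cards are involved, the arp messages can
be tallied with a short script. This is only a rough sketch: the function
name summarize_arp_offenders is made up for illustration, and it assumes each
message sits on a single line in the log file, in the format shown above.

```python
import re
from collections import Counter

# Matches the vmunix arp complaint as it appears in the messages above.
ARP_RE = re.compile(
    r"arp: illegal IP address (\d+\.\d+\.\d+\.\d+) is used by "
    r"hardware address ([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})"
)

def summarize_arp_offenders(lines):
    """Count arp 'illegal IP address' messages per hardware address.

    Returns two Counters: one keyed by full hardware address, one keyed
    by the first three octets (the vendor prefix of the address).
    """
    macs = Counter()
    prefixes = Counter()
    for line in lines:
        match = ARP_RE.search(line)
        if match is None:
            continue  # not an arp complaint, skip it
        mac = match.group(2).upper()
        macs[mac] += 1
        prefixes[mac[:8]] += 1  # e.g. "00-80-64"
    return macs, prefixes
```

        It could be run against the log with something like
summarize_arp_offenders(open("/var/adm/messages")). In the four messages
above, every hardware address is different but all begin with 00-80-64,
which may suggest several devices of the same make rather than one bad card.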

        The machine check errors, from today:

**** V3.3 ********************* ENTRY 2 ********************************
 
 
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 469.
Timestamp of occurrence 25-APR-2004 20:01:10
Host name osflab
 
System type register x00000016 Alpha 4000/1200 Series
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000
 
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. Machine Check Error - (major class)
                                  3. - (minor class)
 
 
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number NI8260FT4D
Module Serial Number
Module Type x0000
System Revision x00000000
 
Machine Check Reason x0204 IOD Detected Soft Error
 
Ext Interface Status Reg x0000000000000000
                                     Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
                                     Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
                                     Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
                                     Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
 
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
                                     IOD# 1
Dev Type & Rev Register x06000332 CAP Chip Revision: x00000002
                                     B3040 Revision: x00000003
                                     B3050 Revision: x00000003
                                     AlphaServer 4100
MC Error Info Register 0 x12C41940
                                     MC Bus Trans Addr<31:4>: 12C41940
MC Error Info Register 1 x800E8800 MC bus trans addr <39:32> x00000000
                                     MC Command is Read0-Mem
                                     CPU0 Master at Time of Error
                                     Device ID: x00000002
                                     MC error info valid
CAP Error Register x88000000 Correctable ECC err det by MDPA
                                     MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
 
PALcode Revision Palcode Rev: 1.23-2
 

        Any suggestions on where to go with this?

thanks

-denise

"Customer service may be the only way that a
company can distinguish itself from its
competition these days." -H. Frank Gibbard

Denise McCracken, Systems Software Specialist
Misys Healthcare Systems, Tucson, AZ

Certified Tru64 v5 Systems Administrator
Comptia Network+ Certified Professional



This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 10:49:57 EDT