Entire 7133-020 SSA drawer went missing. (Long)

From: Green, Simon (SGreen@KRAFTEUROPE.COM)
Date: Mon Jul 29 2002 - 13:42:19 EDT


I was doing some work last Friday which involved some SSA recabling. This
led to a serious problem which I have not been able to resolve. I would
appreciate any insights as to the cause and how to ensure it doesn't happen
again.

Sorry for the length and turgid nature of this post, but I wanted to avoid
missing important information.

The basic configuration was an SP2 Winterhawk II Wide node with two 6230 SSA
adapters, (no fast-write cache). Four separate loops, (two A, two B),
connecting to six 7133-020 drawers with a total of 89 disks of various
sizes.

One of the SSA drawers has a minor fault on it, which is causing it to
mistakenly report a cooling/power failure. The LEDs on the power supplies
are green, but there's an orange LED on the front. Link verification shows
eight of the disks, (9-16) with the "Power" status. The other 8 are "Good".
(These groups are in different loops.) An IBM engineer has confirmed that
the power/cooling units are functioning correctly. I had a similar problem
on a 7133-D40 last year, which proved to be a problem with the flex-cable to
the drawer's front panel, so I assume this is a similar fault.

My first task was to reorganise these loops slightly to split the load more
evenly across the two adapters. At the start, the A loop on ssa0 included 8
disks whilst that on ssa1 included 32 disks. I planned to move 16
disks from ssa1 to ssa0, with an intermediate stage that would have all 40
disks in a single loop, using both adapters. The B loops on the two
adapters would not be affected. I had made some changes to the loop a
couple of weeks earlier: the two adapters used to be in the same loop. I
did not have any problems, although there were no filesystems mounted at
that time. (Some VGs were online, though.)
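
The disk counts in this plan can be checked with a few lines of Python. This is only a sketch of the arithmetic. It assumes the 32-disk loop is the A loop on ssa1, which the planned move of 16 disks from ssa1 implies, and the names `start`, `merged` and `move_disks` are made up for illustration.

```python
# Disks per adapter on the A loops, as described above.
# Assumption: the 32-disk loop is the A loop on ssa1.
start = {"ssa0": 8, "ssa1": 32}

# Intermediate stage: one loop shared by both adapters.
merged = sum(start.values())


def move_disks(loops, src, dst, count):
    """Return new per-adapter counts after moving `count` disks from src to dst."""
    if count > loops[src]:
        raise ValueError("cannot move more disks than the source loop holds")
    result = dict(loops)
    result[src] -= count
    result[dst] += count
    return result


final = move_disks(start, "ssa1", "ssa0", 16)
print(merged)
print(final)
```

With these numbers the intermediate loop holds 40 disks, and the end state is 24 disks on ssa0 and 16 on ssa1.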

This is the sort of thing I have done many times in the past and after
checking my cable plans carefully I was confident of being able to do it
with no disruption to the system, (the application - SAP - would be left
running throughout this work).

I made my changes but when I went to check the system status I was horrified
to see hundreds of error logs for I/O errors, and some LVM errors indicating
corrupt filesystems.

My first thought was that I'd made an error in my cable changes and isolated
some disks, but lscfg showed all pdisks and hdisks as Available. I also ran
maymap, which showed the loops to be configured as I intended, and lsvg -p
showed all disks as active.

Further investigation, however, showed that despite all of this, there were
still missing disks.

Diagnostics produced an SRN of 301C0, which indicated an error in an SSA
drawer. Continuing, diagnostics indicated that some disks were missing.
Cross referencing these with my documentation I determined that an entire
drawer of 16 disks had gone missing, (although the disks still showed as
available!). This was NOT the drawer with the false power error.
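
The cross-referencing step could be sketched roughly as follows. The pdisk names, drawer labels and layout here are entirely hypothetical stand-ins for real cabling documentation; the point is only to show how a list of missing disks can be grouped by drawer to spot a drawer that has vanished as a whole.

```python
from collections import Counter


def missing_per_drawer(layout, missing):
    """Count missing disks per drawer.

    layout  maps pdisk name -> drawer label (taken from cabling records)
    missing is an iterable of pdisk names reported missing by diagnostics
    """
    return Counter(layout[name] for name in missing)


# Hypothetical layout: two drawers of 16 disks each.
layout = {f"pdisk{i}": ("drawer1" if i < 16 else "drawer2") for i in range(32)}
# Hypothetical diagnostics result: pdisk16 to pdisk31 are missing.
missing = [f"pdisk{i}" for i in range(16, 32)]

counts = missing_per_drawer(layout, missing)
drawer_sizes = Counter(layout.values())

# A drawer is wholly missing when every one of its disks is in the missing list.
wholly_missing = [d for d, n in counts.items() if n == drawer_sizes[d]]
print(wholly_missing)
```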

Going through Link Verification in the SSA service aids, I could see that those
16 pdisks were listed for the relevant adapter, but had no status against
them. The eight other disks are the ones in the back of the faulty drawer
mentioned above, and still showed a status of "Power"; the disks in the B
loop showed as "Good".

In order to fix things, I first of all tried cfgmgr. This didn't produce
any errors but Link Verification still failed to show a status for the 16
disks. I then shut down the node and powered off all three drawers attached
to ssa1. After a short time I powered the drawers back on. The faulty
drawer still had its orange LED on the front, but so did the one which had
gone missing, (it had not shown an orange LED at any time before).

After another minute or two, I started up the node. During its boot, the
orange LED on the second drawer went out. Once the boot was complete Link
Verification showed the 16 disks were back, with a status of "Good". We
were able to re-start the application and Oracle recovered the database
without problems. Further work was abandoned.

Today, I've been looking at the system. The adapter microcode is AB00,
which is about a year old. Current version is B800, so I'll get that
updated when I can but I don't think it's a problem: all of the changes are
for new function, not corrective.

Looking at the disk microcode, I see something odd.

lscfg -cl pdiskX gives me all of the information I'd expect. But if I
display the disk microcode levels through the service aids, the sixteen
disks which went missing simply show "????" instead of the appropriate
level.
"ssadload -s -a ssa1" fails to show those 16 disks, but shows the other
disks in the A loop and all disks in the B loop.
"ssadload -s" produces 16 error messages: "ssadload: System call failed".

From the lscfg output, 14 of the disks are at the latest level.
One is at 9901 and should be at 9911.
One is at 0048 and should be at 9911.
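
A comparison like this is easy to script once the levels have been collected. The sketch below is illustrative only: the pdisk names are hypothetical, one entry stands in for the 14 up-to-date disks, and it assumes for simplicity that 9911 is the target level for every disk shown.

```python
# Microcode levels as read from lscfg. The pdisk names are hypothetical;
# only the levels 9901, 0048 and the target 9911 come from the post.
TARGET = "9911"
levels = {
    "pdiskA": "9901",
    "pdiskB": "0048",
    "pdiskC": "9911",  # stands in for the 14 disks already at the latest level
}


def needs_update(levels, target):
    """Return the disks whose level differs from the target, sorted by name."""
    return sorted(name for name, level in levels.items() if level != target)


print(needs_update(levels, TARGET))
```

Levels are compared as strings for equality rather than as numbers, since a level such as 0048 does not obviously sort below 9901 in release order.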

I seem to recall having problems when I updated the disk microcode on this
system last year: when I updated all disks, (through the service aids) some
disks weren't updated and I had to go back and update them individually. I
don't know if it was these disks, but it seems quite likely, under the
circumstances.

OK; that's everything I can think of. I await your comments with bated
breath.

Simon Green
Philip Morris ITSC Europe



