Strange SSA failure

From: Green, Simon (Simon.Green@EU.ALTRIA.COM)
Date: Thu May 22 2003 - 07:08:44 EDT


I had a problem with some SSA hardware yesterday. It's all resolved, but it
was rather strange: something which I've never seen before. I wondered if
any of you have come across this sort of thing.

The system is an E20 running 4.3.3 ML06. There's one 6215 adapter. Only
the A ports are used, and these have a simple loop to the back half of a
7133-020. This contained four 9.1GB and four 4.5GB disks. No RAID.

Adapter microcode is the latest level. Code on the 9.1GB disks is slightly
out of date. Code on the 4.5GB disks is mixed (as are the disks!): two of
them are slightly out of date (same as the 9.1s), one is very old and the
other is somewhere in the middle.

All of the device filesets are present and correct. The current version was
used when the adapter was installed a month or two ago and everything's been
running fine since then.

Yesterday, there was an error on one of the SCSI disks attached to this
system. Nothing special and easily taken care of. I ran diagnostics to
confirm that all was OK and was worried to see that when it reached the SSA
pdisks the diagnostics were taking a very long time - 10-15 minutes. No
errors were thrown up at that time but at the end, all eight disks were
shown in error, with SRN d0300. I don't know if there was a problem before
I ran diagnostics. Nobody had complained and there hadn't been any errors.

Obviously with all of the disks in error I suspected an adapter fault,
especially as this was a second-hand adapter which we'd acquired from one of
our other sites. Diagnostics on the adapter just gave SRN SSA02, which
isn't a lot of help!

I went over to the site with a spare card but I didn't want to replace it
without confirming the cause of the problem.

First step was a reboot in case it was some sort of adapter hang. This took
ages, spending a lot of time on LED 80c as it tried and failed to configure
the disks.

Once the system was back, I deleted all the SSA devices - including ssa0 and
ssar. I then disconnected the SSA cables and ran cfgmgr. The adapter was
configured OK and diagnostics revealed no errors.
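
For anyone wanting to repeat that clean-out step, here is a rough sketch as a
dry run: it only prints the commands rather than running them. The function
name and the device names hdisk2 and pdisk0 are placeholders made up for
illustration, and it assumes any volume groups on the SSA disks have already
been varied off. None of that comes from the actual session.

```shell
#!/bin/sh
# Dry run: print (do not execute) a plan for deleting the SSA devices
# and reconfiguring. print_ssa_reset_plan is a hypothetical helper.
# Arguments: the SSA hdisk/pdisk names to delete. On the AIX host these
# could probably be listed with: lsdev -C -s ssar -F name
print_ssa_reset_plan() {
    for dev in "$@"; do
        echo "rmdev -dl $dev"     # delete each disk definition
    done
    echo "rmdev -dl ssar"         # then the SSA router
    echo "rmdev -dl ssa0"         # then the adapter itself
    echo "cfgmgr"                 # to be run once the SSA cables are unplugged
}

# Example with placeholder device names.
print_ssa_reset_plan hdisk2 pdisk0
```

Printing the plan first gives a chance to check the device list before
anything is actually deleted.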

I then added half of the disks back and ran cfgmgr -l ssar. They were
picked up OK and all of the diagnostics were OK, including link verification
and configuration verification.

I removed those four disks and swapped the cables over to the other four
disks. This time, cfgmgr took much longer to run - several minutes rather
than a few seconds - and didn't configure any disks.

The LEDs on the back of the disks were doing weird things! I had slots
13-16. 14 and 15 looked normal. 16 had a power LED but no connection LED.
13 had all three LEDs lit! The amber LED was going on and off erratically: not
really flashing.

It seemed that either the SSA drawer or one of these four disks was faulty,
so I decided to remove them one at a time and see what happened. Starting
at the end, I removed the disk in slot 16 - one of the 9.1GB disks - and the
LEDs on the other disks stabilised: two nice, steady green lights (I'd put
in a dummy disk in slot 16).

Another cfgmgr and I had seven disks showing up and everything was OK.
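
A quick way to confirm the count after each cfgmgr is to tally the pdisks in
the Available state. This is only a sketch: count_available is a made-up
helper name, and it assumes the usual lsdev listing layout of device name
followed by status.

```shell
#!/bin/sh
# count_available: hypothetical helper that reads lsdev-style lines on
# stdin and counts those whose second column is "Available".
count_available() {
    awk '$2 == "Available" { n++ } END { print n + 0 }'
}

# Intended use on the AIX host (assumes lsdev prints name, then status):
#   lsdev -Cc pdisk | count_available
```

With the slot 16 disk out and the other seven configured, the count should
come to 7.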

Anybody seen this sort of thing? I've been working with SSA disks for
several years and I've never seen anything like this before.

Simon Green
Altria ITSC Europe s.a.r.l.

AIX-L Archive at http://marc.theaimsgroup.com/?l=aix-l&r=1&w=2
AIX FAQ at http://www.faqs.org/faqs/aix-faq/

N.B. Unsolicited email from vendors will not be appreciated.


