UPDATE: 100% busy on SAN based disk

From: Johan Hartzenberg (jhartzen@csc.com)
Date: Thu Oct 03 2002 - 13:10:50 EDT


I have had some discussion with a small number of persons from this group,
as well as with Sun, JNI, Inrange, HDS and MGX - the vendors involved.

To date I have not found a resolution.

I have however found several things:
Setting sd_max_throttle and sd_io_time does not seem to make a difference.

The problem seems to be most prevalent, by a large margin, on domains
configured with the JNI FCE2-1063 card.

Since putting each host in its own dedicated little zone on the fabric, the
problem seems less prevalent (no incidents in 4 days).

This has happened about 20 times since my original post last month. We
blew our service levels of course, and both Sun and JNI are of no help right
now.

Sun engineers are suggesting that it is because I can see 4 paths to each
LUN (the LUN is on 2 ports on the HDS and the host has dual HBAs).
Sun suggests that Veritas may have a problem seeing more than 2 paths to a
LUN. Sun also suggests that we have too many hosts per HDS port (4 HDS
ports, each port with several LUNs for all of the 23 hosts).

Following up on this lead, we have moved one host off onto its own
dedicated set of ports.

The messages file errors are still appearing (SCSI transport errors, vxdmp
path disable messages, and jnic target fail messages)
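
To track whether a change (zoning, dedicated ports, a new driver) reduces these errors, it may help to count them per category over time. The following is only a rough sketch: the function name count_san_errors is made up for illustration, and the three match strings are assumptions taken from the wording above, not exact message texts, so adjust them to what your messages file really contains.

```shell
# Hypothetical helper: read a messages-style log on stdin and print one
# count per error category. The patterns are assumptions, not exact
# driver message texts.
count_san_errors() {
    awk '
        /SCSI transport/ { scsi++ }
        /vxdmp/          { dmp++ }
        /jnic/           { jni++ }
        END {
            # Unset counters print as 0 with %d.
            printf "scsi_transport=%d vxdmp=%d jnic=%d\n", scsi, dmp, jni
        }
    '
}
```

Typical use would be something like `count_san_errors < /var/adm/messages`, run before and after a change.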

There may be an issue with JNI cards receiving multiple RSCNs in rapid
succession, an issue the latest JNI driver addresses. This is something I
will test.

  _Johan

                                                                                                                                   
From:    Johan Hartzenberg/GIS/CSC@CSC
Sent by: sunmanagers-admin
Date:    16/09/2002 02:44 PM
To:      sunmanagers@sunmanagers.org
cc:
Subject: 100% busy on SAN based disk

Hi,

I have some 35-odd systems connected to a SAN for access to disk. Once in
a while one of these systems goes into a state where it starts to respond
extremely slowly.

iostat -xnc 5

will show 100 under the %b column, and you get very high I/O wait, average
service times go to 15 seconds and more, etc. Generally bad.
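
To catch this state without watching the screen, the iostat output can be filtered for devices at or above a given %b value. This is a minimal sketch, not a tested monitoring tool: busy_devices is a hypothetical name, the script assumes the usual `iostat -xn` layout with a header line containing "%b" and "device", and it assumes an awk that supports -v (on Solaris that may mean nawk rather than the stock awk).

```shell
# Hypothetical helper: read `iostat -xn` style output on stdin and print
# "device %b" for every device whose %b is at or above the threshold
# (first argument, default 100).
busy_devices() {
    threshold="${1:-100}"
    awk -v t="$threshold" '
        # Header line: remember which columns hold %b and the device name.
        /%b/ && /device/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "%b") bcol = i
                if ($i == "device") dcol = i
            }
            next
        }
        # Data line: long enough to have a device column, numeric %b.
        bcol && NF >= dcol && $bcol ~ /^[0-9.]+$/ && $bcol + 0 >= t + 0 {
            print $dcol, $bcol
        }
    '
}
```

For example, `iostat -xn 5 | busy_devices 100` would print a line each interval for every device sitting at 100% busy. The CPU lines that -c adds are skipped because they have too few fields.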

If I run
drvconfig
devlinks; disks
vxdctl enable

then almost immediately the I/O will normalise and everything goes back to
normal. This has happened about 5 times, maybe a few more, in the past 3
weeks.
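
The workaround above can be wrapped in a small function so that the four commands always run in the same order and the sequence stops at the first failure. This is just a sketch: rescan_devices and the DRY_RUN variable are names invented here, and it does nothing beyond running the commands already listed.

```shell
# Hypothetical wrapper around the rescan sequence described above.
# Set DRY_RUN=1 to print the commands instead of executing them.
rescan_devices() {
    for cmd in "drvconfig" "devlinks" "disks" "vxdctl enable"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would run: $cmd"
        else
            # $cmd is deliberately unquoted so "vxdctl enable" splits
            # into a command and its argument.
            $cmd || { echo "failed: $cmd" >&2; return 1; }
        fi
    done
}
```

Running it once with DRY_RUN=1 shows what would be executed before doing it for real as root.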

This is on Solaris 2.6 with current patches, using JNI HBAs and accessing
HDS 9960 based disk.

Also on these same SAN-connected systems, I sometimes see something much
more severe. A system will go into a state where all commands which access
the disk (e.g. find /sandisk1) hang completely and cannot be
interrupted or even killed. When this happens, the only way to get things
to work again is with uadmin; reboot doesn't work, nor does halt. After a
reboot everything seems perfectly normal! (So far this has happened 3
times, once on each of 3 separate systems, all in the past 10 days.)

If you have seen anything similar please let me know your findings!

Thanx,
  _Johan
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


