[HPADM] SUMMARY, seeking exercises for replaced hardware

From: Mark Fearer (mfearer@cox.net)
Date: Wed Oct 05 2005 - 06:41:11 EDT

Next message: Ben Le: "[HPADM] HP Technology Forum Conference"
Previous message: johnny cochran: "[HPADM] coda template"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Hello, list. Thanks to all who responded. Several expressed interest in
seeing a summary, so here goes. Also, I want to apologize if the wording of
my original question was vague, as I received many responses about how to
re-engineer my data center. Unfortunately, I am not in a position to do
that.

Original question:

Admins. Hello. I am in a data center that has roughly 200 HP servers. They
are A, D, K, L and N class and are running 11.0 or 11.11. They are mission
critical and require an uptime of 19x7 with a maintenance window from
midnight to 5:00am. We are constantly replacing failed hardware.
I wish to ask of the list some good verification exercises for testing
replaced hardware. Specifically:

hard drives - ioscan? diskinfo?
memory - stm?
cpu - stm? anything in the BCH?
tape devices - tar? ioscan? stm?
mother boards - firmware?
firmware - GSP? BCH?
fiber channel cards - ioscan? fcmutil?

Again, I am asking if anyone recommends some good testing for replaced
hardware mentioned above

Summary:

Disks

Memory

CPU

Tape devices

Network cards (NICs)

Mother boards

Firmware (GSP, PDC)

Fiber channel cards

DISKS

commands: ioscan, diskinfo, pvdisplay, vgdisplay, lvdisplay, lvlnboot,
setboot, stm, cstm, dmesg

files: /etc/fstab, /etc/lvmtab, /etc/lvmconf/vg##.conf, syslog.log, etc.

If a disk is replaced, two good commands for verifying that the system sees
the new disk are 'diskinfo' and 'ioscan -fnC disk'.

Doing an 'ioscan -fnC disk' command first should display all of the storage
devices that are attached to a server. You should see the H/W Type as
'DEVICE' and the S/W State as 'CLAIMED'. You should also see the H/W Path
and the device path displayed.

Example:

[root@sparky: ]# ioscan -fnC disk

Class I H/W Path Driver S/W State H/W Type Description

============================================================

disk 0 0/0/1/1.15.0 sdisk CLAIMED DEVICE HP
36.4GST336753LC

/dev/dsk/c1t15d0 /dev/rdsk/c1t15d0

disk 1 0/0/2/1.15.0 sdisk CLAIMED DEVICE HP
36.4GST336753LC

/dev/dsk/c3t15d0 /dev/rdsk/c3t15d0

A 'diskinfo' command followed by the raw character device path should give
you the basic technical specifications of the disk.

Example:

[root@sparky: ]# diskinfo /dev/rdsk/c1t15d0

SCSI describe of /dev/rdsk/c1t15d0:

vendor: HP 36.4G

product id: ST336753LC

type: direct access

size: 35566480 Kbytes

bytes per sector: 512

You can test a disk for bad sectors and read errors with the following
exercise:

dd in=/dev/rdsk/<your_disk> of=/dev/null bs=64k

Example:

[root@sparky: ]# dd if=/dev/rdsk/c1t15d0 of=/dev/null bs=64k

555726+1 records in

555726+1 records out

Another exercise that will test your disk is 'stm'. From the command line,
enter 'stm'. This will invoke a GUI or a TUI. From this menu, you can
highlight your device, then choose 'Tools', 'Exercise', 'Run'.

Example:

If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days. You
might also look into keeping track of any serial numbers or unique ids on
the parts.

MEMORY

commands: stm, cstm, dmesg, top, glance

files: /var/adm/messages

An exercise that will test your memory is 'stm'. From the command line,
enter 'stm'. This will invoke a GUI or a TUI. From this menu, you can
highlight your device, then choose 'Tools', 'Exercise', 'Run'.

Example:

Another exercise that will test your memory is 'top'. From the command
line, enter 'top'. This will invoke a text output graph of the top processes
that are using real time resources and shows the current usage of memory and
cpu.

Example:

top

System: sparky Mon Oct 3 15:47:03
2005

Load averages: 0.16, 1.03, 0.88

167 processes: 148 sleeping, 19 running

Cpu states:

CPU LOAD USER NICE SYS IDLE BLOCK SWAIT INTR SSYS

0 0.15 0.2% 0.0% 0.2% 99.6% 0.0% 0.0% 0.0% 0.0%

1 0.18 0.2% 0.0% 0.0% 99.8% 0.0% 0.0% 0.0% 0.0%

--- ---- ----- ----- ----- ----- ----- ----- ----- -----

avg 0.16 0.2% 0.0% 0.2% 99.6% 0.0% 0.0% 0.0% 0.0%

Memory: 147732K (89772K) real, 231180K (143120K) virtual, 957008K free
Page# 1/

CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU
COMMAND

0 ? 37 root 152 20 6464K 6464K run 0:14 0.43 0.43 vxfsd

1 ? 2111 root 152 20 14804K 4040K run 0:01 0.20 0.20 dmisp

0 ? 3755 root 152 20 37980K 6944K run 0:13 0.18 0.18
mhragent

Still another exercise that will test your memory is Measureware's 'glance'.
>From the command line, enter 'glance'. This will invoke a text output graph
of the top processes that are using real time resources and shows the
current usage of memory and cpu.

Example:

B3692A GlancePlus C.03.72.00 16:14:43 sparky 9000/800 Current Avg
High

----------------------------------------------------------------------------

----
CPU  Util   SA                                                 |  3%  3%
3%
Disk Util   FV                                                 |  3%  3%
3%
Mem  Util   S           SU   UB        B                       | 54%   54%
54%
Swap Util   U     URR                                          | 16%   16%
16%
----------------------------------------------------------------------------
----
                                  PROCESS LIST                      Users=
2
                              User      CPU Util     Cum     Disk
Thd
Process Name   PID   PPID Pri Name   (  200% max)    CPU   IO Rate    RSS
Cnt
----------------------------------------------------------------------------
----
glance        8899   8018 168 root      4.0/ 4.6     0.1  6.0/30.9   6.6mb
1
swapper          0      0 127 root      0.0/ 0.0     1.3  1.3/ 0.2    32kb
1
vxfsd           37      0 138 root      0.0/ 0.1    17.9  1.3/ 9.6   6.3mb
16
 
 
 
CPU
 
commands: stm, cstm, top, glance
 
You might consider opening a several windows and running 
recursive jobs such as 'find', 'll -R', 'cp'.  You can generate good cpu
loads and make sure you are getting all cpu's by monitoring with a 'top' or
a 'glance'.
 
Example of a 'top':
System: sparky                                        Tue Oct  4 13:29:04
2005
Load averages: 0.00, 0.02, 0.04
183 processes: 163 sleeping, 20 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0    0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
 1    0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
---   ----  -----  -----  -----  -----  -----  -----  -----  -----
avg   0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
 
Memory: 159532K (94984K) real, 244976K (146336K) virtual, 879716K free
Page# 1/
17
 
CPU TTY    PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU
COMMAND
 1   ?      37 root     152 20  6464K  6464K run      1:04  0.31  0.31 vxfsd
 
Example of a 'glance':
B3692A GlancePlus C.03.72.00    13:31:52   sparky 9000/800    Current  Avg
High
----------------------------------------------------------------------------
----
CPU  Util   S                                                  |  1%    1%
2%
Disk Util   F                                                  |  2%    2%
2%
Mem  Util   S             SU   UB        B                     | 58%   58%
58%
Swap Util   U     URR                                          | 17%   17%
17%
----------------------------------------------------------------------------
----
                                  PROCESS LIST                      Users=
5
                              User      CPU Util     Cum     Disk
Thd
Process Name   PID   PPID Pri Name   (  200% max)    CPU   IO Rate    RSS
Cnt
----------------------------------------------------------------------------
----
glance       15070  14158 154 root      1.1/ 2.3     0.2  0.1/ 9.4   6.8mb
1
swapper          0      0 127 root      0.0/ 0.0     2.7  1.3/ 0.2    32kb
1
vxfsd           37      0 138 root      0.0/ 0.1    63.8  4.0/ 3.8   6.3mb
18
 
Of course, there is 'stm', too, for stress testing a CPU.
 
Example:
 
You can disable a CPU by rebooting and pausing at the firmware prompts. You
can then go to 
the COnfigure section and enable or disable CPU's from the firmware.
 
If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days.  You
might also look into keeping track of any serial numbers or unique ids on
the parts.
 
 
 
TAPE DEVICES
 
Commands: ioscan, stm, tar
 
Upon replacement of a tape device, do an 'ioscan -fnC tape' or an 'ioscan
-fn' to assure that the new device is seen on the bus and that it is
claimed.
 
Example:
[root@smith: ]# ioscan -fnC tape
Class     I  H/W Path     Driver S/W State   H/W Type     Description
=====================================================================
tape      0  0/0/1/0.3.0  stape CLAIMED     DEVICE       HP      C1537A
                         /dev/rmt/0m            /dev/rmt/c0t3d0BESTn
                         /dev/rmt/0mb           /dev/rmt/c0t3d0BESTnb
                         /dev/rmt/0mn           /dev/rmt/c0t3d0DDS
                         /dev/rmt/0mnb          /dev/rmt/c0t3d0DDSb
                         /dev/rmt/c0t3d0BEST    /dev/rmt/c0t3d0DDSn
                         /dev/rmt/c0t3d0BESTb   /dev/rmt/c0t3d0DDSnb
 
'ioscan' will show you if it's on the bus, but won't tell you about the 
condition of the heads.  Do some write/read to the tape. 
 
Do some 'tar' tests. For example, you may want to search the /tmp directory
for some medium sized file to 'tar' to (cvf) and then extract from (xvf) the
new tape device. 
 
Example:
[root@smith: tmp]# cd /tmp
[root@smith: tmp]# ll
-rw-r--r--   1 oracle     dba          42350 May  8 10:05 DB_Check.sql
 
[root@smith: tmp]# tar cvf /dev/rmt/0m DB_Check.sql
a DB_Check.sql 83 blocks
 
Check the 'tar' to tape with a 'tvf' (table of contents).
 
[root@smith: tmp]# tar tvf /dev/rmt/0m
rw-r--r-- 103/102  42350 May  8 10:05 2005 DB_Check.sql
 
Compare the original file with the file that was tarred to and then
extracted from tape with use of the 'cksum' command.
 
[root@smith: tmp]# pwd
/tmp
[root@smith: tmp]# mkdir test
[root@smith: tmp]# cd test
[root@smith: test]# tar xvf /dev/rmt/0m
x DB_Check.sql, 42350 bytes, 83 tape blocks
 
[root@smith: test]# cksum DB_Check.sql
4182244438 42350 DB_Check.sql
[root@smith: test]# cd ..
[root@smith: tmp]# cksum DB_Check.sql
4182244438 42350 DB_Check.sql
 
And, of course, there is 'stm' which can be used to stress test a tape
device. Assure that you have a tape loaded that can be written to (with no
valuable data on it, as it will be overwritten).
 
Example:
 
There may be cases of 'tar' failing with an error message 'end of tape
reached' in a few seconds from starting the backup. It would happen even
with a fresh tape. Eventually I found out that the tape drive (DDS3) wasn't
recognizing only 125m tapes (which it has always been recognizing). The same
works fine with 90 or 120m tapes.
 
If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days.  You
might also look into keeping track of any serial numbers or unique ids on
the parts.
 
 
 
NETWORK CARDS (NICS)
 
Commands: lanadmin, lanscan, netstat -in
 
Find the driver:
'ioscan -fkC lan' will display the btlan driver loaded in the kernel.
 
Example:
[root@sparky: tmp]# ioscan -fkC lan
Class     I  H/W Path  Driver S/W State   H/W Type     Description
==================================================================
lan       0  0/0/0/0   btlan CLAIMED     INTERFACE    HP PCI 10/100Base-TX
Core
lan       1  0/2/0/0   btlan CLAIMED     INTERFACE    HP A5230A/B5509BA PCI
10/100Base-TX Addon
lan       2  0/4/0/0   btlan CLAIMED     INTERFACE    HP A5230A/B5509BA PCI
10/100Base-TX Addon
 
'lanadmin'
 
Example:
[root@sparky: tmp]# lanadmin -x 1
Current Config                   = 100 Full-Duplex MANUAL
 
'lanscan'
 
Example:
[root@sparky: tmp]# lanscan
Hardware Station        Crd  Hdw   Net-Interface    NM   MAC       HP-DLPI
DLPI
Path     Address        In#  State NamePPA          ID   Type      Support
Mjr#
0/0/0/0  0x00306E1CC054 0    UP    lan0 snap0       1    ETHER       Yes
119
LinkAgg0 0x001083F67C58 900  UP    lan900 snap900   5    ETHER       Yes
119
LinkAgg1 0x000000000000 901  DOWN  lan901 snap901   6    ETHER       Yes
119
LinkAgg2 0x000000000000 902  DOWN  lan902 snap902   7    ETHER       Yes
119
LinkAgg3 0x000000000000 903  DOWN  lan903 snap903   8    ETHER       Yes
119
LinkAgg4 0x000000000000 904  DOWN  lan904 snap904   9    ETHER       Yes
119
 
'netstat -in'
 
Example:
[root@sparky: tmp]# netstat -in
Name      Mtu  Network         Address         Ipkts   Ierrs Opkts   Oerrs
Coll
lo0       4136 127.0.0.0       127.0.0.1       2935    0     2935    0     0
lan900    1500 10.112.151.0    10.112.151.142  43284362 25    16608331 0
0
 
 
If you use APA (Auto Port Aggregation), the command to display the output of
APA is a 'lanscan -q'.
 
Example:
[root@sparky: tmp]# lanscan -q
0
900        1 2
901
902
903
904
 
You can also test the speed of the network connection by choosing a medium
sized file and doing a secure copy to some remote host.
 
Example:
[root@sparky: tmp]# scp sparky-2.log etc-k09:/tmp
Password:
sparky-2.log                      100%  203KB   3.2MB/s   00:00
 
Driver Configuration file. Use this to match the lan found with a 'lanscan'
or a 'netstat -in' with its proper driver.
 
 name in /etc/rc.config.d/
 Driver  duplex file     Description                     Software
 ----------------------------------------------------------------
 btlan    hpbtlanconf
 btlan0  hpeisabtconf
 btlan1  hpbasetconf
 btlan3  hpbase100conf   Internal (core) LAN card
 btlan4  hpgsc100conf
 btlan5  hppci100conf    "external" PCI card             A5230A
 btlan6  hpsppci100conf  V-class PCI lan card
 gelan   hpgelanconf     GigaBit card                    A4926A
 
If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days.  You
might also look into keeping track of any serial numbers or unique ids on
the parts.
 
 
 
MOTHER BOARDS
 
log files on /var/adm/diag 
 
Regardless of which vendor might replace a motherboard, HP will most likely
have to be present to flash proprietary firmware and set the CPU ID.
 
The cpu tests would probably get most of your mother board cache issues.
 
If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days.  You
might also look into keeping track of any serial numbers or unique ids on
the parts.
 
 
 
FIRMWARE (GSP, PDC)
 
You can cpu firmware at boot and upgrade as needed.  'stm' can often give 
you firmware on other components.
 
Here is an example of a GSP firmware upgrade:
 
The procedure for upgrading GSP Firmware (from B.02.07 to B.02.20):
 
1) If Firmware is older than B.02.15 then it must be updated in step
sequence to revision B.02.15 first before it can be updated to B.02.20.
Install GSP Firmware B.02.15 patch PF_CCANGSPB0215  (best done by HP CE as
laptop needed to set up private network to talk to GSP card)
 
2) While HP-UX patches are not technically required they are highly
recommended to be installed at B.02.15 update (network patch requires a
reboot).
 
For 11.00:
  1. PHNE_27393, Cumulative Mux and Pty Patch, superseded by PHNE_27788,
then by PHNE_29519 
  2. PHCO_27370, mksf cumulative patch 
 
For 11.11:
  1. PHNE_26326, Cumulative Mux and Pty Patch, superseded by PHNE_32616 
  2. PHCO_27243, mksf cumulative patch, 7/23/2002, superseded by PHCO_32202 
 
3) Install A6696B GSP Firmware B.02.20 patch PF_CCANGSPB0220  (5/19/2003)
 
4) Server does not have to be shut down during the GSP Firmware update
itself, but at the end of the GSP Firmware update the server needs to be
rebooted in order for PDC Online Diagnostics to communicate again with GSP
(using new GSP Firmware version).
 
Here is an example of a PDC firmware upgrade:
 
Upgrading PDC Firmware (requires reboot):
 
For the RP5430 (L1500) and RP5470 (L3000):
 
Install Firmware 44.12 patch PF_CARIW4412, which is a user-installable
patch:
 
For 11.00:
  PHSS_30631, Warning, superseded by PHSS_30839 (9/15/2004)
 
For 11.11:
  PHSS_30632 (4/23/2004)
 
For the RP7405, RP7410 and RP8400:
 
Install Firmware 6.4 [17.009] (from current 5.0 [16.011]) patch
PF_CKEYMAT0604, (created 4/4/2005)
 
 
 
FIBER CHANNEL CARDS
 
commands: ioscan, fcmsutil, dmesg
 
The 'fcmsutil' command is a diagnostic tool to be used for the TACHYON
A3740A, A3591A, A3404A, A3636A and TACHYON TL A5158A, A6684A, A6685A,
TACHYON XL2 A6795A, DUAL PORT A6826A and FC/GigE Combo A9782A/A9784A
Fiber Channel Host Bus Adapters. This command provides the ability to
perform Fibre Channel Test and Echo functionality, read the card's
registers, etc. This command requires the use of a device file to indicate
the interface over which the requested command needs to be performed.
 
Examples (from the man page):
Print a short listing of the statistics maintained by the driver, with
/dev/td1 as the device file.
 
     fcmsutil /dev/td1 stat -s
 
Send 5 echo packets of 200 bytes each to a remote N_Port with loop_id 4,
with /dev/td1 as the device file
 
           fcmsutil /dev/td1 echo -l 4 200 5
 
Print a short listing of the statistics of the device whose
remote-N-Port-ID is 0x02ae4 and with /dev/td1 as the device file
 
           fcmsutil /dev/td1 devstat 0x02ae4
 
Clear the device statistics of the device whose wwn is 0x100000e002219f45
and with /dev/td1 as the device file
 
           fcmsutil  /dev/td1 clear_devstat -w 0x100000e002219f45
 
Perform a Internal loopback test, sending 1000 packets with /dev/fcd2 as the
device file
 
           fcmsutil /dev/fcd2 lb int crpat 1000
 
You can use 'stm' to test a fiber card.
 
Example:
 
If you are getting replacement parts from a vendor, I would try and make
sure you can burn them in somewhere for at least a couple of days.  You
might also look into keeping track of any serial numbers or unique ids on
the parts.
 
 
 
Q4 Procedure for Analyzing System Crashes
 
 
 
Fault codes
 
            K class
 
The initial FAULT CBF0 means "High PRiority Machine Check occurred".
 
FLT 5007 means "Directed Error" on the Processor Memory Bus, with a =
Master ID of 0 (implies CPU 0) and processor memory bus 0. 
 
FLT 5408 means "Broad Error", master ID of 4 means on IOA 0 (system =
board).
 
FLT 5508 means "Broad Error", master ID of 5 means on IOA 1 (system =
board).
 
FLT CBFC means "PDCE HPMC processiung complete, failed to branch to OS =
HPMC, halt CPU", ACTION: FOllow the action for other HPMC codes beginning
with numerics
 
FLT CBF1 means "OS did not replace IVA". ACTION: FOllow the action for =
other HPMC codes beginning with numerics

--
             ---> Please post QUESTIONS and SUMMARIES only!! <---
        To subscribe/unsubscribe to this list, contact majordomo@dutchworks.nl
       Name: hpux-admin@dutchworks.nl     Owner: owner-hpux-admin@dutchworks.nl
 
 Archives:  ftp.dutchworks.nl:/pub/digests/hpux-admin       (FTP, browse only)
            http://www.dutchworks.nl/htbin/hpsysadmin   (Web, browse & search)

Next message: Ben Le: "[HPADM] HP Technology Forum Conference"
Previous message: johnny cochran: "[HPADM] coda template"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.7 : Sat Apr 12 2008 - 11:02:50 EDT