Tape drive problem.

From: Jim Fitzmaurice (jpfitz@fnal.gov)
Date: Thu Sep 26 2002 - 10:52:42 EDT


Managers,

    I have a three-system cluster consisting of one GS80 (Member1) and two
4100s (Member2 and Member3) running Tru64 v5.1 and TruCluster v5.1
PatchKit 5. I have one DLT4000 on each machine in the cluster, and Member2 (a
4100) has an additional DLT8000 in a tape library, which I use for backups.
This morning I came in to find that my backups had completely failed with:

/dev/tape/tape3_d1: No such device or address

    The only unusual activity on the cluster occurred yesterday morning.
HP/Compaq support had determined that the Memory Channel Adapter on Member3
(a 4100) was bad. Backups were still running on Member2 when the Field
Engineer arrived to replace the board. I took down Member3 and we replaced
the board; backups continued to run normally on Member2 and the cluster
remained up. After restoring Member3, I noticed that a GB Ethernet Adapter
was no longer working on that machine. The FE ordered a replacement, and
about 90 minutes later it arrived. I brought Member3 down again and we
replaced the GB Ethernet Adapter. Again the cluster continued to function and
backups continued to run normally. Member3 came back 100% this time, and
shortly after that the backups ran to their normal conclusion on Member2.

    This morning, however, backups failed and /dev/tape/tape3 is behaving
strangely.

    I can run "scu> show edt" and it sees the device:

 4 1 0 Sequential SCSI-2 QUANTUM DLT8000 0250 W

    I can run "hwmgr -view devices" and it shows up there too:

493: /dev/ntape/tape3 QUANTUM DLT8000 bus-4-targ-1-lun-0

    And the device files exist as well:

crw-rw-rw- 1 root system 19,2002 Jul 30 17:00 /dev/tape/tape3
crw-rw-rw- 1 root system 19,2006 Jul 30 17:00 /dev/tape/tape3_d0
crw-rw-rw- 1 root system 19,2008 Sep 26 09:28 /dev/tape/tape3_d1
crw-rw-rw- 1 root system 19,2010 Jul 30 17:01 /dev/tape/tape3_d2
crw-rw-rw- 1 root system 19,2012 Jul 30 17:01 /dev/tape/tape3_d3
crw-rw-rw- 1 root system 19,2014 Jul 30 17:01 /dev/tape/tape3_d4
crw-rw-rw- 1 root system 19,2016 Jul 30 17:01 /dev/tape/tape3_d5
crw-rw-rw- 1 root system 19,2018 Jul 30 17:01 /dev/tape/tape3_d6
crw-rw-rw- 1 root system 19,2020 Jul 30 17:01 /dev/tape/tape3_d7
crw-rw-rw- 1 root system 19,2004 Jul 30 17:00 /dev/tape/tape3c

    However, when I try to run any other command that actually accesses the
drive, it's not there:

> mt -f /dev/tape/tape3 status
/dev/tape/tape3: No such device or address
> tar -cvf /dev/tape/tape3_d1 /etc/brutab
tar: cannot open /dev/tape/tape3_d1: No such device or address
> dd if=/etc/brutab of=/dev/tape/tape3_d1
/dev/tape/tape3_d1: No such device or address
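
    The three failures above all come from the open() call on the device
file, so the same check can be repeated across every tape3 device file in
one pass. The following is only a minimal sketch, assuming a Python 3
interpreter is available on the affected member; the helper name probe() is
hypothetical and not part of any Tru64 tool. "No such device or address" is
the message for errno ENXIO, so a device in this state would be expected to
report ENXIO here. Note that opening and closing a rewind device may
reposition a loaded tape, so this is best run with no backup in progress.

```python
import errno
import glob
import os


def probe(paths):
    """Try to open each path read-only.

    Returns a dict mapping each path to "ok" if the open succeeded,
    or to the symbolic errno name (e.g. "ENXIO") if it failed.
    """
    results = {}
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError as exc:
            # Map the numeric errno to its symbolic name where known.
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[path] = "ok"
    return results


if __name__ == "__main__":
    # The glob pattern matches the device files listed in the post.
    for path, status in sorted(probe(glob.glob("/dev/tape/tape3*")).items()):
        print(f"{path}: {status}")
```

    If every tape3 device file reports the same error while other tape
devices on the same member open normally, that would suggest the problem is
specific to this one device rather than to a single density setting.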

    All cables have been reseated and the bus is properly terminated. I
also tried rebooting the library and the drive. Rebooting Member2 or the
entire cluster will not be feasible for at least a week. There is nothing
unusual in the messages log, and the last error in the binary error log was
yesterday afternoon: a disk error on a different target and a different LUN,
not this device.

    I realize that something done to one member of a cluster can affect the
other members, but I wouldn't think a couple of reboots to replace bad cards
could cause a device to behave like that. What could have happened? More
importantly, does anybody know how I can fix it?

    Any help would be greatly appreciated.

James Fitzmaurice
D0 Online Systems Manager
Fermi National Accelerator Laboratory
(630) 840-4011
jpfitz@fnal.gov

UNIX is very user friendly, It's just very particular about who it makes
friends with.


